Logs are not forwarded to Loki after target docker container is restarted

dhorbach commented 9 months ago

What's wrong?

After restarting one of the containers from which logs are scraped, new logs are no longer processed. Grafana Agent is installed via the system package on Amazon Linux. Reloading the service configuration doesn't help.

https://github.com/grafana/loki/issues/5259 might be relevant

Steps to reproduce

Restart one of the containers that produces logs - "docker restart container-name"
Observe that new logs are created - "docker logs container-name"
Check the Grafana Loki source - no new logs appear after the container was stopped
Restart the agent - "sudo systemctl restart grafana-agent-flow.service"
Check the Grafana Loki source again - new logs are pushed

System information

Linux ip-10-10-101-40.ec2.internal 6.1.66-93.164.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Jan 2 23:50:53 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Software version

agent, version v0.39.0 (branch: HEAD, revision: 402672cb)

Configuration

logging {
  level = "warn"
}

local.file "api_key" {
  filename  = "/api-key"
  is_secret = true
}

discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

discovery.relabel "default" {
  targets = discovery.docker.containers.targets

  rule {
    action = "labelmap"
    regex = "__meta_docker_container_label_([^_]+)"
  }
  rule {
    action = "replace"
    source_labels = ["__meta_docker_container_name"]
    target_label = "container"
  }
}

loki.source.docker "default" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.relabel.default.output
  forward_to = [loki.write.local.receiver]
}

loki.write "local" {
  endpoint {
    url = "***"
    basic_auth {
      username = ***
      password = local.file.api_key.content
    }
  }
}

Logs

No response

tpaschalis commented 9 months ago

Thank you for the report @dhorbach, we'll look into this.

When we forked Promtail code into the Agent's components, we chose a slightly different way to schedule targets as tasks, and we might have missed something here.

rfratto commented 9 months ago

We've done some preliminary investigation into this, and found that the issue seems to be that there is no mechanism for a container to be re-tailed after the log stream closes.

This is composed of two smaller issues:

The target stops entirely after the container exits. However, because a restarted Docker container keeps the same container ID, the target should probably stop only when the container is removed.
The function called processLoop doesn't actually do any looping; this means that once the log stream exits, we don't attempt to open a second one (e.g., when the container restarts).

The reason this happens with Docker and not with Kubernetes is mainly because the target is the same after a container restarts with Docker, but this isn't true with Kubernetes, where the ID changes.

We haven't identified the right way to fix the bug yet, but we do have a fair amount of confidence that modifying those two pieces above will lead us to the fix.

fredsig commented 9 months ago

Just stumbled into this issue as well. I'm planning to replace node exporter, cAdvisor and promtail with Grafana Agent on hosts with Docker discovery, and this is a showstopper for now (as we want to capture logs after Docker containers crash or restart normally).

rafaelmagu commented 8 months ago

I believe this issue is what I'm encountering when the Docker hosts are replaced (as part of an autoscaling ECS cluster, for example). The Grafana Agent (flow mode) starts first, then Docker, and, until the agent is restarted, no logs flow. Strangely, metrics work fine, and there's nothing in the agent's logs to indicate it isn't working properly.

rfratto commented 6 months ago

Hi there :wave:

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)
