grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

`loki.source.file` restarts all tail routines #357

Open Jeinhaus opened 1 year ago

Jeinhaus commented 1 year ago

What's wrong?

We use the loki.source.file component to read pod logs in our EKS cluster. The config (see below) is similar to the setup described in the documentation, and so far it works great. However, we noticed that whenever a pod is created or destroyed in the cluster, Grafana Agent stops all existing tail routines and starts them again (see logs below). This produces around 1k log lines per pod change (depending on the number of pods in the cluster, I guess).

Is it intended behavior that all tail routines are restarted, or did we misconfigure something? If it is intended, we might just stop logging info-level messages to avoid the high log volume.

Steps to reproduce

Start or stop a pod in the cluster and observe the log volume of grafana-agent.

System information

5.10.186-179.751.amzn2.x86_64

Software version

grafana/agent:v0.36.1

Configuration

logging {
  level  = "info"
  format = "logfmt"
}

discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_pod_controller_name"]
    regex = "([0-9a-z-.]+?)(-[0-9a-f]{8,10})?"
    action = "replace"
    target_label = "__tmp_controller_name"
  }

  rule {
    source_labels = [
      "__meta_kubernetes_pod_label_app_kubernetes_io_name",
      "__meta_kubernetes_pod_label_app",
      "__tmp_controller_name",
      "__meta_kubernetes_pod_name",
    ]
    // This matches the first of the above labels that is not empty
    // and writes the value to target_label.
    regex = "^;*([^;]+)(;.*)?$"
    action = "replace"
    target_label = "app"
  }

  rule {
    source_labels = [
      "__meta_kubernetes_pod_label_app_kubernetes_io_instance",
      "__meta_kubernetes_pod_label_instance",
    ]
    // This matches the first of the above labels that is not empty
    // and writes the value to target_label.
    regex = "^;*([^;]+)(;.*)?$"
    action = "replace"
    target_label = "instance"
  }

  rule {
    source_labels = [
      "__meta_kubernetes_pod_label_app_kubernetes_io_component",
      "__meta_kubernetes_pod_label_component",
    ]
    // This matches the first of the above labels that is not empty
    // and writes the value to target_label.
    regex = "^;*([^;]+)(;.*)?$"
    action = "replace"
    target_label = "component"
  }

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label = "namespace"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label = "pod"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label = "container"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_node_name"]
    target_label = "node_name"
  }

  rule {
    source_labels = ["namespace", "app"]
    target_label  = "job"
    separator     = "/"
  }

  rule {
    source_labels = [
      "__meta_kubernetes_pod_uid",
      "__meta_kubernetes_pod_container_name",
    ]
    target_label  = "__path__"
    separator     = "/"
    replacement   = "/var/log/pods/*$1/*.log"
  }
}

// loki.source.file does not support glob expressions, so we need local.file_match
local.file_match "pods" {
  path_targets = discovery.relabel.pods.output
}

loki.source.file "pods" {
  targets    = local.file_match.pods.targets
  forward_to = [loki.relabel.default.receiver]
}

loki.relabel "default" {
  // We should drop this label, because it will be rarely used for querying
  // and keeping dynamic label count low improves performance, see
  // https://grafana.com/docs/loki/latest/best-practices/#use-dynamic-labels-sparingly
  rule {
    action = "labeldrop"
    regex = "filename"
  }

  forward_to = [loki.process.default.receiver]
}

loki.process "default" {
  forward_to = [loki.write.default.receiver]

  stage.cri {}
}

// This gathers Kubernetes events and writes them to loki.
// You can query for them using e.g.
// {job="loki.source.kubernetes_events"} |= `` | logfmt
loki.source.kubernetes_events "default" {
  namespaces = []
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki-gateway:80/loki/api/v1/push"
  }
}

Logs

2023-09-07 10:40:55.642 
ts=2023-09-07T08:40:55.642655392Z component=loki.source.file.pods level=info msg="Seeked /var/log/pods/kube-system_efs-csi-node-htnnd_b866d65d-ee74-4ec4-b2c1-26775a373f1b/csi-driver-registrar/0.log - &{Offset:1638 Whence:0}"    
2023-09-07 10:40:55.642 
ts=2023-09-07T08:40:55.642147593Z level=info component=loki.source.file.pods component=tailer msg="tail routine: started" path=/var/log/pods/kube-system_efs-csi-node-htnnd_b866d65d-ee74-4ec4-b2c1-26775a373f1b/efs-plugin/0.log   
2023-09-07 10:40:55.642 
ts=2023-09-07T08:40:55.642115103Z component=loki.source.file.pods level=info msg="Seeked /var/log/pods/kube-system_efs-csi-node-htnnd_b866d65d-ee74-4ec4-b2c1-26775a373f1b/efs-plugin/0.log - &{Offset:1420 Whence:0}"  
2023-09-07 10:40:55.641 
ts=2023-09-07T08:40:55.641885169Z level=info component=loki.source.file.pods component=tailer msg="tail routine: started" path=/var/log/pods/kube-system_ebs-csi-node-hk4mk_a955ed23-177f-4190-8c09-74f5698b0f39/liveness-probe/0.log   
2023-09-07 10:40:55.641 
ts=2023-09-07T08:40:55.641793327Z component=loki.source.file.pods level=info msg="Seeked /var/log/pods/kube-system_ebs-csi-node-hk4mk_a955ed23-177f-4190-8c09-74f5698b0f39/liveness-probe/0.log - &{Offset:364 Whence:0}"   
2023-09-07 10:40:55.641 
ts=2023-09-07T08:40:55.641494872Z level=info component=loki.source.file.pods component=tailer msg="tail routine: started" path=/var/log/pods/kube-system_ebs-csi-node-hk4mk_a955ed23-177f-4190-8c09-74f5698b0f39/node-driver-registrar/0.log

...
...
...

2023-09-07 10:40:55.627 
ts=2023-09-07T08:40:55.627195325Z level=info component=loki.source.file.pods component=tailer msg="position timer: exited" path=/var/log/pods/monitoring_grafana-agent-node-qn2zk_93709431-2cf5-4a15-af5a-64409b63797b/config-reloader/0.log    
2023-09-07 10:40:55.627 
ts=2023-09-07T08:40:55.627121824Z level=info component=loki.source.file.pods component=tailer msg="stopped tailing file" path=/var/log/pods/kube-system_aws-node-ddqsq_2cc1f045-5d53-416d-a50d-f99dc0cb29bc/aws-vpc-cni-init/0.log  
2023-09-07 10:40:55.627 
ts=2023-09-07T08:40:55.627058303Z level=info component=loki.source.file.pods component=tailer msg="tail routine: exited" path=/var/log/pods/kube-system_aws-node-ddqsq_2cc1f045-5d53-416d-a50d-f99dc0cb29bc/aws-vpc-cni-init/0.log
tpaschalis commented 1 year ago

Hey there 👋

Yes, whenever local.file_match returns a new set of targets, the loki.source.file component restarts all of its readers. We've had a TODO to look into this for a while now.

Could you share a CPU profile so we can check whether this has any performance impact, or is it just the extra logs that are annoying?

Jeinhaus commented 1 year ago

Hi, thanks for the quick response. Good to know that you already have a TODO for this.

Sorry, I don't have a CPU profile ready. The performance impact is minimal (if there is any at all), as far as I can tell from our Prometheus metrics.

For now, I think we'll just live with the extra logs, or raise the log level if the volume becomes too much.
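Concretely, that would just mean bumping the level in the existing logging block, something like this (a minimal sketch; warn is one possible choice):

logging {
  level  = "warn"
  format = "logfmt"
}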

Thank you for your help.

rfratto commented 1 year ago

Given that this is currently expected behavior, I'm changing this from a bug to an enhancement to remove the annoying log volume.

github-actions[bot] commented 1 year ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!

aglees commented 5 months ago

Any movement on this enhancement? Setting the logging level to warn isn't great; otherwise I think we're left with dropping these log lines in Alloy.
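For anyone considering the drop route, here is a minimal sketch of the kind of filter I mean, assuming the agent's own pod logs flow through the loki.process "default" pipeline from the config above and that its pods carry app="grafana-agent"; the selector and the match expression are illustrative only:

loki.process "default" {
  forward_to = [loki.write.default.receiver]

  stage.cri {}

  // Only inspect lines coming from the agent's own pods.
  stage.match {
    selector = "{app=\"grafana-agent\"}"

    // Drop the tailer churn messages emitted whenever the target set changes.
    stage.drop {
      expression          = "(tail routine: (started|exited)|stopped tailing file|position timer: exited|Seeked /var/log/pods/)"
      drop_counter_reason = "tailer_restart_noise"
    }
  }
}

Scoping the drop inside a stage.match keeps it limited to the agent's own log stream, so similar-looking lines from other workloads are not dropped by accident.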