grafana/agent

Vendor-neutral programmable observability pipelines.
https://grafana.com/docs/agent/
Apache License 2.0

Grafana Agent (Flow mode) stops sending logs to Loki #6868

Open · flyerhawk opened 2 months ago

flyerhawk commented 2 months ago

What's wrong?

Periodically, grafana-agent pods stop sending logs to Loki and need to be restarted before they resume sending.

Steps to reproduce

Occurs sporadically, usually on pods that log heavily.

System information

EKS 1.28

Software version

Grafana Agent 0.39.1, Helm chart 0.31.0

Configuration

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: grafana-agent
spec:
  releaseName: grafana-agent
  chart:
    spec:
      chart: grafana-agent
      sourceRef:
        kind: HelmRepository
        name: artifactory-helm-repo
        namespace: flux-system
      version: "0.31.0"
  interval: 1h0m0s
  values:
    apiVersion: v1
    ## Global properties for image pulling override the values defined under `image.registry` and `configReloader.image.registry`.
    ## If you want to override only one image registry, use the specific fields but if you want to override them all, use `global.image.registry`
    global:
      image:
        registry: jfrog
      pullSecrets:
        - regcred

      # -- Security context to apply to the Grafana Agent pod.
      podSecurityContext: {}

    crds:
      # -- Whether to install CRDs for monitoring.
      create: true

    # Various agent settings.
    configReloader:
      # -- Enables automatically reloading when the agent config changes.
      enabled: true
      image:
        # -- Tag of image to use for config reloading.
        tag: v0.8.0
    agent:
      # -- Mode to run Grafana Agent in. Can be "flow" or "static".
      mode: 'flow'
      configMap:
        # -- Create a new ConfigMap for the config file.
        create: false

      clustering:
        # -- Deploy agents in a cluster to allow for load distribution. Only
        # applies when agent.mode=flow.
        enabled: false

      # -- Enables sending Grafana Labs anonymous usage stats to help improve Grafana
      # Agent.
      enableReporting: false
    image:
      tag: v0.39.0
    controller:
      # -- Type of controller to use for deploying Grafana Agent in the cluster.
      # Must be one of 'daemonset', 'deployment', or 'statefulset'.
      type: 'daemonset'

      # -- Number of pods to deploy. Ignored when controller.type is 'daemonset'.
      #replicas: 4

      # -- Annotations to add to controller.
      extraAnnotations: {}

      autoscaling:
        # -- Creates a HorizontalPodAutoscaler for controller type deployment.
        enabled: false
        # -- The lower limit for the number of replicas to which the autoscaler can scale down.
        minReplicas: 1
        # -- The upper limit for the number of replicas to which the autoscaler can scale up.
        maxReplicas: 5
        # -- Average CPU utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetCPUUtilizationPercentage` to 0 will disable CPU scaling.
        targetCPUUtilizationPercentage: 0
        # -- Average Memory utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetMemoryUtilizationPercentage` to 0 will disable Memory scaling.
        targetMemoryUtilizationPercentage: 80
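
Note: since agent.configMap.create is false above, the actual Flow pipeline lives in an externally managed ConfigMap and is not part of this HelmRelease. For context, here is a minimal sketch of the kind of pipeline the component IDs in the logs below suggest (discovery.relabel.pod_logs, discovery.relabel.filtered_pod_logs, loki.source.kubernetes.pod_logs, loki.write.grafana_cloud_loki). The discovery component, the relabel rules, and the push URL path are assumptions for illustration, not the deployed config.

// Hedged reconstruction from the component IDs seen in the logs; not the real config.
discovery.kubernetes "pods" {
  role = "pod"
}

// First relabel pass over discovered pods (rules here are illustrative).
discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
}

// Second pass that labels/narrows the targets actually tailed.
discovery.relabel "filtered_pod_logs" {
  targets = discovery.relabel.pod_logs.output

  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
}

// Tails pod logs through the Kubernetes API (matches the tailer messages below).
loki.source.kubernetes "pod_logs" {
  targets    = discovery.relabel.filtered_pod_logs.output
  forward_to = [loki.write.grafana_cloud_loki.receiver]
}

// Ships to Loki; the host is taken from the error line below, the push path is assumed.
loki.write "grafana_cloud_loki" {
  endpoint {
    url = "https://logs.xtops.ue1.eexchange.com/loki/api/v1/push"
  }
}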

Logs

From Grafana Agent....


2024-04-11 19:13:13.757 ts=2024-04-11T23:13:13.75746748Z level=info msg="tailer exited" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs

2024-04-11 19:13:13.757 ts=2024-04-11T23:13:13.757432913Z level=warn msg="tailer stopped; will retry" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs err="client rate limiter Wait returned an error: context canceled"

2024-04-11 19:13:13.734 ts=2024-04-11T23:13:13.734468227Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.filtered_pod_logs duration=5.612367ms

2024-04-11 19:13:13.728 ts=2024-04-11T23:13:13.728808151Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.pod_logs duration=15.128878ms

2024-04-11 19:13:13.565 ts=2024-04-11T23:13:13.565594699Z level=warn msg="could not determine if container terminated; will retry tailing" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs err="pods \"fc-core-6bb8cc4995-wsp2d\" not found"

2024-04-11 19:13:13.364 ts=2024-04-11T23:13:13.364645639Z level=warn msg="tailer stopped; will retry" target=apps-mmf2/fc-core-6bb8cc4995-dfw7x:fc-core component=loki.source.kubernetes.pod_logs err="pods \"fc-core-6bb8cc4995-dfw7x\" not found"

2024-04-11 19:13:13.277 ts=2024-04-11T23:13:13.277872904Z level=info msg="opened log stream" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:13.246Z

2024-04-11 19:13:13.245 ts=2024-04-11T23:13:13.245083471Z level=info msg="opened log stream" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:13.215Z

2024-04-11 19:13:13.243 ts=2024-04-11T23:13:13.243761946Z level=warn msg="tailer stopped; will retry" target=apps-mmf2/fc-core-6bb8cc4995-dfw7x:fc-core component=loki.source.kubernetes.pod_logs err="pods \"fc-core-6bb8cc4995-dfw7x\" not found"

2024-04-11 19:13:13.214 ts=2024-04-11T23:13:13.214541615Z level=info msg="opened log stream" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:13.187Z

2024-04-11 19:13:13.186 ts=2024-04-11T23:13:13.18667988Z level=info msg="opened log stream" target=apps-mmf2/fc-core-6bb8cc4995-wsp2d:fc-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:12.313Z

2024-04-11 19:13:11.100 ts=2024-04-11T23:13:11.100140922Z level=error msg="final error sending batch" component=loki.write.grafana_cloud_loki component=client host=logs.xtops.ue1.eexchange.com status=400 tenant="" error="server returned HTTP status 400 Bad Request (400): entry for stream '{cluster=\"ufdc-eks01-1-28\", container=\"fc-core-adm\", env=\"eks-uat\", instance=\"apps-mmf2/fc-core-adm-68647f484b-wbxb9:fc-core-adm\", job=\"apps-mmf2/fc-core-adm-68647f484b-wbxb9\", namespace=\"apps-mmf2\", pod=\"fc-core-adm-68647f484b-wbxb9\", system=\"fc\"}' has timestamp too old: 2024-04-04T14:33:35Z, oldest acceptable timestamp is: 2024-04-04T23:13:11Z"

2024-04-11 19:13:08.767 ts=2024-04-11T23:13:08.767149881Z level=info msg="finished node evaluation" controller_id="" node_id=loki.source.kubernetes.pod_logs duration=32.254832ms

2024-04-11 19:13:08.734 ts=2024-04-11T23:13:08.734838869Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.filtered_pod_logs duration=6.112792ms

2024-04-11 19:13:08.728 ts=2024-04-11T23:13:08.728672495Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.pod_logs duration=15.306976ms

2024-04-11 19:13:06.588 ts=2024-04-11T23:13:06.588374162Z level=info msg="opened log stream" target=apps-etf2/fc-etf-core-5d8645898-xrzzv:fc-etf-core component=loki.source.kubernetes.pod_logs "start time"=2024-04-11T23:13:06.562Z

2024-04-11 19:13:06.563 ts=2024-04-11T23:13:06.562995551Z level=warn msg="tailer stopped; will retry" target=apps-etf2/fc-etf-core-5d8645898-xrzzv:fc-etf-core component=loki.source.kubernetes.pod_logs err="http2: response body closed"

2024-04-11 19:13:06.563 ts=2024-04-11T23:13:06.562911538Z level=info msg="have not seen a log line in 3x average time between lines, closing and re-opening tailer" target=apps-etf2/fc-etf-core-5d8645898-xrzzv:fc-etf-core component=loki.source.kubernetes.pod_logs rolling_average=2s time_since_last=6.476935385s

From a pod....

2024-04-11 19:13:13.230 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.230 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.211 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.211 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.211 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.207 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.205 unable to retrieve container logs for containerd://d21d6a29116eeea447bfc16543c1a7dead3cccd8116cce8beca64f70b6ee1537
2024-04-11 19:13:13.201 failed to try resolving symlinks in path "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": lstat /var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log: no such file or directory
2024-04-11 19:13:13.179 failed to watch file "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": no such file or directory
2024-04-11 19:13:13.178 failed to watch file "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": no such file or directory
2024-04-11 19:13:13.177 failed to watch file "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": no such file or directory
2024-04-11 19:13:13.177 failed to watch file "/var/log/pods/apps-mmf2_fc-core-6bb8cc4995-dfw7x_519eb8db-51d9-483b-8991-5e66c2c8b4ee/fc-core/0.log": no such file or directory
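
One more data point on the HTTP 400 above: the rejected entry is from 2024-04-04T14:33:35Z and the "oldest acceptable timestamp" is exactly seven days before the send time, which is consistent with Loki's default one-week reject_old_samples_max_age. If shipping such stale lines is not wanted, below is a hedged sketch of a client-side guard, assuming a loki.process stage wired between loki.source.kubernetes and loki.write; the component label and the threshold are illustrative, and this does not address the stalled tailers themselves.

// Hedged sketch: drop entries older than Loki's acceptance window before they are sent.
// Point loki.source.kubernetes.pod_logs forward_to at this component's receiver instead
// of loki.write directly. The label and the 167h threshold are illustrative assumptions.
loki.process "drop_stale" {
  forward_to = [loki.write.grafana_cloud_loki.receiver]

  stage.drop {
    older_than          = "167h"
    drop_counter_reason = "line_too_old"
  }
}
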
github-actions[bot] commented 1 month ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!