grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Agent's reload endpoint keeps getting timeout #300

Open Ccccclong opened 11 months ago

Ccccclong commented 11 months ago

What's wrong?

The POST /-/reload endpoint, which config-reloader calls periodically, randomly starts to time out, and all subsequent calls to the endpoint then time out indefinitely. This causes the Grafana Agent to stop collecting logs from new pods.

Steps to reproduce

The issue tends to appear more quickly if you call the POST /-/reload endpoint manually at a high frequency, e.g. 100 calls/second, as in the sketch below.
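
A minimal sketch of such a reproduction loop, assuming the agent's HTTP server is reachable at localhost:12345 (the address is an assumption; substitute your deployment's server address):

# repro.go (hypothetical reproduction script)
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Assumed address; point this at the agent container's HTTP server.
	const target = "http://localhost:12345/-/reload"
	client := &http.Client{Timeout: 5 * time.Second}

	for i := 0; i < 1000; i++ {
		resp, err := client.Post(target, "", nil)
		if err != nil {
			// Once the endpoint wedges, every call from here on fails like this.
			fmt.Printf("call %d failed: %v\n", i, err)
			continue
		}
		resp.Body.Close()
		time.Sleep(10 * time.Millisecond) // roughly 100 calls/second
	}
}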

System information

Ubuntu 22.04 x86_64

Software version

Grafana Agent Operator Helm Chart 0.2.15, on RKE2 Cluster 1.25.3

Configuration

# Helm values

resources:
  requests:
    cpu: '100m'
    memory: '128Mi'
  limits: {}

Logs

# From kubectl logs pod/monitoring-logs-gpfmf --container=grafana-agent

ts=2023-11-23T02:31:09.417638382Z caller=filetarget.go:229 level=debug component=logs logs_config=argocd/monitoring msg="no files matched requested path, nothing will be tailed" path=/var/log/pods/*60fd8ad1-bd19-42a7-be1e-53db0589d3cd/etcd/*.log pathExclude=
ts=2023-11-23T02:31:09.417716454Z caller=tailer.go:202 level=info component=logs logs_config=argocd/monitoring component=tailer msg="skipping update of position for a file which does not currently exist" path=/var/log/pods/kube-system_descheduler-28344144-gp2l6_2ec2b836-7580-473b-b590-b8797bd6c6cf/descheduler/0.log
ts=2023-11-23T02:31:09.417759773Z caller=filetarget.go:229 level=debug component=logs logs_config=argocd/monitoring msg="no files matched requested path, nothing will be tailed" path=/var/log/pods/*15d5d0d5-ee92-4879-9092-f23787dfb15e/cloud-controller-manager/*.log pathExclude=
rfratto commented 11 months ago

This is likely due to a deadlock. Once the requests start timing out, can you share goroutine dumps from /debug/pprof/goroutine?debug=1 and /debug/pprof/goroutine?debug=2 on the agent container's HTTP server? Having both dumps (debug=2 has more detail but can be more tedious to read) will help us track it down.
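
A sketch of capturing both dumps in one go, assuming the debug endpoints are reachable at localhost:12345 (the address is an assumption, e.g. after a kubectl port-forward to the agent container):

# dump.go (hypothetical helper for collecting both goroutine dumps)
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Assumed base URL; adjust for your port-forward or service address.
	const base = "http://localhost:12345/debug/pprof/goroutine?debug="
	for _, level := range []string{"1", "2"} {
		resp, err := http.Get(base + level)
		if err != nil {
			log.Fatalf("fetch debug=%s: %v", level, err)
		}
		out, err := os.Create("goroutine-debug-" + level + ".txt")
		if err != nil {
			log.Fatal(err)
		}
		if _, err := io.Copy(out, resp.Body); err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()
		out.Close()
	}
}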

Ccccclong commented 10 months ago

Sorry for the late response. This problem happens randomly, and it hasn't appeared in the past few weeks.

Additionally, these logs are from Grafana Agent Helm chart version 0.3.11; I recently upgraded in the hope that it would resolve the issue.

Thanks for the help.

grafana-agent-debug-log-1.txt grafana-agent-debug-log-2.txt

mattdurham commented 9 months ago

Are you ensuring the previous reload was complete before calling the next?
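
For reference, one way to guarantee that is to issue reloads strictly sequentially, waiting for each response (or timeout) before starting the next call. A minimal sketch, with the target address and interval as assumptions:

# serial_reload.go (hypothetical; serializes reload calls)
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Assumed address and interval; adjust for your deployment.
	const target = "http://localhost:12345/-/reload"
	client := &http.Client{Timeout: 30 * time.Second}

	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	// The loop body is synchronous, so a new reload is never started
	// until the previous one has returned or timed out.
	for range ticker.C {
		resp, err := client.Post(target, "", nil)
		if err != nil {
			log.Printf("reload failed: %v", err)
			continue
		}
		resp.Body.Close()
	}
}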

github-actions[bot] commented 8 months ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!

rfratto commented 6 months ago

Hi there :wave:

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only receive bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)