tpaschalis opened this issue 2 years ago
Here's a short reproduction on main (currently at https://github.com/grafana/agent/commit/2f4b5b37ac470fa5b0ebecb039495d8d397b711e). I'm running go1.18 on darwin/arm64.

1. Start the Agent with the default `agent-local-config.yaml` (but with the config name changed to 'default', otherwise the Agent integration complains):

```yaml
server:
  log_level: info

metrics:
  global:
    scrape_interval: 1m
  configs:

integrations:
  agent:
    instance: "foo"
```

Wait for at least one unsuccessful push to happen. It will emit logs like

```
ts=2022-04-20T11:47:25.870629Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=warn remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="Failed to send batch, retrying" err="Post \"http://localhost:9009/api/prom/push\": dial tcp [::1]:9009: connect: connection refused"
```
2. Change the config to include a 'real' remote_write endpoint, like

```yaml
server:
  log_level: info

metrics:
  global:
    scrape_interval: 1m
  configs:

integrations:
  agent:
    instance: "foo"
```
3. Save and reload the config by calling the `/-/reload` endpoint. The command hangs, and the Agent panics after about 60 seconds with the stack trace presented above.
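For convenience, here's a minimal Go sketch of triggering the reload step programmatically instead of with curl (not part of the original repro; the port 12345 mirrors the `curl` command quoted later in this issue, and the 90-second client timeout is an arbitrary value chosen to outlast the observed hang):

```go
// Minimal sketch (not part of the original repro): trigger the Agent's
// /-/reload endpoint from Go instead of curl. The port 12345 and the plain
// GET mirror the curl command quoted later in this issue; the 90s client
// timeout is an arbitrary value chosen to outlast the observed hang.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 90 * time.Second}

	resp, err := client.Get("http://localhost:12345/-/reload")
	if err != nil {
		// If the Agent panics mid-reload, the connection drops and we land here.
		fmt.Println("reload failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println("reload returned:", resp.Status, string(body))
}
```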
I tried the same with `node_exporter` but couldn't reproduce the issue, so it may have to do with the way we set up the v2 Agent integration.
**Edit:** The nil pointer is on the `wal` field of the metrics instance. It might be that there's not enough time to re-initialize the WAL after the reload, before an autoscrape kicks in?
**Edit 2:** Disabling autoscrape and thus giving it a moment to replay the WAL makes it work.
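To make the suspected sequence concrete, here's a small, self-contained Go sketch of that failure mode; the `instance` and `wal` types below are illustrative stand-ins, not the Agent's actual code.

```go
// Illustrative sketch of the suspected failure mode; the types below are
// stand-ins, not the Agent's actual code. If the autoscraper asks the metrics
// instance for an appender before the reloaded instance has finished
// replaying its WAL, the nil wal field gets dereferenced and the process
// panics.
package main

import "fmt"

type wal struct {
	dir string
}

func (w *wal) appender() string {
	return "appender for " + w.dir // dereferences w: panics if w is nil
}

type instance struct {
	wal *wal // nil until WAL replay completes after a reload
}

// appender mirrors the problematic call path: it forwards straight to the
// WAL without checking whether replay has finished yet.
func (i *instance) appender() string {
	return i.wal.appender()
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			// Prints: runtime error: invalid memory address or nil pointer dereference
			fmt.Println("autoscrape panicked:", r)
		}
	}()

	i := &instance{} // freshly reloaded instance, WAL not yet replayed
	fmt.Println(i.appender())
}
```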
I initially thought this was because `ci == current` is true in the case where only the referenced metrics config changed (it's not part of the compare being done and probably should be). I confirmed for the provided config that this evaluates to true in the reload scenario.

I then ran a test where I updated the integration instance value from "agent" -> "agent2" and confirmed in a debugger that the code path changed. It called `w.stop()` as expected, but the panic still happens despite stopping the original worker.

I'm not sure of the root cause yet, but I feel like that first compare is still wrong, since it doesn't consider the referenced metrics config changing in addition to the integration config.
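As a toy illustration of that comparison gap (the struct and field names below are hypothetical, not the Agent's real types): if the compare only looks at the integration's own config, a change confined to the referenced metrics config evaluates as unchanged and the old worker is left running.

```go
// Toy illustration of the comparison gap; struct and field names are
// hypothetical, not the Agent's real types. If the reload check only compares
// the integration's own config, a change confined to the referenced metrics
// config still evaluates as "unchanged" and the old worker keeps running.
package main

import (
	"fmt"
	"reflect"
)

type integrationConfig struct {
	Instance string
}

type metricsConfig struct {
	RemoteWriteURL string
}

type managedIntegration struct {
	cfg     integrationConfig
	metrics metricsConfig
}

// unchanged mimics a compare that only looks at the integration config and
// ignores the metrics config it references.
func unchanged(current, next managedIntegration) bool {
	return reflect.DeepEqual(current.cfg, next.cfg)
}

func main() {
	current := managedIntegration{
		cfg:     integrationConfig{Instance: "agent"},
		metrics: metricsConfig{RemoteWriteURL: "http://localhost:9009/api/prom/push"},
	}

	next := current
	next.metrics.RemoteWriteURL = "https://example.grafana.net/api/prom/push" // only the metrics config changed

	fmt.Println(unchanged(current, next)) // true: the reload takes the "no change" path
}
```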
I've attempted a number of ways to stop the scraper during the reload, with no success yet. The Prometheus scrape loop that calls the following still executes after the instance returns nil.

I just saw @tpaschalis's note above that it may be the new autoscraper starting too fast, so I'll approach it from that angle next.
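A rough, self-contained sketch of the kind of timing window being described (names, intervals, and locking are made up for illustration, not the Agent's real scrape code): the scrape loop ticks on its own, so it can fire in the gap where the old storage has been torn down but the new one isn't ready yet.

```go
// Rough sketch of the timing window: the scrape loop runs on its own ticker,
// so it can fire after the reload has removed the old storage but before the
// new one is initialized. All names and intervals are illustrative only.
package main

import (
	"fmt"
	"sync"
	"time"
)

type storage struct{ name string }

type instance struct {
	mu  sync.Mutex
	wal *storage
}

func (i *instance) scrapeOnce() {
	i.mu.Lock()
	defer i.mu.Unlock()
	if i.wal == nil {
		// This is the window in which the real Agent hits the nil WAL.
		fmt.Println("scrape fired while storage is uninitialized")
		return
	}
	fmt.Println("scraped into", i.wal.name)
}

func main() {
	inst := &instance{wal: &storage{name: "old-wal"}}
	done := make(chan struct{})

	// Scrape loop: ticks independently of the reload below.
	go func() {
		ticker := time.NewTicker(10 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				inst.scrapeOnce()
			}
		}
	}()

	time.Sleep(25 * time.Millisecond)

	// "Reload": the old WAL is stopped and the replacement hasn't replayed yet.
	inst.mu.Lock()
	inst.wal = nil
	inst.mu.Unlock()

	time.Sleep(25 * time.Millisecond)
	close(done)
}
```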
@erikbaranowski The issue here is likely due to a slow WAL replay time, where an autoscrape happens while a WAL instance is still replaying and isn't initialized yet.
I'm removing my assignment (for now) so I'm not blocking on this one. With PTO and other priorities this is temporarily on pause for me.
Hi there :wave:
On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.
To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)
I was testing grafana/agent#1478 with a dummy remote_write endpoint, when the Agent suddenly panicked and shut down.
Incidentally, it was around the same time I was trying to reload a config file, when `curl localhost:12345/-/reload` hung; after a while the command returned and the Agent was down. Before and after that, all calls to `/-/reload` worked fine, so I'm not sure whether it was related or a one-off occurrence. I'll investigate this further as I wasn't able to reproduce it under the same conditions, but I'm leaving the stack trace here for future reference and for others.
**Edit:** As far as I could tell, this happened with the default `agent-local-config.yaml` file, as I was replacing the dummy remote_write with my Grafana Cloud stack's details.