grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Panic when reloading with changed remote_write config #542

Open tpaschalis opened 2 years ago

tpaschalis commented 2 years ago

I was testing grafana/agent#1478 with a dummy remote_write endpoint, when the Agent suddenly panicked and shut down.

Incidentally, it was around the same time I was trying to reload a config file when curl localhost:12345/-/reload hung; after a while the command returned and the Agent was down. Before and after that, all calls to /-/reload worked fine, so I'm not sure whether it was related or a one-off occurrence.

I'll investigate this further as I wasn't able to reproduce this under the same conditions, but I'm leaving the stack trace here for future reference and for others.

ts=2022-04-19T09:42:47.020614397Z caller=gokit.go:65 level=debug msg="GET /agent/api/v1/metrics/integrations/sd?instance=ebpf&integrations=ebpf (200) 181.043µs"
ts=2022-04-19T09:42:50.709134329Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=error remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="Failed to flush metadata"
ts=2022-04-19T09:42:50.70948193Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=error remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="non-recoverable error while sending metadata" count=178 err="context canceled"
ts=2022-04-19T09:42:50.709647309Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=info remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="Scraped metadata watcher stopped"
ts=2022-04-19T09:42:50.710576469Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=info remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="Remote storage stopped."
ts=2022-04-19T09:42:50.711054063Z caller=manager.go:271 level=info agent=prometheus msg="stopped instance" instance=1441107aae5b2bac39bf41c80aa47c88
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x26d30ec]

goroutine 276 [running]:
github.com/grafana/agent/pkg/metrics/instance.(*Instance).Appender(0xc00024a100?, {0x52744b0?, 0xc000696800?})
    /root/agent/pkg/metrics/instance/instance.go:568 +0x2c
github.com/grafana/agent/pkg/integrations/v2/autoscrape.(*agentAppender).Appender(0xc00014bce0, {0x52744b0, 0xc000696800})
    /root/agent/pkg/integrations/v2/autoscrape/autoscrape.go:248 +0xa3
github.com/prometheus/prometheus/scrape.newScrapePool.func2.3({0x52744b0?, 0xc000696800?})
    /root/go/pkg/mod/github.com/grafana/prometheus@v1.8.2-0.20220413182558-6b32d0b957c5/scrape/scrape.go:311 +0x35
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport(0xc0007bf380, {0xed9f07abf?, 0x7746400?, 0x7746400?}, {0xed9f07abf?, 0x7746400?, 0x7746400?}, 0x0)
    /root/go/pkg/mod/github.com/grafana/prometheus@v1.8.2-0.20220413182558-6b32d0b957c5/scrape/scrape.go:1257 +0x3bc
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0xc0007bf380, 0xc000bda480?)
    /root/go/pkg/mod/github.com/grafana/prometheus@v1.8.2-0.20220413182558-6b32d0b957c5/scrape/scrape.go:1216 +0x345
created by github.com/prometheus/prometheus/scrape.(*scrapePool).sync
    /root/go/pkg/mod/github.com/grafana/prometheus@v1.8.2-0.20220413182558-6b32d0b957c5/scrape/scrape.go:587 +0xa1e

Edit: as far as I could tell, this happened with the default agent-local-config.yaml file, as I was replacing the dummy remote_write with my Grafana Cloud stack's details.

tpaschalis commented 2 years ago

Here's a short reproduction on main (currently on https://github.com/grafana/agent/commit/2f4b5b37ac470fa5b0ebecb039495d8d397b711e). I'm running go1.18 darwin/arm64.

  1. Check out and build the Agent. Run it with integrations-next enabled and the following config (basically agent-local-config.yaml, but with the config name changed to 'default', otherwise the Agent integration complains):

     server:
       log_level: info

     metrics:
       global:
         scrape_interval: 1m
       configs:

     integrations:
       agent:
         instance: "foo"

     Wait for at least one unsuccessful push to happen. It will emit logs like

ts=2022-04-20T11:47:25.870629Z caller=dedupe.go:112 agent=prometheus instance=1441107aae5b2bac39bf41c80aa47c88 component=remote level=warn remote_name=144110-88a234 url=http://localhost:9009/api/prom/push msg="Failed to send batch, retrying" err="Post \"http://localhost:9009/api/prom/push\": dial tcp [::1]:9009: connect: connection refused"


  2. Change the config to include a 'real' remote_write endpoint, like:

     server:
       log_level: info

     metrics:
       global:
         scrape_interval: 1m
       configs:

     integrations:
       agent:
         instance: "foo"

  3. Save and reload the config by calling the /-/reload endpoint. The command hangs, and the Agent panics after about 60 seconds with the stack trace presented above.
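For completeness, here is a minimal sketch of triggering the reload from code instead of curl, assuming the Agent's HTTP server listens on the default localhost:12345 seen in the logs above; the generous client timeout is only an illustration so the hang surfaces as an error instead of blocking forever:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Generous timeout so the hanging reload eventually returns an error.
	client := &http.Client{Timeout: 90 * time.Second}

	resp, err := client.Get("http://localhost:12345/-/reload")
	if err != nil {
		fmt.Println("reload failed or timed out:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("reload returned:", resp.Status)
}
```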

I tried with node_exporter, but couldn't reproduce the same issue, so it may have to do with the way we set up the v2 Agent integration.

**Edit:** The nil pointer is on the `wal` field of the metrics instance. It might be that there's not enough time to re-initialize the WAL after the reload before an autoscrape kicks in?

**Edit 2:** Disabling autoscrape and thus giving it a moment to replay the WAL makes it work.
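For context on that nil pointer, here is a minimal sketch of the failure shape and of the kind of guard this suggests. The types and field names below are illustrative stand-ins, not the Agent's actual instance code:

```go
package main

import "fmt"

// Illustrative stand-ins for the Agent's metrics instance and its WAL-backed
// storage; the names and shapes here are assumptions, not the real code.
type walStorage struct{}

func (w *walStorage) Appender() string { return "appender backed by the WAL" }

type instance struct {
	wal *walStorage // nil until the WAL has been (re)initialized after a reload
}

// Appender without a guard: dereferencing a nil wal panics, which is the
// shape of the crash in the stack trace above.
func (i *instance) Appender() string {
	return i.wal.Appender()
}

// appenderGuarded shows the guard being discussed: report "not ready" while
// WAL replay/initialization is still in progress instead of dereferencing nil.
func (i *instance) appenderGuarded() (string, bool) {
	if i.wal == nil {
		return "", false
	}
	return i.wal.Appender(), true
}

func main() {
	inst := &instance{} // wal is nil, as right after a reload
	if _, ok := inst.appenderGuarded(); !ok {
		fmt.Println("instance not ready: WAL still replaying")
	}
}
```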
github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity. Thank you for your contributions!

erikbaranowski commented 1 year ago

I initially thought this was because `ci == current` evaluates to true when only the referenced metrics config has changed (the metrics config isn't part of the comparison being done, and probably should be). I confirmed that for the provided config this evaluates to true in the reload scenario.

https://github.com/grafana/agent/blob/cc2671502b6452005abbdbf31e9a6645069a67b1/pkg/integrations/v2/workers.go#L47

I then ran a test where I updated the integration instance value from "agent" -> "agent2" and confirmed in a debugger that the code path changed. As expected, it called w.stop(), but the panic still happened despite stopping the original worker.

https://github.com/grafana/agent/blob/cc2671502b6452005abbdbf31e9a6645069a67b1/pkg/integrations/v2/workers.go#L53

I'm not sure of the root cause, but I think that first comparison is still wrong, since it doesn't consider the referenced metrics config changing in addition to the integration config.
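To illustrate that point about the comparison, here is a hedged sketch; the types and helpers below are stand-ins, not the actual workers.go code. If the equality check only covers the integration's own config, a change to the metrics config it references compares as "unchanged" and the old worker is kept:

```go
package main

import (
	"fmt"
	"reflect"
)

// Stand-in config types; the real controller compares richer structs.
type integrationConfig struct {
	Name     string
	Instance string
}

type metricsConfig struct {
	RemoteWriteURL string
}

type worker struct {
	integration integrationConfig
	metrics     metricsConfig
}

// unchangedIntegrationOnly mirrors the suspect comparison: only the
// integration's own config is considered, so a remote_write change is invisible.
func unchangedIntegrationOnly(old, next worker) bool {
	return reflect.DeepEqual(old.integration, next.integration)
}

// unchangedWithMetrics also folds in the referenced metrics config, so a
// reload that only changes remote_write no longer compares as "unchanged".
func unchangedWithMetrics(old, next worker) bool {
	return reflect.DeepEqual(old.integration, next.integration) &&
		reflect.DeepEqual(old.metrics, next.metrics)
}

func main() {
	old := worker{integrationConfig{"agent", "foo"}, metricsConfig{"http://localhost:9009/api/prom/push"}}
	next := old
	next.metrics.RemoteWriteURL = "https://prometheus.example.net/api/prom/push"

	fmt.Println("integration-only compare says unchanged:", unchangedIntegrationOnly(old, next)) // true
	fmt.Println("compare including metrics says unchanged:", unchangedWithMetrics(old, next))    // false
}
```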

erikbaranowski commented 1 year ago

I've attempted a number of ways to stop the scraper during the reload, with no success yet. The Prometheus scrape loop that calls the following still executes after the instance returns nil.

https://github.com/grafana/agent/blob/cc2671502b6452005abbdbf31e9a6645069a67b1/pkg/integrations/v2/autoscrape/autoscrape.go#L246

I just saw @tpaschalis's note above that it may be the new autoscraper starting too fast, so I'll approach it from that angle next.

rfratto commented 1 year ago

@erikbaranowski The issue here is likely due to a slow WAL replay time, where an autoscrape happens while a WAL instance is still replaying and isn't initialized yet.
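If that's the case, one fix direction is to gate autoscrape startup on replay completion. Here is a minimal sketch under the assumption that the instance can expose some "ready" signal once WAL replay finishes; none of these names come from the Agent, they only illustrate the gating idea:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// startAutoscrape waits for a "WAL replay finished" signal (or context
// cancellation) before kicking off scraping, instead of scraping into an
// instance that has not finished initializing yet.
func startAutoscrape(ctx context.Context, ready <-chan struct{}, scrape func()) error {
	select {
	case <-ready:
		scrape()
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ready := make(chan struct{})
	go func() {
		time.Sleep(100 * time.Millisecond) // simulate a slow WAL replay
		close(ready)
	}()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	err := startAutoscrape(ctx, ready, func() {
		fmt.Println("autoscrape started after WAL replay")
	})
	if err != nil {
		fmt.Println("autoscrape not started:", err)
	}
}
```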

erikbaranowski commented 1 year ago

I'm removing my assignment (for now) so I'm not blocking on this one. With PTO and other priorities this is temporarily on pause for me.

rfratto commented 7 months ago

Hi there :wave:

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only receive bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)