grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

agent crashes if kafka_exporter fails to connect to kafka #403

Open 7840vz opened 1 year ago

7840vz commented 1 year ago

What's wrong?

The agent stops immediately if it fails to connect to Kafka on start.

I don't think this is the right behavior. The config is valid and this is just a networking issue, so the other metrics/integrations/log collectors should continue working while the kafka_exporter integration tries to reconnect, instead of the failure killing the agent.

Steps to reproduce

Add a config like this:

```yaml
integrations:
  kafka_exporter:
    enabled: true
    kafka_uris:
      - localhost:9092
    scrape_integration: true
    scrape_interval: 15s
```

System information

Linux

Software version

v0.34.3

Configuration

No response

Logs

```
Jul 21 13:35:52 mon-1 systemd[1]: grafana-agent.service: Main process exited, code=exited, status=1/FAILURE
Jul 21 13:35:52 mon-1 systemd[1]: grafana-agent.service: Failed with result 'exit-code'.
Jul 21 13:35:52 mon-1 systemd[1]: grafana-agent.service: Scheduled restart job, restart counter is at 5.
Jul 21 13:35:52 mon-1 systemd[1]: Stopped Grafana Agent.
Jul 21 13:35:52 mon-1 systemd[1]: grafana-agent.service: Start request repeated too quickly.
Jul 21 13:35:52 mon-1 systemd[1]: grafana-agent.service: Failed with result 'exit-code'.
Jul 21 13:35:52 mon-1 systemd[1]: Failed to start Grafana Agent.
Jul 21 13:36:20 mon-1 systemd[1]: Started Grafana Agent.
Jul 21 13:36:21 mon-1 grafana-agent[34170]: ts=2023-07-21T13:36:21.717477616Z caller=exporter.go:214 level=error integration=kafka_exporter msg="Error initiating kafka client: %s" err="kafka: client has run out of available brokers to talk to: dial tcp 127.0.0.1:9092: connect: connection refused"
Jul 21 13:36:21 mon-1 grafana-agent[34170]: ts=2023-07-21T13:36:21.719833625Z caller=manager.go:261 level=error msg="failed to initialize integration. it will not run or be scraped" integration=kafka_exporter err="could not instantiate kafka lag exporter: kafka: client has run out of available brokers to talk to: dial tcp 127.0.0.1:9092: connect: connection refused"
Jul 21 13:36:21 mon-1 grafana-agent[34170]: ts=2023-07-21T13:36:21.723370642Z caller=main.go:72 level=error msg="error creating the agent server entrypoint" err="failed applying config: not all integrations were correctly updated"
Jul 21 13:36:21 mon-1 systemd[1]: grafana-agent.service: Main process exited, code=exited, status=1/FAILURE
Jul 21 13:36:21 mon-1 systemd[1]: grafana-agent.service: Failed with result 'exit-code'.
```
marctc commented 1 year ago

Thanks for reporting @7840vz. That's indeed undesired behavior, and I suspect it might happen in other integrations. In this case I was able to reproduce it, and it's either an issue that has to be fixed upstream or one that's already solved in a version we are not using. Do you see the same behavior if you run the exporter manually?

7840vz commented 1 year ago

Sorry, I haven't tried running the exporter separately.

If I run the exporter separately and it shows the same behaviour, I can at least set up the exporter's container or systemd service to always restart after a crash, and that's OK, since the exporter doesn't do anything else. Here that wouldn't help, as I need the agent's other integrations/scrapes to keep working while Kafka is unavailable.
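For the standalone-exporter workaround mentioned above, a systemd drop-in along these lines would do it (the unit name `kafka-exporter.service` is illustrative, not from this thread):

```ini
# /etc/systemd/system/kafka-exporter.service.d/restart.conf
[Service]
Restart=always
RestartSec=5
```

As noted, though, this only helps for a single-purpose exporter process; restarting the agent this way takes down all the other integrations with it, and systemd's start-rate limiting (visible in the logs above as "Start request repeated too quickly") eventually stops retrying.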

marctc commented 1 year ago

I have manually checked whether other integrations have similar behavior, but that doesn't seem to be the case (which is good). I'm fairly sure that pulling in the latest kafka_exporter changes would fix this. The work to do that is ongoing and can be tracked here: grafana/alloy#464

github-actions[bot] commented 1 year ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!

marctc commented 11 months ago

I double-checked and the problem happens upstream: the exporter simply crashes if it can't connect to a Kafka broker. @7840vz I'd suggest raising the issue with the original exporter to see if they are up for a fix.

BurningDog commented 11 months ago

I had the same issue when attempting to connect to Caddy via scrape_configs. I'd done some reconfiguring of networks in my docker-compose file and forgot to restart the Caddy service, so the agent couldn't reach it any more.

The error message `level=error msg="error creating the agent server entrypoint" err="failed applying config: not all integrations were correctly updated"` is unhelpful: it doesn't say which integration failed or why. The only way to debug a network issue inside the container is to remove each integration one at a time and restart the container.

rfratto commented 6 months ago

Hi there :wave:

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)