Rudd-O opened 7 years ago
Disappears after 5 minutes:
level=info ts=2017-10-03T11:00:17.657654193Z caller=proxy.go:104 msg=Listening address=:8080
level=info ts=2017-10-03T11:00:18.079916119Z caller=coordinator.go:110 msg=WaitForScrapeInstruction fqdn=
level=error ts=2017-10-03T11:00:22.815177936Z caller=proxy.go:97 msg="Responded to /clients" client_count=1
level=info ts=2017-10-03T11:01:17.657898105Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:02:17.657956693Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:03:17.657939787Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:04:17.657922462Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:05:17.657942618Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:06:17.657953134Z caller=coordinator.go:179 msg="GC of clients completed" deleted=1 remaining=0
Whatever endpoint Prometheus did not scrape gets garbage-collected after five minutes. This means that a Prometheus outage of more than five minutes makes the proxy think the app has disappeared altogether.
Can the GC thread clean up and close the connection so the client can reconnect if it's still alive? The client isn't getting any signal that the connection has been closed, and thus never attempts to reconnect.
Thanks for highlighting and debugging this issue!
I'm working on a fix for this now to join the paths correctly and avoid this bug.
Awesome, but this is the bug about the disappearing client. The improperly joined URL is #9.
The stderr log shows the client disappearing, deleted=1 and remaining=0, but somehow when I try to scrape again, bam, the scrape worked. I am closing this but note that the message is misleading.
Sorry about that. I meant to comment on issue #9 !
I'll take a look into this as well. Thanks for reporting!
I think this issue should be reopened because the current behaviour is not consistent. When Prometheus has not scraped a client for a few minutes, the client disappears from the /clients list until it restarts or Prometheus scrapes it again. Since /clients is often used to generate the scrape configuration for Prometheus, the disappeared clients are also dropped from the configuration. So the only way to restore the system to a functional state is to restart the client.
In my understanding, a client should only be dropped from the /clients list when it is unreachable or no longer running.
> In my understanding, a client should only be dropped from the /clients list when it is unreachable or no longer running.
Exactly. As it stands it's kinda useless compared to what Brian wrote about fetching it with wget via cron. I have some blackbox exporters and they disappear all the time.
I'm seeing the same issue. It seems like the GC process should close the client connection if it removes it from the config.
Do you want to send a PR?
Sure, I can give it a shot. I was just getting up to speed on the code, so if you've got any suggestions for a good way to fit this in, I'd be happy to hear them. My thoughts so far were to cancel the context when removing a known client, or to renew the timestamp when keepalives are seen. I'm not yet sure how practical either approach is.
Probably better to keep it around if it's still working.
GC of a client that was still running took place out of nowhere:
Client is still running.
Restarting the proxy re-registers the client as the client retries.