prometheus-community / PushProx

Proxy to allow Prometheus to scrape through NAT etc.
Apache License 2.0

Client disappearing #10

Open Rudd-O opened 7 years ago

Rudd-O commented 7 years ago

GC of client that was still running took place out of nowhere:

level=error ts=2017-10-03T09:21:09.446332474Z caller=proxy.go:97 msg="Responded to /clients" client_count=1
level=info ts=2017-10-03T09:21:28.653546534Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T09:22:28.653595779Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T09:23:28.653617421Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T09:24:28.653431992Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T09:25:28.653501368Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T09:26:28.653597843Z caller=coordinator.go:179 msg="GC of clients completed" deleted=1 remaining=0
level=info ts=2017-10-03T09:27:28.653382959Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=0

Client is still running.

Restarting the proxy re-registers the client, because the client retries its connection.

Rudd-O commented 7 years ago

Disappears after 5 minutes:

level=info ts=2017-10-03T11:00:17.657654193Z caller=proxy.go:104 msg=Listening address=:8080
level=info ts=2017-10-03T11:00:18.079916119Z caller=coordinator.go:110 msg=WaitForScrapeInstruction fqdn=
level=error ts=2017-10-03T11:00:22.815177936Z caller=proxy.go:97 msg="Responded to /clients" client_count=1
level=info ts=2017-10-03T11:01:17.657898105Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:02:17.657956693Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:03:17.657939787Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:04:17.657922462Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:05:17.657942618Z caller=coordinator.go:179 msg="GC of clients completed" deleted=0 remaining=1
level=info ts=2017-10-03T11:06:17.657953134Z caller=coordinator.go:179 msg="GC of clients completed" deleted=1 remaining=0

Rudd-O commented 7 years ago

Any endpoint that Prometheus has not scraped is garbage-collected after five minutes. This means that a Prometheus outage of more than five minutes makes the proxy think the app has disappeared altogether.

Rudd-O commented 7 years ago

Can the GC thread clean up and close the connection so the client can reconnect if it's still alive? The client isn't getting any signal that the connection has been closed, and thus never attempts to reconnect.

conr commented 7 years ago

Thanks for highlighting and debugging this issue!

I'm working on a fix for this now to join the paths correctly and avoid this bug.

Rudd-O commented 7 years ago

Awesome, but this is the bug about the disappearing client. The improperly joined URL is #9.

Rudd-O commented 7 years ago

The stderr log shows the client disappearing, deleted=1 and remaining=0, but somehow when I try to scrape again, bam, the scrape worked. I am closing this but note that the message is misleading.

conr commented 7 years ago

Sorry about that. I meant to comment on issue #9!

I'll take a look into this as well. Thanks for reporting!

toerb commented 6 years ago

I think this issue should be reopened because the current behaviour is not consistent. When Prometheus does not scrape a client for a few minutes, the client disappears from the /clients list until the client restarts or Prometheus scrapes it again. Since /clients is often used to generate the scrape configuration for Prometheus, the disappeared clients are also dropped from the configuration. So the only way to restore the system to a functional state is to restart the client.

In my understanding a client should only be dropped from the /clients list, when the client is unreachable or not running anymore.

fajfer commented 6 years ago

> In my understanding a client should only be dropped from the /clients list, when the client is unreachable or not running anymore.

Exactly, now it's kinda useless compared to what Brian wrote about getting it off wget via cron. I have some blackbox exporters and they disappear all the time.

claytono commented 4 years ago

I'm seeing the same issue. It seems like the GC process should close the client connection if it removes it from the config.

brian-brazil commented 4 years ago

Do you want to send a PR?

claytono commented 4 years ago

Sure, I can give it a shot. I was just getting up to speed on the code. If you've got any suggestions on what might be a good way to fit this in, I'd be happy to hear them. My thoughts so far were to try to cancel the context when removing a known client, or to try to renew the timestamp when keepalives are seen. I'm not yet sure how practical either approach is at this point.

brian-brazil commented 4 years ago

Probably better to keep it around if it's still working.