Adjust healthcheck for `prometheus_remote_write`

jszwedko commented 3 years ago

Broken off of https://github.com/timberio/vector/pull/8269#pullrequestreview-706306727

The prometheus_remote_write sink currently makes a GET request to the endpoint for the health check. This fails for Prometheus itself as its healthcheck URL is /-/healthy. The remote write protocol does not describe a health check mechanism so it is unlikely that there is an HTTP request that would work for all applications that support the remote write protocol. Instead, I think we may want to just check for connectivity.
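As a rough illustration of what "just check for connectivity" could mean, here is a minimal Python sketch that only opens a TCP connection to the host and port of the configured endpoint and sends no HTTP request. The function name `can_connect` and the default-port fallback are assumptions made for this example; this is not Vector's implementation.

```python
import socket
from urllib.parse import urlsplit


def can_connect(endpoint: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint's host:port succeeds.

    Hypothetical helper: no HTTP request is sent, so the result does not
    depend on which paths the remote write receiver happens to serve.
    """
    parts = urlsplit(endpoint)
    host = parts.hostname
    if host is None:
        # No host could be parsed out of the endpoint string.
        return False
    # Assumption: fall back to the scheme's conventional port.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

A check like this only shows that something is listening at the address. It says nothing about whether that service accepts remote write requests.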

cc/ @bruceg in case you know something I don't.

spencergilbert commented 3 years ago

I don't think we currently do this anywhere, but we could expose more configuration for healthcheck and provide a default (/-/healthy) and allow the user to change it if whatever target they're using has a different endpoint?

We'd want to check that other targets do have different endpoints, but I could see this being useful in other sinks as well (loki with Grafana Cloud not exposing the healthcheck endpoint).

bruceg commented 3 years ago

@jszwedko that matches my knowledge as well. The remote write protocol only has one action, and that is to write metrics. We could possibly submit a request with zero events in it, but I have no idea if that wouldn't cause an error due to being empty. We would have to play with that to determine if it is viable.
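For anyone who wants to experiment with the zero-events idea, a hedged Python sketch of what such a probe request might look like follows. It assumes that an empty protobuf `WriteRequest` serializes to zero bytes and that the Snappy block encoding of zero bytes is the single byte `0x00`. The headers are the ones the remote write 1.0 specification asks senders to set. The helper name `build_empty_write_request` is made up for this example, and whether any given receiver accepts an empty request is exactly the open question above.

```python
import urllib.request

# Assumption: an empty WriteRequest is zero bytes of protobuf, and Snappy's
# block format encodes zero bytes as a lone varint length of 0.
EMPTY_SNAPPY_BODY = b"\x00"


def build_empty_write_request(endpoint: str) -> urllib.request.Request:
    """Build (but do not send) a remote write POST that carries no series."""
    return urllib.request.Request(
        endpoint,
        data=EMPTY_SNAPPY_BODY,
        method="POST",
        headers={
            "Content-Encoding": "snappy",
            "Content-Type": "application/x-protobuf",
            "X-Prometheus-Remote-Write-Version": "0.1.0",
        },
    )
```

Passing the result to `urllib.request.urlopen` would actually send it. A receiver might answer with a 2xx or reject the empty body with a 4xx, so the response would need to be checked against each target.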

spencergilbert commented 3 years ago

I noted in Slack as well, but there will be differences in the endpoints per remote_write target. We may want to create a healthcheck.endpoint option; I imagine that could be useful for other sinks as well.

zamazan4ik commented 2 years ago

@spencergilbert @jszwedko @bruceg from your point of view, is it ok to add support for a custom healthcheck.endpoint to prometheus_remote_write? I think I can try to implement it. It definitely would be useful for users.

spencergilbert commented 2 years ago

Talking with @bruceg, I think we'd support a PR for that; it should also close https://github.com/vectordotdev/vector/issues/13890. I think healthcheck.endpoint is a suitable name for the option, and it could either default to None to re-use the endpoint configuration, or have a reasonable default value.

spencergilbert commented 2 years ago

Another followup, talking with @jszwedko - perhaps healthcheck.path could be better if we're expecting the base of the address to stay the same and just vary the path, as it would save the user repeating most of the endpoint option.
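To make the healthcheck.path idea concrete, here is a small Python sketch of how a health check URL could be derived from the existing endpoint option. The function name `healthcheck_url` and the choice to fall back to the endpoint itself when no path is given are assumptions for illustration; neither reflects an existing Vector option.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def healthcheck_url(endpoint: str, path: Optional[str] = None) -> str:
    """Derive the health check URL from the sink endpoint.

    Hypothetical behaviour: with no path configured, re-use the endpoint
    as-is; otherwise keep scheme and host but swap in the given path.
    """
    if path is None:
        return endpoint
    if not path.startswith("/"):
        path = "/" + path
    parts = urlsplit(endpoint)
    # Drop the endpoint's own query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

With this shape, `healthcheck_url("http://localhost:9090/api/v1/write", "/-/healthy")` gives `http://localhost:9090/-/healthy`, so the user only repeats the part that differs.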

Would be curious what your opinion is as a user @zamazan4ik.

zamazan4ik commented 2 years ago

> Another followup, talking with @jszwedko - perhaps healthcheck.path could be better if we're expecting the base of the address to stay the same and just vary the path, as it would save the user repeating most of the endpoint option.

That is an interesting question. On the one hand, specifying only a path avoids duplicating the rest of the URI and would prevent configuration errors like "changed the remote URI but forgot to change the corresponding health check URI". On the other hand, some really tricky configurations are possible in real life. E.g. the remote URI and the corresponding healthcheck URI can be located on different servers behind different reverse proxies. In that case we would not be able to configure it if only the path part of the healthcheck URI can be changed. Or a user may even want to define their own dedicated health server, which implements some logic for calculating health.

Since I have not heard before of such custom setups with a different healthcheck URI, I guess we can start with the healthcheck.path way. Even if it is less flexible, it will reduce the chance of misconfiguration. Later, if there are requests from users for more flexibility, we can think about it and add an additional option or refactor the existing one somehow.

Prajwalprakash3722 commented 6 months ago

This issue still persists. Are there any alternatives or hacky fixes?

zamazan4ik commented 6 months ago

> This issue still persists. Are there any alternatives or hacky fixes?

AFAIK, no fixes yet in this field.

akuzia commented 1 month ago

This feature would make Vector one of the best tools for pushing metrics for edge/IoT and other uses with limited drive space. Prometheus agent doesn't really function properly in that scenario.

Adjust healthcheck for `prometheus_remote_write` #8279