honeycombio / helm-charts

Helm repository and charts for Honeycomb
Apache License 2.0

Do we need `/x/alive` for Refinery readiness check? #150

Closed (bixu closed 2 years ago)

bixu commented 2 years ago

https://github.com/honeycombio/helm-charts/blob/555231a5ffaf59c815397007a48726f434b81132/charts/refinery/templates/deployment.yaml#L84

I'm comparing the line above to the docs: https://docs.honeycomb.io/manage-data-volume/refinery/scale-and-troubleshoot/#xalive

MikeGoldsmith commented 2 years ago

~The Refinery API only has the /alive endpoint. The docs should be updated to remove the /x/alive endpoint.~

The /x/alive endpoint is proxied to the Honeycomb API, so it verifies whether the Refinery cluster can communicate with the Honeycomb service.

I'm unsure it's a good idea to use a proxied request as the check for whether a node is considered available. Refinery nodes are designed to recover from intermittent network outages.

MikeGoldsmith commented 2 years ago

For now, I think using the /x/alive endpoint is not a good idea. We have seen Refinery struggle to cope with irregular cluster topology changes, and this could exacerbate the problem. Plus, if there were a Honeycomb API outage, we wouldn't want a Refinery cluster to take itself down: the cluster nodes should stay stable and utilise other tools (e.g. retries and memory limiting) to protect themselves until they can deliver telemetry to Honeycomb.
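
To make the trade-off concrete, here is a minimal sketch of a readiness probe that checks only the local /alive endpoint. It is not copied from the chart's deployment.yaml; the port number and the timing values are assumptions chosen for illustration.

```yaml
# Sketch of a container readiness probe; not the chart's actual template.
readinessProbe:
  httpGet:
    # /alive is answered by the Refinery process itself, so the result
    # does not depend on the Honeycomb API being reachable.
    path: /alive
    # Assumption: Refinery's HTTP listener is on port 8080 in this pod.
    port: 8080
  # Assumed timings, for illustration only.
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```

Kubernetes removes a pod from Service endpoints while its readiness probe is failing. If the path above were /x/alive, a Honeycomb API outage would therefore mark every Refinery pod unready at the same time, which is the outcome described above as undesirable.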