Enhance default readiness probe to check health status by parsing the payload (requires curl)

EmilHorne commented 11 months ago

Requirements

[no] I have added test coverage for new or changed functionality
[X] I have followed the repository's pull request submission guidelines
[n/a] I have validated my changes against all supported platform versions

Related issues

See https://github.com/launchdarkly/ld-relay/pull/259 on the main ld-relay project.

Describe the solution you've provided

The revised readiness probe will not put the relay into service (make it discoverable) until its /status payload contains the string "healthy", which is unique to the top level "status" property (per-environment states are connected/disconnected, never healthy). It will also remove the relay from service (make it no longer discoverable for net-new connections) after a single probe response that does not contain "healthy" (e.g., top-level status can be "degraded" under various conditions including if any one environment is not connected).

Describe alternatives you've considered

See https://github.com/launchdarkly/ld-relay/pull/259 for a potential alternative.

Additional context

I've tested that this works if curl is included in the image. If curl isn't in the image, it will not work. I am unsure how to account for older relay images.

I haven't tested what happens to already-established SDK connections if the readiness state transitions from true to false (probe fails). I'm assuming they will not be interrupted as my understanding of the readiness state is that it only affects discoverability.

I am unsure if the version in chart.yaml should be incremented for this change, as only values.yaml has changed, not chart.yaml.

The validation check here will need to be changed to match the new readiness probe: https://github.com/launchdarkly/ld-relay-helm/blob/dca08b0d9cdf457b9e43153fdbe375834df8b8ba/test/deployment_test.go#L284

I'm not sure how to represent the new probe definition in that test code.

louis-launchdarkly commented 11 months ago

Hello @EmilHorne,

We will consider building a new endpoint more, but the current suggestion may not be generally applicable to all users.

For instance if the readiness probe checks for the word healthy, that means if there is a network outage or LD outage after Relay is initialized, then by the time Relay realizes the connection is gone (the default is 5 minutes), the /status will return degraded. That would trigger K8s taking the whole fleet of Relays out of serving requests even though they all still have the latest flag information that can serve all your SDKs and new SDKs that try to get online during the outage.

EmilHorne commented 11 months ago

Good point.

The current readiness probe behavior is acceptable except for the scenario where, during an LD outage, a newly started SDK connects to a relay that has no source of flag data for the environment of interest.

In that scenario we want the SDK to look for a different relay.

That scenario would be better handled from the SDK side but to do that requires a change to the relay behavior ensuring the relay always sends a 503 when it does not yet have flag data for the environment that the client is joining.

According to the documentation, when the relay has no source of flag data, it still accepts streaming connections:

The Relay Proxy accepts all streaming requests from SDKs, but they will not receive any data until the Relay Proxy has received flag data from LaunchDarkly.

louis-launchdarkly commented 11 months ago

Currently, the correct but understandable pretty inconvenient way is to do a more detailed parsing of the response from the /status endpoint, as for each environment, the response will contain state for connectionStatus and dataStoreStatus.

If you are not talking about sharing a persistent store (where Relay can have data from the store initialized by something else, then the definition of Ready is either

connectionStatus state equals to VALID or
connectionStatus state is not equal to INITIALIZING (so it has finished initialized at least once) and dataStoreStatus state equals VALID (so it has data to serve from the persistent data store).

We will look into the different possibilities to improve how Relay or the SDK behave so there is at least a clear recommended way to address this (beyond having the Readiness project make an actual streaming connection or an eval request to check)

launchdarkly / ld-relay-helm

Enhance default readiness probe to check health status by parsing the payload (requires curl) #48