barakbd opened this issue 2 years ago
I noticed that an ERROR log was generated even without making a connection or a feature flag request from an SDK. Does the relay proxy send a heartbeat, or does the server make a call to the LDRP? In other words, how does the LDRP know the auto-config key changed?
That's because the Relay Proxy maintains a streaming connection to LaunchDarkly for auto-config information updates, not unlike the other streaming connection that is used to get flag data updates. This allows it to immediately detect changes like adding or removing an environment from your auto-config profile on the dashboard.
If you change the key, then LaunchDarkly immediately closes that connection— since you would presumably not want anyone who had previously connected with that old key to keep on receiving SDK keys and other information via the auto-config stream. That would be a security violation.
Anyway... regarding your suggestion that the Relay Proxy should quit if this happens, I think we'll have to discuss it internally. Personally I'm not convinced that this is right— I mean, on the one hand the Relay Proxy is not going to be able to keep functioning in any useful way if this has happened, but on the other hand, we have never before defined any circumstance in which it would just terminate the process after having previously had a successful startup, no matter what happened, so this might be surprising/undesirable behavior for some users.
What I'm not sure about is whether this condition is currently detectable via external monitoring. I mean, the error log message is meant to help get your attention in that regard, but if for instance the /status resource doesn't also reflect that there's a problem, it probably should. So I'll look into that.
The /status endpoint keeps returning 200 and reporting the environments. It should not. BTW, we are using that endpoint as the livenessProbe in K8S, so if it did not return 200, K8S would automatically kill the pod.
There should be 2 fixes:
1. /status should not return 200

Well, there are three different issues here.
First, yes, we want this failure mode to be discoverable in some way.
Second, regarding the HTTP semantics of the /status resource. It's been consistently the behavior of the Relay Proxy that the /status resource returns 200 as long as the Relay Proxy is actually able to receive and respond to requests on that endpoint. We do not use a status like 503 as a response from that resource, because it would be misleading in terms of HTTP semantics, and also because we want the caller to be able to see details of the service's operational status, not just "it's broken, we won't give you any status details beyond that". Existing users of the Relay Proxy rely on this behavior. So, if we were to provide a way for a monitoring service to make a simple "yes it's fine" vs. "it's bad, kill it" decision based solely on the HTTP response status, we would do that by adding a new endpoint, not by changing the semantics of the current endpoint.
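To make that concrete, here is a minimal sketch of an external monitoring check that treats a 200 from /status only as "the process is reachable" and bases the health decision on the JSON body instead. This is not part of the Relay Proxy itself; the port and the "status"/"environments" field names are assumptions based on the documented /status payload and may differ between versions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// relayStatus models only the fields this check cares about. The field names
// are assumptions based on the documented /status payload and may differ
// between Relay Proxy versions.
type relayStatus struct {
	Status       string `json:"status"` // e.g. "healthy" or "degraded"
	Environments map[string]struct {
		Status string `json:"status"` // e.g. "connected" or "disconnected"
	} `json:"environments"`
}

func main() {
	// 8030 is the Relay Proxy's default port; adjust for your deployment.
	resp, err := http.Get("http://localhost:8030/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, "relay unreachable:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var s relayStatus
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		fmt.Fprintln(os.Stderr, "could not parse /status body:", err)
		os.Exit(1)
	}

	// A 200 only means the relay answered; the body carries the real detail.
	if s.Status != "healthy" {
		fmt.Fprintf(os.Stderr, "relay reports overall status %q\n", s.Status)
		os.Exit(1)
	}
	for key, env := range s.Environments {
		if env.Status != "connected" {
			fmt.Fprintf(os.Stderr, "environment %s is %s\n", key, env.Status)
			os.Exit(1)
		}
	}
	fmt.Println("relay looks healthy")
}
```

A Kubernetes probe could run a check like this (for example as an exec probe) instead of pointing an httpGet probe directly at /status, so that a degraded relay is distinguishable from a dead one.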
However, since the Relay Proxy is capable of doing more than one thing, this raises a third question of how broken would it have to be before we say "it's bad, kill it." I think in fact I was wrong in my previous comment to say that "the Relay Proxy is not going to be able to keep functioning in any useful way if this has happened [i.e. if the auto-config key has been changed]". It still possesses valid SDK keys of whatever environments it was using up till that point (assuming those keys have not also been rotated). It has only lost the ability to automatically detect changes to the configuration. It is not self-evident that it ought to stop serving flags to existing clients in this case; maybe that's what you want, but it isn't necessarily what everyone would want. There is already a different way to indicate "I don't want existing services that possess this SDK key to be able to continue to get flags", and that is to rotate the SDK key.
So I think we will need to do some more internal discussion and the answer may be that this should be configurable behavior. Our customers have many different approaches to deployment and monitoring.
I tested the ability of the LDRP to relay feature flags once the auto-config key has changed, using this example repo: https://github.com/launchdarkly/hello-js. I added color to the example code, so when the feature flag is "On" (which it is) the text should turn green and say "Showing your feature to bob@example.com".
The LDRP is not able to serve feature flag values and requests return with an error.
Well, at this point I'm pretty confused. I'm the primary maintainer of the Relay Proxy code, but there's a lot of it so I certainly may have missed or forgotten something. And I haven't yet tried to replicate the test you described. But from my reading of the code right now... I do not see a way for it to do what you're saying it does. It looks to me like it only deactivates an environment if LaunchDarkly specifically told it that that environment was removed from the configuration, or that that SDK key (as in the environment key, not the auto-config key) was deactivated, and I don't see any code path from "auto-config stream was disconnected and then got a 401 error" to deactivating the environments, even if it would arguably be logical for it to do so. The code I'm looking at seems to match the intended behavior as I remembered it, which was "quit on startup if the auto-config key is bad, but after that, any errors on the auto-config stream just mean we won't get updates any more."
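For what it's worth, here is an illustrative, heavily simplified sketch of the behavior just described. It is not the actual ld-relay source, and every name in it is hypothetical; it only shows the distinction between events that deactivate an environment and an auto-config stream failure that merely stops updates.

```go
package main

import "log"

// Illustrative sketch only: NOT the actual ld-relay code; all names are hypothetical.

type eventKind int

const (
	environmentRemoved eventKind = iota // environment removed from the auto-config profile
	sdkKeyDeactivated                   // that environment's SDK key was deactivated
	streamError                         // auto-config stream failed, e.g. a 401 after the key changed
)

type autoConfigEvent struct {
	Kind  eventKind
	EnvID string
}

func deactivateEnvironment(envID string) {
	log.Printf("deactivating environment %s", envID)
}

// handleAutoConfigEvent mirrors the described behavior: only an explicit
// message from LaunchDarkly deactivates an environment; a broken auto-config
// stream just means no further configuration updates arrive.
func handleAutoConfigEvent(e autoConfigEvent) {
	switch e.Kind {
	case environmentRemoved, sdkKeyDeactivated:
		deactivateEnvironment(e.EnvID)
	case streamError:
		log.Print("ERROR: auto-config stream failed; configuration updates are suspended")
	}
}

func main() {
	handleAutoConfigEvent(autoConfigEvent{Kind: streamError})
}
```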
I am happy to get on a Zoom call to test together. You can email me at barak.ben.david@anaplan.com and we can schedule.
@barakbd Thank you but I do need to begin with some testing on my own here, as well as further review of the code— and also, it's not clear yet how high a priority this is, not to mention that there is still not agreement on the desired behavior. So I can't carve out time to collaborate closely like that.
Any updates please?
The issue is we need a way to alert when the key is invalid, as from my testing, the LDRP will stop working.
We could possibly rely on the logs; however, this is cumbersome. It would also make sense to have the container exit, as a new deployment using the updated secret (auto-config key) would be needed anyway. If the container exits, we can alert based on the k8s.container.ready metric (an OTEL Collector metric).
Hope this makes sense.
Hello @barakbd, thank you for reaching out to us. I think it will be helpful to understand your use case a bit more through a support ticket; do you mind creating one with us?
Support ticket created: #20701
@eli-darkly - I just retested and you are correct. The LDRP continues to serve requests until the container is killed and tries to restart, at which point it fails with a non-zero code. It seems this is the desired behavior; however, I can see this causing confusion when someone updates the configuration (adds/removes a Project/Environment) and the LDRP does not respect the new configuration.
Describe the bug
Container does not exit when auto config key is incorrect due to a configuration reset.
To reproduce
Expected behavior
The container should exit with a non-zero code. See related issue: https://github.com/launchdarkly/ld-relay/issues/165
Logs
Docker Image
https://hub.docker.com/layers/launchdarkly/ld-relay/latest/images/sha256-8932028a51c815a016b73a265c5102ec9bfa99125464a9470e6e44768601a4df?context=explore
Notes
I noticed that an ERROR log was generated even without making a connection or a feature flag request from an SDK. Does the relay proxy send a heartbeat, or does the server make a call to the LDRP? In other words, how does the LDRP know the auto-config key changed?