Closed kenanelgaouny closed 11 months ago
Hello @kenanelgaouny, thank you for the report, we will look into this. By the way, was this reproducible if you only have 1 Relay instance? Also, is the data from the Redis wiped when you disconnect the Relay from Redis?
Filed internally as 202755.
Hi @louis-launchdarkly,
I could not reproduce the issue with a single Relay. The issue seems to happen when the SDK detects concurrent Redis updates.
The data in Redis is not wiped on disconnect, but when the Relay re-initializes, the data gets overwritten.
Hello @kenanelgaouny, we have a few more follow-up questions:
@louis-launchdarkly I have 1 environment with ~861 flags, and around ~151 segments.
@louis-launchdarkly I was able to work around the issue with this change: https://github.com/launchdarkly/go-server-sdk/compare/v6...kenanelgaouny:go-server-sdk:patch-2
Basically, when the cache TTL is infinite, we store the update in the in-memory cache instead of clearing it and reloading from the data store.
I'm not sure if this is the ideal solution but it should help.
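The idea in the linked patch can be sketched roughly as follows. This is an illustrative stand-in, not the SDK's real types or code: with an infinite TTL the update is written into the in-memory cache directly, instead of evicting the entry and re-reading it from the (possibly unreachable) data store.

```go
package main

import "fmt"

// item is a simplified stand-in for an SDK data item (flag or segment).
type item struct {
	Key     string
	Version int
}

// cachedStore is a hypothetical sketch of a persistent-store wrapper's
// in-memory cache; the names are illustrative, not the SDK's real API.
type cachedStore struct {
	infiniteTTL bool
	cache       map[string]item
}

// upsert mirrors the workaround: with an infinite TTL, keep the cache
// authoritative by applying the update in place; with a finite TTL,
// evict so the next read reloads from the data store.
func (s *cachedStore) upsert(it item) {
	if s.infiniteTTL {
		s.cache[it.Key] = it
		return
	}
	delete(s.cache, it.Key)
}

func main() {
	s := &cachedStore{infiniteTTL: true, cache: map[string]item{}}
	s.upsert(item{Key: "my-flag", Version: 2})
	fmt.Println(len(s.cache)) // prints 1: the flag stays cached
}
```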
Hello @kenanelgaouny, thank you for the suggestion. I am investigating and discussing with the team what the correct behavior should be. I think the change you proposed will help with the data-wipe issue, but there may be an edge case where an older change overwrites a newer change in the in-memory cache (updateSingleItem doesn't check the version, and we think there are cases that can cause SSE events to arrive out of order). The change is still really helpful in pointing out where we need to investigate.
I will give another update here once we have the correct behavior nailed down.
The team has discussed this, and it is a bigger redesign issue: there are many interleaving cases when multiple SDKs/Relay Proxies try to write to a disconnected data store in infinite cache mode. Infinite cache mode assumes the cache always maintains the latest state, but our persistent store wrapper relies on the actual data store to handle message versioning. So for the long-term fix, we need to ensure that the in-memory cache update operation also respects the version field in the streaming messages.
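A version-respecting in-memory update along the lines described could look like this sketch (simplified and hypothetical, not the SDK's actual code): the cache only accepts an update whose version is newer than what it already holds, so an out-of-order SSE event cannot roll a newer item back.

```go
package main

import "fmt"

// item is a simplified stand-in for an SDK data item (flag or segment).
type item struct {
	Key     string
	Version int
}

// versionedUpsert applies an update to the in-memory cache only if it is
// newer than the cached copy, and reports whether it was applied.
func versionedUpsert(cache map[string]item, it item) bool {
	if old, ok := cache[it.Key]; ok && old.Version >= it.Version {
		return false // stale event: ignore it
	}
	cache[it.Key] = it
	return true
}

func main() {
	cache := map[string]item{}
	versionedUpsert(cache, item{Key: "my-flag", Version: 3})
	applied := versionedUpsert(cache, item{Key: "my-flag", Version: 2}) // arrives late
	fmt.Println(applied, cache["my-flag"].Version)                      // prints "false 3"
}
```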
That is a much larger change, so in the meantime, could you use a very long cache expiration time instead of infinite cache mode? While the in-memory state can't be updated when the data store is down, the streaming connection will reconnect once the data store is back up, which ensures the Relay serves a consistent, up-to-date state to any SDKs connecting to it.
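If that workaround fits, the Relay configuration might look like the following sketch, assuming ld-relay's environment-variable configuration style; the 24h value is just an example of a "really long" finite TTL.

```shell
# Illustrative ld-relay settings for a long finite cache TTL
# instead of an infinite cache (negative duration).
USE_REDIS=1
REDIS_URL=redis://redis:6379
CACHE_TTL=24h   # long, but finite: the SDK will re-sync with the store
```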
This issue is marked as stale because it has been open for 30 days without activity. Remove the stale label or comment, or this will be closed in 7 days.
I see that this issue was closed. We recently hit this issue, which caused a number of errors related to serving default values instead of the correct flag values. I'm assuming there has still been no movement on this issue and no later changes that would affect this behavior?
Hello @williambanfield, there is no publicly visible change for the Go Server SDK yet. We designed https://github.com/launchdarkly/cpp-sdks with this in mind, so the latest C++ SDK can't overwrite the data in the persistent store incorrectly, but we are still working on a design for better interaction with the persistent store in general.
In the meantime, either setting a large TTL or having each Relay instance use a non-shared persistent store is the current suggested workaround.
Thank you for the response @louis-launchdarkly. I'm trying to understand the TTL behavior. Does the in-memory store only update its flag values when the TTL expires, or does it update them as updates are delivered from the LaunchDarkly flag network? My understanding is that no updates from the flag network will be absorbed at all while the persistent store is unreachable, but I'm not sure I understand the behavior when the persistent store is connected. If flag updates to the in-memory store are only absorbed as the old value's TTL expires, setting a very long TTL is not a very viable option.
Hi @williambanfield,
Given: the cache TTL is infinite (e.g. cacheTTL = -1),
Then: the SDK will not update its in-memory cache when updates are received from LaunchDarkly SaaS. The rationale is that it's better for the SDK to keep performing evaluations on an old but complete set of data over time than to serve an inconsistent or partial one.
Given: the cache TTL is finite,
Then: when an update is received from LaunchDarkly SaaS, the SDK will write it to the persistent store and refresh its in-memory cache.
Would you mind giving a rough outline of how you are using these components? Are you using Relay w/ persistent store, or just the Go Server SDK independently and a persistent store?
Thanks @cwaldren-ld,
> Are you using Relay w/ persistent store, or just the Go Server SDK independently and a persistent store?
We're using a relay proxy with a Redis persistent store. We are then pointing services at the relay proxy.
> Then the SDK will do the following when an update is received from LaunchDarkly SaaS:
So that I understand, this happens whenever data is received, not just at every TTL interval, is that correct?
> So that I understand, this happens whenever data is received, not just at every TTL interval, is that correct?
That should be the case.
You may find these diagrams helpful for some context:
Describe the bug
We are using a k8s deployed relay-proxy with a redis datastore with
cacheTTL= -1
The Redis data store is shared across pods. Redis is initialized correctly on startup with the correct number of flags. When the Redis connection is lost and events are received from LaunchDarkly, the server SDK appears to lose its in-memory cache and re-initializes the data store with only a subset of the flags.
When this happens, new connections to the relay proxy do not receive the full flag set.
To reproduce
cacheTTL: -1
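A sketch of the setup that reproduces this, assuming ld-relay's environment-variable configuration style, with two Relay pods pointing at the same Redis instance; the values here are illustrative.

```shell
# Hypothetical per-pod ld-relay settings: shared Redis, infinite local cache.
USE_REDIS=1
REDIS_URL=redis://shared-redis:6379
CACHE_TTL=-1s   # a negative duration means the local cache never expires
```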
Expected behavior: Redis should be re-initialized with all flags, and the in-memory cache should keep all its values.
Logs: relay-1-logs.txt, relay-2-logs.txt
SDK version: github.com/launchdarkly/go-server-sdk/v6 v6.1.0, github.com/launchdarkly/ld-relay/v7 v7.2.4
Language version, developer tools: Go 1.20
OS/platform: Alpine 3.18.0
Additional context: The issue might be related to the clearing of the cache here