launchdarkly / go-server-sdk

LaunchDarkly Server-side SDK for Go

Unexpected Redis Initialization Behaviour With Multiple Relay Proxies #92

Closed kenanelgaouny closed 11 months ago

kenanelgaouny commented 1 year ago

Describe the bug

We are using a Kubernetes-deployed Relay Proxy with a Redis data store and cacheTTL = -1. The Redis data store is shared across pods.

Redis is initialized correctly on startup with the correct number of flags. When the Redis connection is lost and events are received from LaunchDarkly, the server SDK appears to lose its in-memory cache and re-initialize the data store with only a subset of the flags.

When this happens, new connections to the Relay Proxy do not receive the full flag set.

To reproduce

  1. Create multiple Relay instances (e.g. 5) with a shared Redis and cacheTTL: -1.
  2. Cut off the connection to Redis after it has already been initialized.
  3. Generate multiple events (e.g. flag changes) in LaunchDarkly.
  4. Re-enable the Redis connection.
  5. Redis gets re-initialized with only some of the flags, and the Relay Proxy then serves only those flags.

Expected behavior Redis should be re-initialized with all flags, and the in-memory cache should keep all of its values.

Logs relay-1-logs.txt relay-2-logs.txt

SDK version github.com/launchdarkly/go-server-sdk/v6 v6.1.0 github.com/launchdarkly/ld-relay/v7 v7.2.4

Language version, developer tools Go 1.20

OS/platform Alpine 3.18.0

Additional context The issue might be related to the clearing of the cache here
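
For reference, this is roughly how an infinite-cache Redis configuration looks when building a Go SDK client directly. This is only a sketch: the Redis integration module path/version and the Redis address below are assumptions, not our exact setup.

```go
package main

import (
	"time"

	ldredis "github.com/launchdarkly/go-server-sdk-redis-redigo/v2"
	ld "github.com/launchdarkly/go-server-sdk/v6"
	"github.com/launchdarkly/go-server-sdk/v6/ldcomponents"
)

func main() {
	var config ld.Config
	// CacheForever() is the SDK-level equivalent of Relay's cacheTTL: -1.
	// Items are cached in memory indefinitely, and the persistent store is
	// read at initialization time rather than on every evaluation.
	config.DataStore = ldcomponents.PersistentDataStore(
		ldredis.DataStore().URL("redis://shared-redis:6379"), // hypothetical shared Redis address
	).CacheForever()

	client, err := ld.MakeCustomClient("sdk-key", config, 5*time.Second)
	if err != nil {
		panic(err)
	}
	defer client.Close()
}
```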

louis-launchdarkly commented 1 year ago

Hello @kenanelgaouny, thank you for the report; we will look into this. By the way, is this reproducible if you only have one Relay instance? Also, is the data in Redis wiped when you disconnect the Relay from Redis?

Filed internally as 202755.

kenanelgaouny commented 1 year ago

Hi @louis-launchdarkly,

I could not reproduce the issue with a single Relay. The issue seems to happen when the SDK detects concurrent Redis updates.

The data in Redis is not wiped when disconnected, but when the Relay re-initializes, the data gets overwritten.

louis-launchdarkly commented 1 year ago

Hello @kenanelgaouny, we have a few more follow-up questions:

kenanelgaouny commented 1 year ago

@louis-launchdarkly I have 1 environment with ~861 flags and ~151 segments.

kenanelgaouny commented 1 year ago

@louis-launchdarkly I was able to workaround the issue by doing this: https://github.com/launchdarkly/go-server-sdk/compare/v6...kenanelgaouny:go-server-sdk:patch-2

Basically, when the cache TTL is infinite, we store the update directly in the in-memory cache instead of clearing it and reloading from the data store.

I'm not sure if this is the ideal solution but it should help.
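
To illustrate the idea with a toy model (not the SDK's actual internals): in infinite-cache mode the update is written through to the in-memory cache, instead of the current behavior of invalidating the cached item and relying on the persistent store for the next read.

```go
package main

import "fmt"

// Flag is a stand-in for an SDK data item; Version mirrors LaunchDarkly's item versioning.
type Flag struct {
	Key     string
	Version int
}

// cachedStore is a toy model of the persistent-store wrapper: an in-memory
// cache in front of a (possibly unavailable) persistent store.
type cachedStore struct {
	cache       map[string]Flag
	store       map[string]Flag
	storeOnline bool
	infiniteTTL bool
}

// upsert sketches the two strategies discussed here. With a finite TTL the
// wrapper drops the cached item and relies on the persistent store; with an
// infinite TTL (the proposed workaround) it writes through to the cache, so
// the in-memory state survives a store outage.
func (s *cachedStore) upsert(f Flag) {
	if s.storeOnline {
		if old, ok := s.store[f.Key]; !ok || old.Version < f.Version {
			s.store[f.Key] = f
		}
	}
	if s.infiniteTTL {
		s.cache[f.Key] = f // keep serving from memory even if the store write was lost
		return
	}
	delete(s.cache, f.Key) // finite TTL: the next read reloads from the store
}

func main() {
	s := &cachedStore{
		cache:       map[string]Flag{"my-flag": {Key: "my-flag", Version: 1}},
		store:       map[string]Flag{},
		storeOnline: false, // simulate the Redis outage from the repro steps
		infiniteTTL: true,
	}
	s.upsert(Flag{Key: "my-flag", Version: 2})
	fmt.Println(s.cache["my-flag"]) // {my-flag 2}: the cache kept the latest item
}
```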

louis-launchdarkly commented 1 year ago

Hello @kenanelgaouny, thank you for the suggestion. I am investigating and discussing with the team what the correct behavior should be. I think the change you proposed will help with the data wipe issue, but there may be an edge case where an older change can overwrite a newer change in the in-memory cache (updateSingleItem doesn't check the version, and we think there are cases that can cause SSE events to arrive out of order). Your change is still really helpful for pointing out where we need to investigate.

I will give another update here once we have the correct behavior nailed down.
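
For illustration, here is a minimal sketch of the kind of version guard being described, with made-up names rather than the SDK's actual internals: the in-memory cache only accepts an update that is strictly newer than what it already holds, so an out-of-order event cannot clobber newer data.

```go
package main

import "fmt"

// item is a stand-in for a cached flag or segment with LaunchDarkly's item versioning.
type item struct {
	Key     string
	Version int
}

// upsertCached only replaces the cached copy if the incoming item is strictly
// newer, so an SSE event that arrives out of order is ignored.
func upsertCached(cache map[string]item, incoming item) bool {
	if old, ok := cache[incoming.Key]; ok && old.Version >= incoming.Version {
		return false // stale update: keep what we have
	}
	cache[incoming.Key] = incoming
	return true
}

func main() {
	cache := map[string]item{"my-flag": {Key: "my-flag", Version: 3}}
	fmt.Println(upsertCached(cache, item{Key: "my-flag", Version: 2})) // false: out-of-order update ignored
	fmt.Println(upsertCached(cache, item{Key: "my-flag", Version: 4})) // true: newer update accepted
}
```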

louis-launchdarkly commented 1 year ago

The team has discussed this, and it is a bigger redesign issue: there are many interleaving cases when multiple SDKs/Relay Proxies try to write to a disconnected data store in infinite cache mode. Infinite cache mode assumes the cache always maintains the latest state, but our persistent store wrapper relies on the actual data store to handle the message versioning. So for the long-term fix, we need to ensure the in-memory cache update operation also respects the version label in the streaming messages.

That is a much larger change, so in the meantime, I wonder: can you use a really long cache expiration time instead of infinite cache mode? While the in-memory state can't be updated when the data store is down, the streaming connection will reconnect once the data store is up again, which ensures the Relay serves a consistent, latest state to any SDKs connecting to it.
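
Sketched in SDK-configuration terms (the same assumptions about module paths and the Redis address as in the earlier sketch apply, and the one-hour value is only an example; in Relay this corresponds to setting cacheTTL to a long finite duration):

```go
package main

import (
	"time"

	ldredis "github.com/launchdarkly/go-server-sdk-redis-redigo/v2"
	ld "github.com/launchdarkly/go-server-sdk/v6"
	"github.com/launchdarkly/go-server-sdk/v6/ldcomponents"
)

// longTTLDataStore builds the same Redis store as in the earlier sketch, but
// with a long, finite cache time instead of CacheForever(): cached items
// eventually expire, and once the store is reachable again the stream
// re-syncs, so a consistent, current state is served.
func longTTLDataStore() ld.Config {
	var config ld.Config
	config.DataStore = ldcomponents.PersistentDataStore(
		ldredis.DataStore().URL("redis://shared-redis:6379"), // hypothetical address
	).CacheTime(time.Hour) // the one-hour value is just an example
	return config
}

func main() { _ = longTTLDataStore() }
```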

github-actions[bot] commented 11 months ago

This issue is marked as stale because it has been open for 30 days without activity. Remove the stale label or comment, or this will be closed in 7 days.

williambanfield commented 2 months ago

I see that this issue was closed. We recently hit this issue, which caused a number of errors related to serving default values instead of the correct flag values. I'm assuming there has still been no movement on this issue and no later changes that would impact this behavior?

louis-launchdarkly commented 2 months ago

Hello @williambanfield, there is no publicly visible change for the Go Server SDK yet. We designed https://github.com/launchdarkly/cpp-sdks with this in mind, so the latest C++ SDK can't overwrite the data in the persistent store incorrectly, but we are still working on the design for better interaction with the persistent store in general.

In the meantime, either setting a large TTL or having each Relay instance use a non-shared persistent store is the current suggested workaround.

williambanfield commented 2 months ago

Thank you for the response @louis-launchdarkly. I'm trying to understand the TTL behavior. Does the in-memory store only update its flag values when the TTL expires, or does it update them as updates are delivered from the LaunchDarkly flag network? My understanding is that no updates from the flag network will be absorbed at all while the persistent store is unreachable, but I'm not sure I understand the behavior when the persistent store is connected. If flag updates to the in-memory store are only absorbed as the old value's TTL expires, setting a very long TTL is not a very viable option.

cwaldren-ld commented 1 month ago

Hi @williambanfield,

Given: the persistent store is unreachable.

Then: the SDK will not update its in-memory cache when updates are received from LaunchDarkly SaaS. The rationale is that it's better for the SDK to keep performing evaluations on an old set of data over time than to:

  1. Switch to evaluating based on new data
  2. (TTL expires)
  3. Revert to evaluating based on old data

Given: the persistent store is reachable.

Then the SDK will do the following when an update is received from LaunchDarkly SaaS (a sketch of this flow follows the list):

  1. It attempts to store the update in the persistent store. This operation may succeed even if the data is not actually written, because of the version check in the next step.
  2. The store checks whether the update contains data that is older than what is already in the store. If so, the update is not stored; otherwise it is.
  3. If the update is older, the SDK purges the cache entry and then pulls the item from the persistent store back into the cache.
  4. If the update is newer, the SDK places it into the cache.
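
To make that concrete, here is a toy model of the flow above. It is illustrative only, not the SDK's actual code, and the names are made up.

```go
package main

import "fmt"

// item stands in for a flag or segment with LaunchDarkly's item versioning.
type item struct {
	Key     string
	Version int
}

// wrapper is a toy model of the persistent-store wrapper while the store is reachable.
type wrapper struct {
	store map[string]item // persistent store (e.g. Redis)
	cache map[string]item // in-memory cache
}

// applyUpdate follows the steps described above: try to write to the
// persistent store, let the version check decide whether the update is kept,
// then either cache the newer item or reload the cache entry from the store
// when the update was stale.
func (w *wrapper) applyUpdate(incoming item) {
	old, exists := w.store[incoming.Key]
	if !exists || old.Version < incoming.Version {
		// Newer than what the store has: persist it and cache it.
		w.store[incoming.Key] = incoming
		w.cache[incoming.Key] = incoming
		return
	}
	// Older than what the store has: purge the cached entry and pull the
	// authoritative copy back from the persistent store.
	delete(w.cache, incoming.Key)
	w.cache[incoming.Key] = w.store[incoming.Key]
}

func main() {
	w := &wrapper{
		store: map[string]item{"my-flag": {Key: "my-flag", Version: 5}},
		cache: map[string]item{},
	}
	w.applyUpdate(item{Key: "my-flag", Version: 4}) // stale: cache reloaded from the store
	fmt.Println(w.cache["my-flag"])                 // {my-flag 5}
	w.applyUpdate(item{Key: "my-flag", Version: 6}) // newer: stored and cached
	fmt.Println(w.cache["my-flag"])                 // {my-flag 6}
}
```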

Would you mind giving a rough outline of how you are using these components? Are you using Relay w/ persistent store, or just the Go Server SDK independently and a persistent store?

williambanfield commented 1 month ago

Thanks @cwaldren-ld,

Are you using Relay w/ persistent store, or just the Go Server SDK independently and a persistent store?

We're using a relay proxy with a Redis persistent store. We are then pointing services at the relay proxy.

Then the SDK will do the following when an update is received from LaunchDarkly SaaS:

So that I understand, this happens whenever data is received, not just at every TTL interval, is that correct?

cwaldren-ld commented 1 month ago

So that I understand, this happens whenever data is received, not just at every TTL interval, is that correct?

That should be the case.

You may find these diagrams helpful for some context:

https://github.com/launchdarkly/ld-relay/blob/v8/docs/persistent-storage.md#example-persistent-store-with-30-second-ttl