launchdarkly / dotnet-server-sdk

LaunchDarkly Server-Side SDK for .NET
https://launchdarkly.com

Recovery after data store outage #174

Closed imacleod closed 8 months ago

imacleod commented 1 year ago

Describe the bug

After recovering from even a brief (1-2 second) period of unavailability of the data store (Redis), LdClient's AllFlagsState method returns zero enabled feature flags. After this, enabling a feature flag in the web UI results in only that single flag being returned by the client, rather than all enabled flags.

If this is due to using streaming for feature flag updates, will the data store ever be repopulated with the state of all feature flags if no flag changes are made?

To reproduce

  1. Enable multiple feature flags
  2. Run sample app (provided in Additional context section)
  3. After initialization stop redis
  4. Start redis after two seconds (well within cache duration)
  5. SDK returns zero enabled feature flags (data store is empty)
  6. Turn off flag in web UI (no change, zero enabled feature flags returned)
  7. Turn on flag in web UI
  8. Only the newly enabled feature flag is returned, none of the other enabled flags are, even after the data store cache expires

Adding polling and request cache busting slightly improves the situation. The SDK still returns zero enabled feature flags after the data store outage, but making a flag change does then result in all enabled feature flags being returned from the AllFlagsState method.
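For reference, a rough sketch of how polling can be enabled alongside the same Redis store as in the sample app below (the poll interval shown is illustrative, not necessarily the value used):

var configurationBuilder = Configuration.Builder(sdkKey)
    // Assumption: use polling instead of the default streaming data source.
    .DataSource(Components.PollingDataSource().PollInterval(TimeSpan.FromSeconds(30)))
    .DataStore(
        Components.PersistentDataStore(
                Redis.DataStore().HostAndPort("localhost", 6379))
            .CacheTime(TimeSpan.FromSeconds(20)));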

Expected behavior

After recovering from a data store outage the SDK will repopulate the data store with the state of all feature flags.

SDK version 7.0.3

Language version, developer tools .NET 6.0

OS/platform Ubuntu 22.04

Additional context

Sample app:

using System.Text.Json;
using LaunchDarkly.Logging;
using LaunchDarkly.Sdk;
using LaunchDarkly.Sdk.Server;
using LaunchDarkly.Sdk.Server.Integrations;

const int delayBetweenRequestsMs = 300; 
const string sdkKey = "...";

var runId = 0;

var cancellationTokenSource = new CancellationTokenSource();
var cancellationToken = cancellationTokenSource.Token;

var companyIds = new List<string>()
{
    "...",
    "...",
    "..."
};

// Builds an LdClient that uses Redis as a persistent data store with a 20-second local cache TTL.
LdClient CreateLdClient()
{
    var configurationBuilder = Configuration.Builder(sdkKey)
        .Logging(Components.Logging().Level(LogLevel.Debug))
        .DataStore(
            Components.PersistentDataStore(
                    Redis.DataStore()
                        .HostAndPort("localhost", 6379))
                .CacheTime(TimeSpan.FromSeconds(20)));

    return new LdClient(configurationBuilder.Build());
}

// Returns the keys of all flags whose value evaluates to true in the given flags state.
static List<string> GetLaunchDarklyEnabledFeatureFlagsJson(FeatureFlagsState featureFlagsState)
{
    var enabledFeatureFlagsJson = new List<string>();

    foreach (var (key, value) in featureFlagsState.ToValuesJsonMap())
    {
        if (value.AsBool)
        {
            enabledFeatureFlagsJson.Add(key);
        }
    }

    return enabledFeatureFlagsJson;
}

var launchDarklyClient = CreateLdClient();

// Repeatedly evaluate AllFlagsState for each company context and print the enabled flag keys.
while (!cancellationToken.IsCancellationRequested)
{
    foreach (var companyId in companyIds)
    {
        var task = Task.Run(() =>
        {
            var launchDarklyUserContext = Context
                .Builder(Guid.NewGuid().ToString())
                .Set("companyId", companyId)
                .Build();

            launchDarklyClient.Identify(launchDarklyUserContext);

            var launchDarklyFeatureFlags = launchDarklyClient.AllFlagsState(launchDarklyUserContext);
            var enabledFeatureFlags = GetLaunchDarklyEnabledFeatureFlagsJson(launchDarklyFeatureFlags);
            enabledFeatureFlags.Sort();

            Console.WriteLine(
                $"{DateTime.Now}: runId: {runId}, companyId: {companyId}, enabledFeatureFlags: {JsonSerializer.Serialize(enabledFeatureFlags)}");

            runId++;
        }, cancellationTokenSource.Token);

        task.Wait();
        Thread.Sleep(delayBetweenRequestsMs);
    }
}
kinyoklion commented 1 year ago

Hello @imacleod,

Thank you for reporting this. We will look into it.

Filed internally as 205356

Thank you, Ryan

kinyoklion commented 1 year ago

Hello @imacleod,

When you say After initialization stop redis, what do you mean? Are you running something like a Docker container without any state, or a Redis instance that will have state? If I run your example, simulating an outage, I do not get behavior that matches your description.

If I delete everything in Redis, then I will eventually see no flags, and I will get behavior like you describe. That is different from a loss of connection to a Redis instance that is still properly populated.

The TTL looks to be applying correctly. The TTL is the time from when the SDK reads data from redis until the items it read expire.

So, using your example application:

1. I start a stopwatch and start the application. I see data in the window.
2. I delete everything from Redis, which, if you use a container without any snapshots, would be like restarting the timer.
3. Once my stopwatch hits 20 seconds, which is the cache time configured, I start getting empty sets of flags.

The persistent stores treat the database as the source of truth. If you clear the DB, then the truth is now no flags.

When you change a flag the SDK will add that flag to redis. Now you have 1 flag in redis.

Thank you, Ryan

imacleod commented 1 year ago

Hi @kinyoklion,

When I said "stop redis" yes I am running redis in a docker container and am stopping the container.

When that happens the SDK serves feature flag variations from its in-memory cache for the duration of the TTL, which is expected. What was unexpected was that bringing redis back up well within the cache duration (before the TTL expired) did not prevent an empty list of feature flag variations from being returned from the SDK when the cache did eventually expire.

In this scenario the expected behavior was that the SDK would detect the change in availability of redis and redis would have its feature flag data repopulated from the SDK's in-memory cache when redis came back online. Are you saying that the SDK will not use its in-memory cache to repopulate a data store that has recently recovered from an outage?

The other concern here is that different runs of the provided sample application have produced different results. Some runs have the behavior described in my initial comment, and some runs do not (the SDK continues to serve the correct feature flag state after redis is back online and the cache TTL has expired).

kinyoklion commented 1 year ago

In this scenario the expected behavior was that the SDK would detect the change in availability of redis and redis would have its feature flag data repopulated from the SDK's in-memory cache when redis came back online. Are you saying that the SDK will not use its in-memory cache to repopulate a data store that has recently recovered from an outage?

This is not how the SDK uses stores. It populates a store when it gets an init, it updates elements as it gets upserts, and it reads from the store whenever it needs flag data that has expired from its TTL.

It expects stores to be persistent, and also that stores may have more up-to-date data than its local cache. So if a store is temporarily unavailable, or the store stops and restarts, the store should come back with the data it already had.

If you expect a single SDK instance to have an authoritative in memory cache, then you can set an infinite cache timeout. The behavior in this case is a little different and may be closer to what you are attempting to achieve.

If the value is negative (such as System.Threading.Timeout.InfiniteTimeSpan), data is cached forever (i.e. it will only be read again from the database if the SDK is restarted).
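As a minimal sketch (assuming the same Redis store setup as your sample app), the infinite cache timeout could be configured like this:

var configurationBuilder = Configuration.Builder(sdkKey)
    .DataStore(
        Components.PersistentDataStore(
                Redis.DataStore().HostAndPort("localhost", 6379))
            // Negative cache time (e.g. InfiniteTimeSpan): data is cached forever and is
            // only read from Redis again if the SDK restarts.
            .CacheTime(System.Threading.Timeout.InfiniteTimeSpan));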

Stores can also be populated via relay proxy, in which case the SDK never writes to the store. (Though it still is expected that the store is persistent).

There are only a couple of things I know of that could cause run-to-run variance. If your SDK lost its connection to the stream, it would get a new init when it reconnected and would write that new payload into the store. It also, of course, will write flag updates it receives into the store.

Thank you, Ryan

imacleod commented 1 year ago

The infinite cache timeout does sound closer to the desired behavior. I've seen two incidents where the SDK was serving zero enabled feature flags in spite of a dozen flags being enabled. Logs didn't offer any clues, and the steps I outlined in my first post while running the sample app are the closest that I've gotten to reproducing the incidents.

Since the SDK does not treat the in-memory cache as authoritative I will close this issue. Thanks for the discussion.

imacleod commented 1 year ago

This came up again over the weekend so I am reopening the issue. The SDK design decision to treat the persistent data store (which is an in-memory cache when using redis) as the source of truth is problematic.

Redis instances do not universally use persistence, but from this discussion it sounds like the SDK is expecting them to. That, coupled with the fact that cloud providers can and will restart Redis instances at will, as well as for routine maintenance, makes this an issue.

Because of that, the expected behavior is that the SDK will detect the Redis data store coming back online and repopulate it with the current feature flag state from LaunchDarkly. When a data store goes offline for any amount of time, LaunchDarkly should be the authority on the correct feature flag state, rather than a recently recovered data store.

tkent commented 1 year ago

@kinyoklion

Agreeing with @imacleod, this is a very surprising statement given that Redis is a supported storage provider.

It expects stores to be persistent, and also that stores may have more up-to-date data than its local cache. So if a store is temporarily unavailable, or the store stops and restarts, the store should come back with the data it already had.

Redis is fundamentally not a persistent storage service; it's an in-memory data store. While it happens to have a feature to write its entries to persistent storage, even this is not atomic in the sense of other databases. Some data loss must always be accounted for, even in the fsync-on-every-write case.

If the SDK requires Redis to always have a valid state or, in the event of a failure, to recover back to the last write, then this problem will keep showing up. How often it shows up will depend on the Redis configuration in use, but no configuration eliminates it entirely.

I'd also like to ask about this:

If you expect a single SDK instance to have an authoritative in memory cache, then you can set an infinite cache timeout. The behavior in this case is a little different and may be closer to what you are attempting to achieve.

I believe this scenario is effectively the same as just not specifying any DataStore configuration (so it defaults to InMemoryDataStore), correct? That is, each application instance using the SDK would maintain its own local in-memory cache, all getting updates from the LD APIs independently.

If your suggestion is different than that, can you elaborate a little?

louis-launchdarkly commented 1 year ago

Hello @imacleod and @tkent,

Talking about the simple scenario where there is one SDK running using a single instance of Redis in infinite cache mode, the key differences compared to not using a DataStore are:

kinyoklion commented 1 year ago

@imacleod and @tkent,

@louis-launchdarkly has described the difference in behavior versus running without a store at all.

I agree that this may not fit with expectations. Generally speaking, persistent store configurations are something you should use only for very specific reasons; otherwise the default SDK behavior will be better. They aren't generally an improvement to resilience, and for a single-SDK configuration they are unlikely to improve performance versus just having everything in memory.

It can provide a slight advantage in the case of the infinite TTL, where the SDK can use the last known values instead of having to wait for data from LaunchDarkly.

Using a persistent store with the relay proxy, which keeps the store consistently updated, and then using an SDK in daemon mode, can be a way to reduce connections to LD or reduce network transfers, for instance.
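A minimal sketch of that relay proxy + daemon mode arrangement might look like the following (the host, port, and client setup are placeholders; this assumes the relay proxy is the process writing to Redis):

var config = Configuration.Builder(sdkKey)
    // Daemon mode: the SDK does not connect to LaunchDarkly for flag data;
    // it only reads what the relay proxy has already written to the store.
    .DataSource(Components.ExternalUpdatesOnly)
    .DataStore(
        Components.PersistentDataStore(
            Redis.DataStore().HostAndPort("localhost", 6379)))
    .Build();

var client = new LdClient(config);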

kinyoklion commented 1 year ago

What specifically is the behavior you are after that provides benefit to having a store other than memory?

imacleod commented 1 year ago

@kinyoklion We planned to use the persistent store for redundancy in the event of an issue with the in-memory store.

kinyoklion commented 1 year ago

@imacleod The in-memory store is effectively a dictionary. It is populated by an initial payload and then updated by patches streamed to it. If there is an issue with the memory store, persistence doesn't help that. Likely this would be major memory corruption, and nothing would really help.

Basically it can do these things:

1. It can reduce initialization latency by allowing evaluations before the polling/streaming request has completed. It can also allow operation when the LD connection cannot be established for some reason but previously had been.

2. It can reduce the memory footprint. Using the TTL cache and evaluating single flags, you can have just those flags in memory (see the short sketch after this list). (This use case is effectively eliminated if you ever call allFlags, as that results in greater than baseline memory usage.)

3. The last case where it can be useful is to allow for a specific form of isolated infrastructure: the relay proxy can populate the store and an SDK can read from the store. In this form the process using the SDK and the process populating the store only need shared access to Redis/Dynamo. This can be desirable for some network topologies.
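For context on point 2, evaluating a single flag rather than calling AllFlagsState would look roughly like this (the flag key is a placeholder, reusing the client and context names from the sample app above):

// Only the flag being evaluated needs to be present in the local TTL cache;
// AllFlagsState, by contrast, pulls every flag into memory.
var isEnabled = launchDarklyClient.BoolVariation("my-flag-key", launchDarklyUserContext, false);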

Outside those use cases it would generally not be beneficial. It would just increase complexity and decrease performance.

This isn't the same as, say, a general redis/memcached situation, where redis is basically acting as the storage for a read-through cache.

Thank you, Ryan

imacleod commented 1 year ago

@kinyoklion Thanks for summarizing the abilities of the persistent data store. Some of that information is available in the Persistent data stores docs but not all of it. It would be helpful to update those docs with these considerations.

The infinite cache mode is only mentioned in the Relay proxy docs, and since it can be used without a relay proxy it would be great to have it detailed in the broader Storing data docs.

kinyoklion commented 1 year ago

@imacleod The infinite TTL mode is not broadly supported across the SDK implementations. The stores are currently not as consistent between SDKs as we would like, and we plan to start addressing that, but we do not have an ETA.

Thank you, Ryan

github-actions[bot] commented 9 months ago

This issue is marked as stale because it has been open for 30 days without activity. Remove the stale label or comment, or this will be closed in 7 days.