
mhlabs.feature-toggle.client

What happens if LD fails or gives unexpected latency? #1

Open · ljacobsson opened this issue 4 years ago

ljacobsson commented 4 years ago

In the setup, is the URL in `serviceCollection.AddFeatureToggleClient("https://api.feature-service.com");` meant to be our endpoint at LaunchDarkly or some internal proxy service?

Regardless, that endpoint can fail or, for whatever reason, be slow. We cache the response from it with a default TTL of 60s. If the endpoint starts returning errors, nothing will get cached and every service using this client will start acting up.
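To illustrate, the failure mode could at least be softened by keeping a last-known-good copy next to the TTL'd one; a minimal sketch, assuming a fetch delegate that wraps the HTTP call (names are made up, not this package's actual API):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

// Minimal sketch: keep serving the last known good flags when the upstream
// call fails. The fetch delegate and cache key names are illustrative.
public class ResilientFlagCache
{
    private readonly IMemoryCache _cache = new MemoryCache(new MemoryCacheOptions());
    private readonly Func<Task<Dictionary<string, bool>>> _fetch; // e.g. the HTTP call to the flag service

    public ResilientFlagCache(Func<Task<Dictionary<string, bool>>> fetch) => _fetch = fetch;

    public async Task<Dictionary<string, bool>> GetAsync()
    {
        if (_cache.TryGetValue("flags:fresh", out Dictionary<string, bool> fresh))
            return fresh;
        try
        {
            var flags = await _fetch();
            _cache.Set("flags:fresh", flags, TimeSpan.FromSeconds(60)); // the normal 60s TTL
            _cache.Set("flags:stale", flags);                           // no expiry: last known good
            return flags;
        }
        catch
        {
            if (_cache.TryGetValue("flags:stale", out Dictionary<string, bool> stale))
                return stale; // upstream is failing: degrade to stale flags
            throw;            // never fetched successfully: surface the error
        }
    }
}
```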

I don't think we should have a direct synchronous dependency on a third party (and we should avoid it even with our internal services).

Is there no way to push toggle statuses out via events so each stack can look up its toggles in a local Dynamo table? That would be more scalable and would ensure that we don't hit any request limits on the LD side.
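The read path in that model would be a single local GetItem; a rough sketch (table and attribute names are made up):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

// Rough sketch of the read path: one local GetItem instead of an HTTP hop.
// Table and attribute names are made up.
public class DynamoToggleReader
{
    private readonly IAmazonDynamoDB _dynamo = new AmazonDynamoDBClient();

    public async Task<bool> IsEnabledAsync(string flagKey)
    {
        var response = await _dynamo.GetItemAsync(new GetItemRequest
        {
            TableName = "feature-flags",
            Key = new Dictionary<string, AttributeValue>
            {
                ["flagKey"] = new AttributeValue { S = flagKey }
            }
        });
        // A missing item means the toggle is treated as off, not an error.
        return response.Item.TryGetValue("enabled", out var attr) && attr.BOOL;
    }
}
```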

ljacobsson commented 4 years ago

Ah, I see now that this is meant to be a proxy service, int-feature-flags, which talks to a Dynamo table. How is that table populated? On cold start?

ljacobsson commented 4 years ago

I would opt for a solution where int-feature-flags has the responsibility of keeping the Dynamo table up to date. Each stack that needs feature flags uses this package (refactored to read from Dynamo (<10ms & HA) instead of over HTTP (~50-1000ms depending on cold start on the other end)).

We could create a macro that:

  1. Creates the Dynamo table
  2. Creates a subscription to int-feature-flags-FlagTable's updates
  3. Injects a lambda to populate the table based on events
  4. Injects permissions for lambda functions in the stack to read from said table

A small time investment, but it will make things more resilient.
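For step 3, roughly something like this: a lambda fed by the flag table's DynamoDB stream that mirrors each change into the stack-local table (all names here are assumptions):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;
using Amazon.Lambda.Core;
using Amazon.Lambda.DynamoDBEvents;

// Sketch of macro step 3: mirror changes from the shared flag table's stream
// into this stack's local table. Table and attribute names are assumptions.
public class FlagReplicator
{
    private static readonly IAmazonDynamoDB Dynamo = new AmazonDynamoDBClient();

    public async Task Handler(DynamoDBEvent streamEvent, ILambdaContext context)
    {
        foreach (var record in streamEvent.Records)
        {
            var image = record.Dynamodb.NewImage;
            if (image == null || image.Count == 0) continue; // e.g. REMOVE events

            await Dynamo.PutItemAsync(new PutItemRequest
            {
                TableName = "feature-flags", // the stack-local table from step 1
                Item = new Dictionary<string, AttributeValue>
                {
                    ["flagKey"] = new AttributeValue { S = image["flagKey"].S },
                    ["enabled"] = new AttributeValue { BOOL = image["enabled"].BOOL }
                }
            });
        }
    }
}
```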

mattiasliljenzin commented 4 years ago

> I would opt for a solution where int-feature-flags has the responsibility of keeping the Dynamo table up to date. Each stack that needs feature flags uses this package (refactored to read from Dynamo (<10ms & HA) instead of over HTTP (~50-1000ms depending on cold start on the other end)).
>
> We could create a macro that:
>
> 1. Creates the Dynamo table
> 2. Creates a subscription to int-feature-flags-FlagTable's updates
> 3. Injects a lambda to populate the table based on events
> 4. Injects permissions for lambda functions in the stack to read from said table
>
> A small time investment, but it will make things more resilient.

Some key points:

At the moment, the lambda subscribes to changes via a stream to keep the DynamoDB table up to date. I think this happens in the background, but it's worth checking. If we see signs of performance issues, the first step should be to make our feature flag service read only from Dynamo, and to create a new service that subscribes to flag changes via a webhook or similar and then updates Dynamo. Both would reside in the same stack.

So, let's get some real-life numbers on performance before we do anything. Caching will hopefully take care of most things.

https://docs.launchdarkly.com/docs/using-a-persistent-feature-store
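From those docs, a persistent store plus daemon mode looks roughly like this in the .NET server SDK with the LaunchDarkly.ServerSdk.DynamoDB package; builder names vary between SDK versions and the table name is an assumption, so treat it as a sketch:

```csharp
using LaunchDarkly.Sdk.Server;
using LaunchDarkly.Sdk.Server.Integrations;

// Sketch of the persistent feature store from the docs linked above.
// Builder names vary between SDK versions; the table name is an assumption.
var config = Configuration.Builder("sdk-key")
    .DataStore(
        Components.PersistentDataStore(DynamoDb.DataStore("feature-flags"))
                  .CacheSeconds(30))
    // Daemon mode: evaluate purely from Dynamo and never call LD directly;
    // something else (e.g. the Relay Proxy) keeps the table up to date.
    .DataSource(Components.ExternalUpdatesOnly)
    .Build();

var client = new LdClient(config);
```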

ljacobsson commented 4 years ago

> Our own feature flag service is caching the flags for 30 seconds

Are uncached queries always going to DynamoDB and never to LD? If so, I think that's OK as long as it purely acts as a relay.

> The C#-client caches the response for 30 seconds

The user key is part of the cache key, so in reality there will only be cache hits if the same member accesses the same endpoint twice within a 30s interval?

> The cold start is quite low, around 400-500 ms

I guess the issue is if LD.init() accesses LD over the internet and LD is having issues (which we have to expect). If LD.init() fails, the lambda will fail, the execution environment will be dumped, and that will set off a series of cold starts.
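One mitigation sketch: cap how long init may block and fall back to a default instead of letting a failed init crash the lambda (exact member names depend on the SDK version):

```csharp
using System;
using LaunchDarkly.Sdk;
using LaunchDarkly.Sdk.Server;

// Sketch: cap the blocking wait on init and fall back to a safe default.
// Exact member names depend on the SDK version.
var config = Configuration.Builder("sdk-key")
    .StartWaitTime(TimeSpan.FromSeconds(1)) // stop blocking after 1s
    .Build();

var client = new LdClient(config); // returns even if LD is unreachable

var user = User.WithKey("member-123");
bool enabled = client.Initialized
    && client.BoolVariation("my-flag", user, false); // false = safe default
```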

> The average call is around 60-70 ms from client -> service

That's quite a lot of extra latency.

> At the moment, the lambda subscribes to changes via a stream to keep the DynamoDB table up to date. I think this happens in the background, but it's worth checking.

This will not work. I haven't dug deep, but it looks like the code here and here will, as you say, listen to updates on the LD side and populate the Dynamo table in the background. However, after the ~10ms execution, the lambda environment gets frozen until the next invocation, so no updates can reach the table. To me it looks like the data subscription is much better suited to a long-running container. Maybe one for Fargate?

Can't we just have a scheduled task that gets all flags once per minute and then propagates the data out to the services, and by doing that decrease the need for caching and thus get faster toggle updates?
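A sketch of that scheduled task, assuming an EventBridge rate(1 minute) trigger and LD's REST API; the project key, token handling and flag parsing are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

// Sketch of a scheduled poller (EventBridge rate(1 minute) -> this handler).
// Project key, token source and table name are placeholders.
public class FlagPoller
{
    private static readonly HttpClient Http = new HttpClient();
    private static readonly IAmazonDynamoDB Dynamo = new AmazonDynamoDBClient();

    public async Task Handler()
    {
        Http.DefaultRequestHeaders.Remove("Authorization");
        Http.DefaultRequestHeaders.Add("Authorization",
            Environment.GetEnvironmentVariable("LD_API_TOKEN"));

        // LD REST API: list all flags for a project.
        var json = await Http.GetStringAsync(
            "https://app.launchdarkly.com/api/v2/flags/my-project");

        using var doc = JsonDocument.Parse(json);
        foreach (var flag in doc.RootElement.GetProperty("items").EnumerateArray())
        {
            await Dynamo.PutItemAsync(new PutItemRequest
            {
                TableName = "feature-flags",
                Item = new Dictionary<string, AttributeValue>
                {
                    ["flagKey"] = new AttributeValue { S = flag.GetProperty("key").GetString() },
                    // Simplified: real flag state is per-environment and richer than on/off.
                    ["enabled"] = new AttributeValue
                    {
                        BOOL = flag.GetProperty("environments")
                                   .GetProperty("production")
                                   .GetProperty("on").GetBoolean()
                    }
                }
            });
        }
    }
}
```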

mattiasliljenzin commented 4 years ago

> > Our own feature flag service is caching the flags for 30 seconds
>
> Are uncached queries always going to DynamoDB and never to LD? If so, I think that's OK as long as it purely acts as a relay.
>
> > The C#-client caches the response for 30 seconds
>
> The user key is part of the cache key, so in reality there will only be cache hits if the same member accesses the same endpoint twice within a 30s interval?
>
> > The cold start is quite low, around 400-500 ms
>
> I guess the issue is if LD.init() accesses LD over the internet and LD is having issues (which we have to expect). If LD.init() fails, the lambda will fail, the execution environment will be dumped, and that will set off a series of cold starts.
>
> > The average call is around 60-70 ms from client -> service
>
> That's quite a lot of extra latency.
>
> > At the moment, the lambda subscribes to changes via a stream to keep the DynamoDB table up to date. I think this happens in the background, but it's worth checking.
>
> This will not work. I haven't dug deep, but it looks like the code here and here will, as you say, listen to updates on the LD side and populate the Dynamo table in the background. However, after the ~10ms execution, the lambda environment gets frozen until the next invocation, so no updates can reach the table. To me it looks like the data subscription is much better suited to a long-running container. Maybe one for Fargate?
>
> Can't we just have a scheduled task that gets all flags once per minute and then propagates the data out to the services, and by doing that decrease the need for caching and thus get faster toggle updates?

You can define a webhook that listens for changes, but only one webhook, not one per environment. So maybe the Fargate alternative is better. Or we could use read-through in the test environment, since it's only Test, and use the Relay in production. That would actually be the quickest and easiest alternative.

It seems a bit overkill to create one DynamoDB table for each service that will use feature toggles. We should at least try this before taking the next step.

Also, I know that the caching is per member, but our feature service uses an in-memory cache, so the only cost is the internal API call between the client and the service, which is totally fine imo.

ljacobsson commented 4 years ago

> You can define a webhook that listens for changes, but only one webhook, not one per environment. So maybe the Fargate alternative is better. Or we could use read-through in the test environment, since it's only Test, and use the Relay in production. That would actually be the quickest and easiest alternative.

I think we should have the same setup in test and prod. I think Fargate is overkill. Why can't we use multiple webhooks? The docs say we can. A pricing thing? If for whatever reason we can only use one, then we could do the environment routing on our side.

> It seems a bit overkill to create one DynamoDB table for each service that will use feature toggles. We should at least try this before taking the next step.

Not if we make it seamless using a macro.

> Also, I know that the caching is per member, but our feature service uses an in-memory cache, so the only cost is the internal API call between the client and the service, which is totally fine imo.

I think we should favour performance and HA. An internet round trip should be avoided for this imo.

I think the event-driven approach is a better serverless fit, more resilient, and as easy to implement as a polling approach. A start could be to populate the table using webhooks.
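A sketch of that webhook receiver behind API Gateway; LD's webhook payload is an audit-log entry rather than a flag dump, so the two helpers at the bottom are placeholders:

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;
using Amazon.Lambda.APIGatewayEvents;
using Amazon.Lambda.Core;

// Sketch of a webhook receiver behind API Gateway that writes flag state
// to the shared table. The parsing helpers are placeholders.
public class FlagWebhookHandler
{
    private static readonly IAmazonDynamoDB Dynamo = new AmazonDynamoDBClient();

    public async Task<APIGatewayProxyResponse> Handler(
        APIGatewayProxyRequest request, ILambdaContext context)
    {
        using var payload = JsonDocument.Parse(request.Body);
        var flagKey = ExtractFlagKey(payload.RootElement);
        var enabled = await FetchCurrentState(flagKey);

        await Dynamo.PutItemAsync(new PutItemRequest
        {
            TableName = "feature-flags",
            Item = new Dictionary<string, AttributeValue>
            {
                ["flagKey"] = new AttributeValue { S = flagKey },
                ["enabled"] = new AttributeValue { BOOL = enabled }
            }
        });
        return new APIGatewayProxyResponse { StatusCode = 200 };
    }

    // Placeholder: the real payload is LD's audit-log entry, not this shape.
    private static string ExtractFlagKey(JsonElement body) =>
        body.GetProperty("flagKey").GetString(); // assumed field name

    // Placeholder: in practice, re-read the flag's state from the REST API.
    private static Task<bool> FetchCurrentState(string flagKey) =>
        Task.FromResult(true);
}
```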

mattiasliljenzin commented 4 years ago

> > You can define a webhook that listens for changes, but only one webhook, not one per environment. So maybe the Fargate alternative is better. Or we could use read-through in the test environment, since it's only Test, and use the Relay in production. That would actually be the quickest and easiest alternative.
>
> I think we should have the same setup in test and prod. I think Fargate is overkill. Why can't we use multiple webhooks? The docs say we can. A pricing thing? If for whatever reason we can only use one, then we could do the environment routing on our side.
>
> > It seems a bit overkill to create one DynamoDB table for each service that will use feature toggles. We should at least try this before taking the next step.
>
> Not if we make it seamless using a macro.
>
> > Also, I know that the caching is per member, but our feature service uses an in-memory cache, so the only cost is the internal API call between the client and the service, which is totally fine imo.
>
> I think we should favour performance and HA. An internet round trip should be avoided for this imo.
>
> I think the event-driven approach is a better serverless fit, more resilient, and as easy to implement as a polling approach. A start could be to populate the table using webhooks.

Ah, I did not know you could define a webhook per environment. That solves the problem :) Let's start with this, and if we see performance issues we can discuss the next step. If we do it, that would need to be a separate project or GitHub repo to keep things on the SRP side.