tompave / fun_with_flags

Feature Flags/Toggles for Elixir
https://hexdocs.pm/fun_with_flags/FunWithFlags.html
MIT License

Avoid Thundering Herd on Rollout of new Release #177

Open probably-not opened 2 months ago

probably-not commented 2 months ago

When a rollout/deployment occurs, the feature flags that are cached in ETS are all flushed (as a new node will start up without the same data). This leads to a Thundering Herd situation, as all of the requests to the new node make requests to the persistence adapter (until the cache fills once more).

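This can be solved in a few different ways:

  • Pre-filling the cache: on startup, allow a configuration that will pre-fill the ETS cache in memory. This will ensure that the cache is already full after startup completes. This could cause bloated memory, if for example the feature flags table contains a lot of old flags that are no longer in use.

  • Single-flight mechanics: Using a single-flight mechanism to ensure that only one request per key to the persistence adapter is made at any given time. This is fairly easy to implement with a GenServer + handle_call + GenServer.reply, i.e. the persistence_adapter().get() call would be wrapped in a GenServer to ensure that only one call is running at any given time.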

tompave commented 1 month ago

Hey, thank you for using the library and raising this point.

You're making a fair observation. And this is also related to this long-standing todo item in the readme:

  • Add some optional randomness to the TTL, so that Redis or the DB don't get hammered at constant intervals after a server restart.

Both of your suggestions have merits.

  • Pre-filling the cache: on startup, allow a configuration that will pre-fill the ETS cache in memory. This will ensure that the cache is already full after startup completes. This could cause bloated memory, if for example the feature flags table contains a lot of old flags that are no longer in use.

This is a simple and effective solution, and as far as I know it's something that applications are already doing when encountering the problem you describe. That's because this doesn't need to be part of FWF itself, and it can be done in application code. The pattern can even be generalized and extracted as a 3rd party extension to FWF, and published on Hex.
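To make the application-code approach concrete, here is a minimal sketch of a cache warmer. The module name `FlagCacheWarmer` is hypothetical and not part of FWF. The two functions are injected so the sketch stays self-contained: one lists the flag names, the other performs a normal lookup for each name.

```elixir
defmodule FlagCacheWarmer do
  @moduledoc """
  Hypothetical helper (not part of FunWithFlags): looks up every known
  flag once, so that later lookups can be served from the cache.
  """

  # `list_names` is expected to return `{:ok, names}`.
  # `lookup` is called once per name; its return value is ignored.
  def warm(list_names, lookup)
      when is_function(list_names, 0) and is_function(lookup, 1) do
    # Any non-matching value (e.g. `{:error, reason}`) is returned as-is,
    # so an unreachable store does not crash the caller.
    with {:ok, names} <- list_names.() do
      Enum.each(names, lookup)
      {:ok, length(names)}
    end
  end
end
```

In an application this could be run once after the FWF supervision tree has started, for example as `FlagCacheWarmer.warm(&FunWithFlags.all_flag_names/0, &FunWithFlags.enabled?/1)`. That usage assumes a regular lookup populates the ETS cache on a miss, and it warms every flag in the store, including old ones that are no longer used.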

It still doesn't entirely remove the problem though: all pre-filled flags will have the same TTL, so they'll all expire at roughly the same time. That only matters, of course, if a large enough number of different flags is queried frequently enough to cause a problem, let's say during the lifecycle of different web requests in a high-traffic application.
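As a rough illustration of the "randomness in the TTL" idea, the sketch below computes a per-entry TTL with a random offset. The module `JitteredTTL` and its `max_jitter` parameter are hypothetical; this is not an existing FWF option, only one way such a value might be derived.

```elixir
defmodule JitteredTTL do
  @moduledoc "Hypothetical sketch: a TTL in seconds with a random offset."

  # Returns `base_ttl` plus or minus up to `max_jitter` seconds, never below 1.
  def ttl(base_ttl, max_jitter)
      when is_integer(base_ttl) and base_ttl > 0 and
             is_integer(max_jitter) and max_jitter >= 0 do
    # :rand.uniform(n) returns an integer in 1..n, so the offset
    # falls in the range -max_jitter..max_jitter.
    offset = :rand.uniform(2 * max_jitter + 1) - max_jitter - 1
    max(base_ttl + offset, 1)
  end
end
```

With a base of 900 seconds and a jitter of 90, `JitteredTTL.ttl(900, 90)` yields values between 810 and 990, so entries written at the same moment no longer expire together.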

  • Single-flight mechanics: Using a single-flight mechanism to ensure that only one request per key to the persistence adapter is made at any given time. This is fairly easy to implement with a GenServer + handle_call + GenServer.reply, i.e. the persistence_adapter().get() call would be wrapped in a GenServer to ensure that only one call is running at any given time.

This is interesting.

I don't think I've considered it before because it is possible to disable the ETS cache, and I wouldn't want to put such a bottleneck in front of the persistence adapter all the time. If we went with this solution, it would have to be conditional on the ETS cache being enabled.

But it still sounds like something that can be done with a custom persistence adapter. Since adapters have a consistent interface, it should be possible to implement a generic enough "persistence adapter proxy", or perhaps middleware, to do what you describe before forwarding calls to the underlying actual adapter.
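For reference, the single-flight idea could look roughly like the sketch below. `SingleFlight` is a hypothetical module, not part of FWF, and `fetch_fun` stands in for whatever call reaches the underlying adapter. Unlike a plain blocking `handle_call`, this variant runs the fetch in a separate process, so only calls for the same key are coalesced and different keys are not serialized behind each other.

```elixir
defmodule SingleFlight do
  @moduledoc "Hypothetical sketch: coalesces concurrent fetches of the same key."
  use GenServer

  def start_link(fetch_fun) when is_function(fetch_fun, 1) do
    GenServer.start_link(__MODULE__, fetch_fun)
  end

  # Blocks until the (possibly shared) fetch for `key` completes.
  def get(server, key, timeout \\ 5_000) do
    GenServer.call(server, {:get, key}, timeout)
  end

  @impl true
  def init(fetch_fun), do: {:ok, %{fetch_fun: fetch_fun, waiting: %{}}}

  @impl true
  def handle_call({:get, key}, from, %{waiting: waiting} = state) do
    case waiting do
      %{^key => callers} ->
        # A fetch for this key is already in flight: only queue the caller.
        {:noreply, %{state | waiting: Map.put(waiting, key, [from | callers])}}

      _ ->
        # First caller for this key: start the fetch outside this process.
        server = self()
        fetch_fun = state.fetch_fun

        Task.start(fn ->
          result =
            try do
              fetch_fun.(key)
            rescue
              # Report failures to the waiting callers instead of
              # leaving them blocked until their call times out.
              error -> {:error, error}
            end

          send(server, {:fetched, key, result})
        end)

        {:noreply, %{state | waiting: Map.put(waiting, key, [from])}}
    end
  end

  @impl true
  def handle_info({:fetched, key, result}, %{waiting: waiting} = state) do
    # Reply to every caller that queued up while the fetch was running.
    {callers, rest} = Map.pop(waiting, key, [])
    Enum.each(callers, &GenServer.reply(&1, result))
    {:noreply, %{state | waiting: rest}}
  end
end
```

A proxy adapter could then implement its own `get` by calling `SingleFlight.get(server, flag_name)` and delegate every other callback straight to the real adapter. How the proxy is wired into the configuration is left open here.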

As I'm always a fan of keeping FWF simple, I'd be inclined to see this too as an extension of the library on Hex, rather than something that is part of FWF itself.