mailgun / groupcache

Clone of golang/groupcache with TTL and Item Removal support
Apache License 2.0

feat: add ability to clear cache #54

Open ct16k opened 1 year ago

ct16k commented 1 year ago

While the LRU package has the ability to purge all items from the cache, this functionality was not available to ProtoGetter, making it impossible to clear the cache without restarting all peers. This change adds a Clear() method to ProtoGetter, which enables clearing the cache with no downtime.
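
As a rough sketch of the idea (the names below are illustrative, not the actual groupcache code): the peer interface gains a Clear() call, and a cluster-wide clear purges the local cache and then fans out to every peer.

```go
package clearsketch

import (
	"context"
	"fmt"
)

// peerClearer is a hypothetical, trimmed-down stand-in for the peer
// interface (ProtoGetter); only the proposed call is shown.
type peerClearer interface {
	// Clear asks a single peer to drop every item from its cache.
	Clear(ctx context.Context) error
}

// clearAll purges the local cache first, then asks every peer to do the
// same, so the whole cluster is emptied without restarting any process.
func clearAll(ctx context.Context, clearLocal func(), peers []peerClearer) error {
	clearLocal()
	for _, p := range peers {
		if err := p.Clear(ctx); err != nil {
			return fmt.Errorf("clearing peer: %w", err)
		}
	}
	return nil
}
```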

thrawn01 commented 1 year ago

Thanks for the PR! What is the use case for this feature? We already have remove functionality, why would we need to purge the entire cache?

We should note in the documentation for this feature that this is a point-in-time purge: a value from before the purge could survive if another instance in the cluster still holds it but hasn't been purged yet, and that value is then requested by an instance that has already been purged. If this functionality is expected to purge all values with no chance of pre-purge values re-propagating, then a different solution is needed.

This can be confusing, so here is a sequence diagram

```mermaid
sequenceDiagram
    actor User
    User->>Node1: Purge
    Node1->>Node2: Purge
    Node1->>Node3: Request a value from Node3
    Node1->>Node3: Purge
```

In this way, a value from Node3 survives the purge.

In the case of a single Remove() call, we delete the value from the value owner first, then tell all others to clear the value.
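
For illustration, that ordering looks roughly like this (a sketch with made-up names, not the actual groupcache implementation):

```go
package removesketch

import "context"

// peerRemover is a hypothetical stand-in for the peer interface; only the
// remove call is shown.
type peerRemover interface {
	Remove(ctx context.Context, key string) error
}

// removeEverywhere deletes a key cluster-wide: the owner of the key is
// cleared first, then every other peer drops its hot-cache copy, so a
// stale copy can't refill the owner after the owner has been cleared.
func removeEverywhere(ctx context.Context, key string, owner peerRemover, others []peerRemover) error {
	if err := owner.Remove(ctx, key); err != nil {
		return err
	}
	for _, p := range others {
		if err := p.Remove(ctx, key); err != nil {
			return err
		}
	}
	return nil
}
```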

ct16k commented 1 year ago

My use case is related to mistakes I've made in past lives with caching layers, either by setting a longer TTL on a key than intended or by poisoning the cache by choosing the wrong hash key. While the first scenario can be mitigated with the current remove functionality, if there are too many keys to remove, the ability to flush the cache completely without restarting the service is a useful tool to have around. As for the second scenario, there's no way around it other than flushing the cache.

To mitigate an old value surviving the purge, I've added a new field I've uninspiredly called generation. The server will honor requests only if they're coming from a client on the same generation. I'm not 100% certain the way I implemented it doesn't still allow keys to transcend a flush, so I'm trying to write a proper test for this. And it may be that the extra field and complexity are not worth it. It's as if cache invalidation is a hard problem. :)
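
Roughly, the idea is the following (my own toy naming here, not the code in this PR): every request carries the client's generation, the server only serves from its cache when the generations match, and a purge bumps the generation.

```go
package gensketch

// cache is a toy stand-in for a peer's local cache, tracking the generation
// it was last purged into. Real code would need locking; this only shows
// the generation check.
type cache struct {
	generation uint64
	items      map[string][]byte
}

// get serves from cache only when the requester's generation matches;
// otherwise the caller falls back to the getter (the origin of the data).
func (c *cache) get(key string, reqGeneration uint64) ([]byte, bool) {
	if reqGeneration != c.generation {
		return nil, false // mismatched generation: never serve pre-purge data
	}
	v, ok := c.items[key]
	return v, ok
}

// purge drops everything and bumps the generation, so peers still on the
// old generation can no longer pull pre-purge values from this cache.
func (c *cache) purge() {
	c.items = make(map[string][]byte)
	c.generation++
}
```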

thrawn01 commented 1 year ago

Sorry for not coming back to this PR for a while; you wouldn't believe the 2023 I've had so far 😭

I think I follow the use case of the generation.

My understanding is that the generation is used when getting a value? For example, if the generation from the requestor doesn't match the generation of the peer, then the value should not be retrieved from the peer's cache but should instead be fetched by the getter. This prevents a peer that has already been purged from re-propagating data from a cache that is on an older generation.

However, what happens if there is a peer in the cluster that is on the wrong generation? In that case, all requests to that peer will always be served by the getter and never from the cache. There is no mechanism to ensure that all peers in the cluster are on the same generation until a purge is requested. For example, suppose a purge is requested in an existing cluster and everyone is now on generation 1, and then a new peer is added to the cluster. The new peer will be on generation 0 and has no idea that everyone else is on generation 1.

I understand the use case for this type of thing, but these are hard problems to solve. It's one of the reasons Brad originally made this library very simple, as opposed to his previous work on memcached. We get away with Set() and Remove() because we added TTL, which means the cluster eventually becomes consistent again. Synchronization is a hard problem to solve when performance matters.

The best thing to do when we pollute the cache with bad values is to redeploy the entire application.

I could be convinced we should merge this feature anyway (without the generation stuff), as long as we CLEARLY document that this purge is not guaranteed to purge all cache values in flight. It might be useful for an operator to send purge requests multiple times in hopes of purging in-flight data, or for the application to pause processing while the purge completes.
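
For example, an operator-side mitigation could be as simple as repeating the purge with a short pause in between (a sketch assuming some Clear-style call exists on the group; the names are illustrative):

```go
package purgesketch

import (
	"context"
	"time"
)

// purgeHard issues several purges with a short pause between them, to catch
// values that were in flight between peers during the first pass. It is a
// best-effort mitigation, not a guarantee.
func purgeHard(ctx context.Context, clear func(context.Context) error, rounds int, wait time.Duration) error {
	for i := 0; i < rounds; i++ {
		if err := clear(ctx); err != nil {
			return err
		}
		select {
		case <-time.After(wait):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}
```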

Thoughts?

thrawn01 commented 1 year ago

I know it's been a while, but there is renewed interest in this feature. I'm not sure support for "generations" is needed. In the case I'm thinking about, we could just purge the cache multiple times and hope we got everything. 😆

Also, it would make implementation simpler.

@ct16k Are you still interested in this PR? If not, we might open a new PR with your changes and credit you.