mailgun / groupcache

Clone of golang/groupcache with TTL and Item Removal support
Apache License 2.0
484 stars 72 forks source link

Distributed deadlock possible when peer lists are out of sync #72

Open justinrixx opened 2 weeks ago

justinrixx commented 2 weeks ago

Hi there.

This is an edge case I ran into which can cause a specific key to hang forever without resolving. The peer list is a consistent hash based on pool membership. When pool membership changes, who owns which key may change. This can happen in a lot of scenarios: a rolling restart as part of a deployment, scaling out/in, etc. If a request comes in at just the right time and you're unlucky, it may go something like this:

The solution to this is to be a little bit smarter when handling crosstalk requests. As written, the code prioritizes making the request to the correct peer over all else. Instead, I propose adding an additional rule: if a request for an object came in on the crosstalk endpoint, it's because a peer believes I own the object. Whether I believe I know who owns it or not, I should not make another crosstalk request.

Or more succinctly: requests which came in on the crosstalk ingress must not make egress crosstalk requests.

Please have a look at this readme and code where I have written up a workaround for this issue.

I propose a fix be put into place within this project though so that the workaround is not needed. One idea is to have an alternative version of group.Get() which takes a boolean flag to control whether egress crosstalk requests are allowed. That alternative version would be called in the HTTP handler here.

If the proposed fix sounds reasonable and worth pursuing to you, I would be willing to put together a PR.

Please feel free to ask anything you need clarification on. This edge case had me pulling my hair out and I'd like to save others the pain. The original golang/groupcache project is not active, and this is the most active fork, which is why I've come here.