redis / go-redis

Redis Go client
https://redis.uptrace.dev
BSD 2-Clause "Simplified" License
20.13k stars 2.37k forks source link

Ring sharding issues after v9.3.1 #2877

Open dgreif opened 9 months ago

dgreif commented 9 months ago

Issue tracker is used for reporting bugs and discussing new features. Please use stackoverflow for supporting issues.

After updating from v9.3.0 to v9.3.1, we started seeing issues where a pipeline to create a Redis Stream is no longer properly setting up the Stream. This issue persists on v9.4.0

Expected Behavior

Redis Ring should be able to run a pipeline to create a stream, then read from the same stream.

Current Behavior

Redis Ring sporadically fails. We run 80+ servers and a small percentage (1-5%) of them seem to write/read from mismatched shards.

Steps to Reproduce

  1. Set up a Ring shard
  2. Run an xgroup command and notice the firstKeyPosition is 1 instead of 2

Context (Environment)

We are running Redis Ring with 3 shards, with 80+ servers reading from them.

Detailed Description

I went through the v9.3.1 release one commit at a time to see which produced the issue, and I landed on https://github.com/redis/go-redis/commit/8c695488a247283d92e4d03c1774d9e2c4583244 as the source of the issue (cc @ofekshenawa as this was from your PR). Once I pinned that down, I focused in on some specific commands we are having issues with:

  1. Pipeline (These are used to create a Stream and set an expiration on it. This is a bit of a hack to get an empty stream created.)
    • xgroup create
    • xgroup destroy
    • expire
  2. exists (this checks to verify the stream exists and isn't expired)

After the changes from https://github.com/redis/go-redis/commit/8c695488a247283d92e4d03c1774d9e2c4583244, these commands execute without errors, but the exists check returns false on a small percentage of our servers. To me, the most likely reason is that the stream was created on a different shard than it's being read from.

I did a little digging locally and noticed that the changes to cmdFirstKeyPos do indeed change the first key position for the xgroup command. When it checked from CommandInfo, it got a key position of 2, but with the removal of CommandInfo, it now receives a key position of 1 (the default return value).

Possible Implementation

In terms of a fix, we could either bring back the CommandInfo being passed in to cmdFirstKeyPos, or we could add another case to the switch within cmdFirstKeyPos to explicitly handle the xgroup command and return 2

dgreif commented 8 months ago

@ofekshenawa any thoughts on the best way to proceed here? Happy to open a PR, just not sure the right way to go in terms of reverting your changes vs making an exception for the xgroup command

notcool11 commented 5 months ago

@dgreif , any progress here? Maybe the issue here is related? https://github.com/redis/go-redis/issues/3009

There certainly seems to be something broken with Rings + Pipelines. Thanks!

dgreif commented 5 months ago

@notcool11 unfortunately I haven't made any progress here due to lack of response from the project maintainers. I'd be happy to open a PR, but I'm not positive what the best approach would be. Looking through your issue, it does sound very similar, especially the idea of data simply not showing up in the shard, likely because it's in a different shard than expected