redis / lettuce

Advanced Java Redis client for thread-safe sync, async, and reactive usage. Supports Cluster, Sentinel, Pipelining, and codecs.
https://lettuce.io
MIT License
5.31k stars 949 forks source link

Flaky test pingNotAllowedInSubscriptionState #2322

Open sancar opened 1 year ago

sancar commented 1 year ago

Bug Report

Current Behavior

Test pingNotAllowedInSubscriptionState fails randomly in our tests against Upstash Redis. After investigating if our implementation of Redis has a problem, we have concluded that the problem in lettuce side.

Test failure happens in the following assertion

  assertThatThrownBy(() -> TestFutures.getOrTimeout(pubsub.echo("ping"))).isInstanceOf(RedisException.class)
                .hasMessageContaining("not allowed");

Cause Of the Problem

The test fails because of a race condition between two threads. More specifically: The future for subscribe is completed after we are updating the internal state of PubsubEndpoint(adding a new channel to the channels set). Because of that, in some unlucky timing, the test (PubSubCommandHandler#notifyPushListeners) fails.

  1. subscribe future is completed.
  2. new channel is not put into the channels map in PubsubEndpoint
  3. Test continues to send an echo to the server.
  4. Client side for PubSubEndpint#isSubscribed returns false because channels and patterns are empty.
  5. Server runs the command and does not return an error because the client is talking with resp3 and all commands are allowed.
  6. Even if we were talking in resp2 test would fail because of the error message of the exception.

Possible Solution

~Remove local checks for allowed commands and rely on the exception from redis.~ ~This way, we can both remove the bug and simplify the library.~ ~More details in pr that I have prepared for the proposed solution:~ ~https://github.com/lettuce-io/lettuce-core/pull/2314~ This solution is eliminated because the local checks are needed see https://github.com/lettuce-io/lettuce-core/issues/579

tishun commented 1 week ago

Is this issue still relevant?

sancar commented 1 week ago

Hi @tishun,

Is this a question for me? I'll need to read the description again(I probably already forgot)and check the lettuce code to see if there have been any changes. If you're one of the maintainers, I'd appreciate it if you could handle that. I'll try to look into it if I get some free time next week, but I can't make any promises.

In our test suite (which includes a copy of your test suite) against our Redis implementation, we decided to skip that test and moved on.

tishun commented 1 week ago

Hi @tishun,

Is this a question for me?

Hey, sorry for not addressing you directly, yes, it was meant for you.

If you're one of the maintainers, I'd appreciate it if you could handle that.

Trying to wrap my head around this. What you are saying is that there is some delay between the time you subscribe to a channel and the time the logic that disables sending commands kicks in. As a result sometimes you would get the RedisException and some times you wont. Is that mostly correct?

sancar commented 1 week ago

Yes, it sounds correct. If the codebase is still same, channels map in PubsubEndpoint is updated in the background in the background after the subscribe future is completed. Depending on whether that works first , or the test does pubsub.echo("ping") first(which checks PubSubEndpint#isSubscribed internally), the exception behavior changes.

tishun commented 6 days ago

Yes, it sounds correct. If the codebase is still same, channels map in PubsubEndpoint is updated in the background in the background after the subscribe future is completed. Depending on whether that works first , or the test does pubsub.echo("ping") first(which checks PubSubEndpint#isSubscribed internally), the exception behavior changes.

Well since your report quite some things have changed in the code including the test case itself.

2594 changed how RESP3 is handled, and even before that #1601 changed the way this test works.

I suggest that you try and update your code with the latest changes and see if you have issues with them. Our CI/CD runs this test daily and we haven't caught any issues.

sancar commented 6 days ago

Ok, thanks for the update. We will do so.