redis / lettuce

Advanced Java Redis client for thread-safe sync, async, and reactive usage. Supports Cluster, Sentinel, Pipelining, and codecs.
https://lettuce.io
MIT License

ValueOutput does not support set(long) #2175

Open grzechum opened 2 years ago

grzechum commented 2 years ago

Bug Report

Current Behavior

We have a lot of microservices which use Lettuce as their Redis client. Sometimes (once every 1-2 weeks) Lettuce on one of the instances starts to raise exceptions like this:

Stack trace:

```java
java.lang.UnsupportedOperationException: io.lettuce.core.output.ValueOutput does not support set(long)
    at io.lettuce.core.output.CommandOutput.set(CommandOutput.java:107)
    at io.lettuce.core.protocol.RedisStateMachine.safeSet(RedisStateMachine.java:778)
    at io.lettuce.core.protocol.RedisStateMachine.handleInteger(RedisStateMachine.java:404)
    at io.lettuce.core.protocol.RedisStateMachine$State$Type.handle(RedisStateMachine.java:206)
    at io.lettuce.core.protocol.RedisStateMachine.doDecode(RedisStateMachine.java:334)
    at io.lettuce.core.protocol.RedisStateMachine.decode(RedisStateMachine.java:295)
    at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:842)
    at io.lettuce.core.protocol.CommandHandler.decode0(CommandHandler.java:793)
    at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:767)
    at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:659)
    at io.lettuce.core.protocol.CommandHandler.channelRead(CommandHandler.java:599)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:487)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:385)
    at io.netty.util.concurrent.SingleThreadEve...
```

What is even worse, in this state Lettuce mixes up the values returned for different keys, e.g. the value of key-A is returned as the result of a GET on key-B. I have not been able to find any cause so far, and there is no way to reproduce it easily; it just happens from time to time on one of the instances, and only killing the instance helps.

Input Code

Only simple reactive API: SET/GET/DEL
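
For context, the access pattern described above amounts to something like the following sketch (illustrative only, not the reporter's code; it assumes a standalone node at redis://localhost rather than the reporters' actual deployments):

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.reactive.RedisReactiveCommands;

public class SimpleReactiveUsage {

    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost");
        StatefulRedisConnection<String, String> connection = client.connect();
        RedisReactiveCommands<String, String> reactive = connection.reactive();

        // Only SET/GET/DEL are used, via the reactive API
        reactive.set("key-A", "value-A")
                .then(reactive.get("key-A"))
                .doOnNext(value -> System.out.println("GET key-A -> " + value))
                .then(reactive.del("key-A"))
                .block(); // block only to keep this demo simple

        connection.close();
        client.shutdown();
    }
}
```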

Expected behavior/code

Environment

jkern888 commented 2 years ago

Heads up that we were able to effectively work around this issue by increasing our timeout. I guess that lowered the probability of getting into this state enough that we haven't hit it again
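
For anyone looking for where that timeout lives: the command timeout can be set on the RedisURI or enforced per command via TimeoutOptions. A minimal sketch (the 30-second value is only an illustration, not the value jkern888 used):

```java
import java.time.Duration;

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.TimeoutOptions;

public class TimeoutSketch {

    public static void main(String[] args) {
        // Default command timeout carried by the connection URI
        RedisURI uri = RedisURI.builder()
                .withHost("localhost")
                .withTimeout(Duration.ofSeconds(30))
                .build();

        RedisClient client = RedisClient.create(uri);

        // Additionally enforce a per-command timeout on the client
        client.setOptions(ClientOptions.builder()
                .timeoutOptions(TimeoutOptions.enabled(Duration.ofSeconds(30)))
                .build());
    }
}
```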

grzechum commented 2 years ago

Thank you. We do not use custom codecs, so this has to be triggered by a timeout.

shuiky commented 1 year ago

Thank you for reporting this issue. I have the same problem (e.g. the value of key-A is returned as the result of a GET on key-B). Do you have any solution?

jiayuzhang89 commented 1 year ago

We encountered similar command-out-of-sync issues on multiple instances in our cluster, and we don't have a custom codec. Looking back at the error logs, we found that most instances threw a CancellationException at the beginning and then gradually started to see more UnsupportedOperationExceptions. We had to restart those instances to recover.

Our environment: Lettuce 6.2.1, Redis Cluster 6.

java.util.concurrent.CompletionException: java.util.concurrent.CancellationException
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1739)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.CancellationException
    at java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2396)
    at io.lettuce.core.protocol.AsyncCommand.cancel(AsyncCommand.java:146)
    at io.lettuce.core.protocol.AsyncCommand.cancel(AsyncCommand.java:180)
    at io.lettuce.core.protocol.CommandWrapper.cancel(CommandWrapper.java:111)
    at io.lettuce.core.protocol.CommandWrapper.cancel(CommandWrapper.java:111)
    at io.lettuce.core.protocol.CommandHandler.cancelCommands(CommandHandler.java:958)
    at io.lettuce.core.protocol.CommandHandler.reset(CommandHandler.java:938)
    at io.lettuce.core.protocol.CommandHandler.userEventTriggered(CommandHandler.java:257)
    at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
    at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:332)
    at io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:324)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1428)
    at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
    at io.netty.channel.AbstractChannelHandlerContext.access$500(AbstractChannelHandlerContext.java:61)
    at io.netty.channel.AbstractChannelHandlerContext$6.run(AbstractChannelHandlerContext.java:337)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:391)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 1 more

4kingRAS commented 11 months ago

We have exactly the same problem: 5 or 6 days after rebooting the microservice, many exceptions suddenly blow up in the logs. We have no idea why.

Like the comment above, e.g. the value of key-A is returned as the result of a GET on key-B.

mp911de commented 11 months ago

Response mix-up can happen when custom codecs throw exceptions or when transactions are used from a shared connection by multiple threads. Are you able to reproduce the issue?
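
For the Spring Data Redis users in this thread, the shared connection being referred to is presumably the factory's shared native connection; a hedged sketch of disabling it so operations run on dedicated connections (a workaround that trades some throughput for isolation, not a confirmed fix):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

@Configuration
public class DedicatedConnectionConfig {

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        LettuceConnectionFactory factory =
                new LettuceConnectionFactory(new RedisStandaloneConfiguration("localhost", 6379));
        // Do not reuse one native Lettuce connection for all non-blocking operations;
        // each operation then obtains its own (possibly pooled) connection.
        factory.setShareNativeConnection(false);
        return factory;
    }
}
```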

4kingRAS commented 11 months ago

Response mix-up can happen when custom codecs throw exceptions or when transactions are used from a shared connection by multiple threads. Are you able to reproduce the issue?

@mp911de thank you for the reply. It is hard to reproduce the issue; the first time we found it was in the production environment. We then tried to reproduce it in a load-test environment, but failed after many attempts.

Our scenario looks like this:

┌────────────────────────────┐
│   Srv 1                    │
│ ┌─────────────────────┐    │
│ │Thread Pool          │    │                      ┌─────────┐
│ │                     │    │                      │   Srv n │
│ │t1      t2     t3 ...│    │      ............    │         │
│ └─────────────────────┘    │                      └────┬────┘
│ set     set     set        │                           │
│ get     get     get        │                           │
│ lua..   lua..   lua..      │                           │
│  │        │       │        │                           │
│  │        │       │        │                           │
│  │        │       │        │                           │
│  ▼        ▼       ▼        │                           │
│  ┌─────────────────┐       │                           │
│  │Spring data redis│       │                           │
│  └───────┬─────────┘       │                           │
│          │                 │                           │
└──────────┼─────────────────┘                           │
           │                                             │
           │                                             │
           │                   ┌────────────────────┐    │
           │Hash(key)          │                    │    │
           ├───────────────────┤   Redis Cluster 0  │◄───┤ Hash(key)
           │                   │                    │    │
           │                   └────────────────────┘    │
           │                                             │
           │                   ┌────────────────────┐    │
           │                   │                    │    │
           └──────────────────►│   Redis Cluster 1  │◄───┘
                               │                    │
                               └────────────────────┘

Lettuce 6.2.3, shareNativeConnection == true, pipeliningpolicy == flushOnClose, maxidleconnection == 10
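
As far as I can tell, that configuration maps onto Spring Data Redis roughly as follows (a sketch with guessed hostnames; the property names are my interpretation of the settings listed above, not the poster's actual code):

```java
import java.util.List;

import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import org.springframework.data.redis.connection.RedisClusterConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnection;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettucePoolingClientConfiguration;

public class ReportedConfigSketch {

    public LettuceConnectionFactory clusterConnectionFactory() {
        // maxidleconnection == 10, assumed to mean the connection pool's max-idle
        GenericObjectPoolConfig<Object> poolConfig = new GenericObjectPoolConfig<>();
        poolConfig.setMaxIdle(10);

        LettucePoolingClientConfiguration clientConfig = LettucePoolingClientConfiguration.builder()
                .poolConfig(poolConfig)
                .build();

        LettuceConnectionFactory factory = new LettuceConnectionFactory(
                new RedisClusterConfiguration(List.of("redis-node-1:6379")), clientConfig);

        // shareNativeConnection == true (the factory default)
        factory.setShareNativeConnection(true);
        // pipeliningpolicy == flushOnClose
        factory.setPipeliningFlushPolicy(LettuceConnection.PipeliningFlushPolicy.flushOnClose());
        return factory;
    }
}
```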

Our Redis setup is two 4C8G * 16 Redis clusters, and our service has 160+ pods, each with 4C8G hardware.

Our requests are TCP, and the average QPS is 20K ~ 30K per pod.

The issue has bothered me for 3 months and I totally have no idea.

So far there is no real harm to our business, just error logs, but we can't be sure about our end users.

The logs are weird. There are two kinds of exception: the first is the ValueOutput does not support set(long), the other is Lettuce throwing an ERR unknown command `$12` error.

I have observed many times that the error logs magically appear 5 or 6 days after we boot the service: first a dozen pods, then 20 ~ 30, and after 2 weeks 40 ~ 50.

Once we restart the service, the errors magically disappear until another 5 days pass.

PS: we use Spring Data Redis, with pipelining and Lua scripts but no transactions.
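
Since that usage goes through Spring Data Redis, the pipelining and Lua access pattern presumably looks something like the sketch below (illustrative key names and script, not the poster's actual code):

```java
import java.util.List;

import org.springframework.data.redis.core.RedisCallback;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;

public class PipelineAndLuaSketch {

    private final StringRedisTemplate template;

    public PipelineAndLuaSketch(StringRedisTemplate template) {
        this.template = template;
    }

    // Pipelined GETs: commands are buffered and flushed when the callback returns
    public List<Object> pipelinedReads(List<String> keys) {
        return template.executePipelined((RedisCallback<Object>) connection -> {
            for (String key : keys) {
                connection.stringCommands().get(key.getBytes());
            }
            return null; // executePipelined requires the callback to return null
        });
    }

    // Lua script executed through Spring's script executor (EVALSHA with EVAL fallback)
    public Long setIfAbsent(String key, String value) {
        DefaultRedisScript<Long> script = new DefaultRedisScript<>(
                "if redis.call('EXISTS', KEYS[1]) == 0 then redis.call('SET', KEYS[1], ARGV[1]) return 1 end return 0",
                Long.class);
        return template.execute(script, List.of(key), value);
    }
}
```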

devilmc8888 commented 11 months ago

We have same problem

Lettuce 6.1.10 (screenshot attached)

jiayuzhang89 commented 11 months ago

To be more accurate, we encountered the issue on Lettuce 6.2.2 and Redis Cluster 6.0.x.

4kingRAS commented 11 months ago

More information:

Every time, the first couple of exceptions look like this:

org.springframework.data.redis.RedisSystemException: Error in execution; nested exception is io.lettuce.core.RedisCommandExecutionException: ERR unknown command `xxxxxxxxxxxxx`, with args beginning with: 

org.springframework.data.redis.RedisSystemException: Error in execution; nested exception is io.lettuce.core.RedisCommandExecutionException: ERR unknown command `$18`, with args beginning with: 

org.springframework.data.redis.RedisSystemException: Error in execution; nested exception is io.lettuce.core.RedisCommandExecutionException: ERR unknown command `xxxxxxxxxxxxx`, with args beginning with: 

and xxxxxxxxxxxxx is actually a value, not a command.

There are also many errors like this; it looks like the wrong data is being sent where a script is expected:

org.springframework.data.redis.RedisSystemException: Error in execution; nested exception is io.lettuce.core.RedisNoScriptException: NOSCRIPT No matching script. Please use EVAL.

erigamonte commented 8 months ago

Hello guys, any news about this problem? I am also having this issue on Lettuce 6.0.9 and Redis Cluster 7.0.x

str1aN commented 8 months ago

Hi folks, just chiming in to say we just encountered this issue with spring-boot-starter-data-redis:3.1.5 (lettuce-core:6.2.6.RELEASE) talking to an AWS ElastiCache Redis instance.

4kingRAS commented 8 months ago

(quoting the full scenario description from my earlier comment above)

Update: last month we moved our entire cluster to a new cloud IaaS vendor, and the problem has never happened again. We still have no idea why.

erigamonte commented 8 months ago

(quoting 4kingRAS's scenario description and update above)

Hello @4kingRAS! Could you mention which vendor you used when this happened?

4kingRAS commented 7 months ago

(quoting erigamonte's question and the earlier exchange above)

https://www.ctyun.cn/

It is a local Chinese company. IaaS vendors often do secondary development on Redis, so that could be where the problem creeps in.

tishun commented 5 months ago

Folks, providing a minimal sample would greatly increase the chances of fixing the issue (if there is an issue with the driver itself). Right now we do not have such a scenario and we can't do anything about this problem.

erigamonte commented 5 months ago

Hello @tishun, the problem is that we can't do it. This issue is not reproducible! I implemented a guardrail in my code because it happens randomly. The last time it occurred was months ago, and we didn't do anything to stop it; before that, it had happened two weeks earlier. I don't understand what's going on.
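
The guardrail itself wasn't shared, but for anyone wanting something similar, one hypothetical shape for it is a periodic canary check: write a random value to a dedicated key, read it back, and replace the connection if the response doesn't match (a sketch only, not erigamonte's code):

```java
import java.util.UUID;

import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;

public class ConnectionCanary {

    private final RedisClient client;
    private volatile StatefulRedisConnection<String, String> connection;

    public ConnectionCanary(RedisClient client) {
        this.client = client;
        this.connection = client.connect();
    }

    // Run periodically; a mismatch indicates responses are out of sync with requests.
    public synchronized void check() {
        String expected = UUID.randomUUID().toString();
        connection.sync().set("canary:connection-check", expected);
        String actual = connection.sync().get("canary:connection-check");
        if (!expected.equals(actual)) {
            connection.close();            // drop the suspect connection
            connection = client.connect(); // and start over with a fresh one
        }
    }
}
```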

4kingRAS commented 4 months ago

Hello @tishun, the problem is that we can't do it. This issue is not reproducible! I implemented a guardrail in my code because it happens randomly. It occurred months ago the last time, but we didn't do anything to stop it. Before that, it happened two weeks prior. I don't understand what's going on.

More details: we moved our service to Tencent Cloud, and the problem is gone. The main difference between Tencent's Redis and the old cloud vendor's is that Tencent's Redis uses a proxy mode, see https://redis.io/blog/redis-enterprise-proxy/

Hope it helps you.

4kingRAS commented 4 months ago

Hello @tishun, the problem is that we can't do it. This issue is not reproducible! I implemented a guardrail in my code because it happens randomly. It occurred months ago the last time, but we didn't do anything to stop it. Before that, it happened two weeks prior. I don't understand what's going on.

Like your situation, it happens randomly and we can't reproduce it in UAT.

Or maybe someone with enough resources could build a cluster with a dozen clients connecting directly to multiple Redis clusters, using GET and HGET at high frequency. With a long-duration load test it might reproduce, lol.
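
In that spirit, a bare-bones version of such a load loop might look like the sketch below (against a single assumed redis://localhost node; a real reproduction attempt would also need cluster topology, many client pods, and injected packet loss or timeouts):

```java
import java.util.concurrent.ThreadLocalRandom;

import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.sync.RedisCommands;

public class GetSetLoadLoop {

    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost");
        StatefulRedisConnection<String, String> connection = client.connect();

        // Several threads share one connection (as with a shared native connection)
        // and hammer a small key space, verifying every GET against the value just SET,
        // so any key-A/key-B mix-up shows up as a mismatch.
        for (int t = 0; t < 8; t++) {
            final int threadId = t;
            new Thread(() -> {
                RedisCommands<String, String> sync = connection.sync();
                while (true) {
                    int i = ThreadLocalRandom.current().nextInt(1000);
                    String key = "load:" + threadId + ":key-" + i; // per-thread keys avoid false mismatches
                    String expected = "value-" + threadId + "-" + i;
                    sync.set(key, expected);
                    String actual = sync.get(key);
                    if (!expected.equals(actual)) {
                        System.err.println("MISMATCH: " + key + " -> " + actual);
                    }
                }
            }).start();
        }
    }
}
```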

doyouevendevelop commented 2 weeks ago

I ran into this issue recently and wanted to post some insights here to help contributors identify the root cause.

There were three main symptoms I saw:

  1. GET commands would return values from different keys.
  2. SET commands threw errors: java.lang.UnsupportedOperationException: io.lettuce.core.output.ValueOutput does not support set(long) (DEL command response)
  3. GET commands returning 'OK' (SET command response)

I looked through the following thread https://github.com/redis/lettuce/issues/2012 which gives some insight into where the problem occurs: it seems to be in the handling of command output after some unhandled exception disrupts the client.

I was able to reproduce the issue in my environment by doing a fault injection test that simulated packet loss to Redis. At the same time I ran a load test against the service to simulate traffic (GET/SET commands mostly).

Once the Lettuce client gets into a bad state it stays there until restarted. It does not seem to auto-recover.

Environment: Lettuce version 6.1.5

tishun commented 1 week ago

(quoting doyouevendevelop's comment above)

Could you share the code you used to simulate the issue (parts of it, if not the entire thing)?

doyouevendevelop commented 1 week ago

Unfortunately I am unable to post the code.

The service uses the reactive client connection, which is then wrapped in the Vert.x reactive framework.

It may partly be an issue with threading, but I'm not familiar enough with the Lettuce implementation to say that with confidence. Not all commands get mixed up when the issue occurs; it is a small percentage of overall traffic.

The repro is not 100% successful, but I only had time to repro it 3 times. It seems to start right AFTER the simulated packet loss to Redis is finished.

This is essentially how the wrapper looks for all the commands GET/SET/DEL. The service does not spin up a new thread prior to this call.

// Assumes a long-lived StatefulRedisClusterConnection<String, String> field named "connection"
public Future<String> set(String key, String value) {
    final Context context = Vertx.currentContext();
    if (context != null) {
        // Complete the Vert.x future back on the caller's context when the Redis future finishes
        return Future.fromCompletionStage(connection.async().set(key, value), context);
    } else {
        return Future.fromCompletionStage(connection.async().set(key, value));
    }
}