Open grzechum opened 2 years ago
Heads up that we were able to effectively work around this issue by increasing our timeout. I guess that lowered the probability of getting into this state enough that we haven't hit it again
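For anyone wanting to apply the same workaround, here is a minimal sketch of raising the command timeout with plain Lettuce; the endpoint and duration are placeholders, not values taken from this thread.

```java
import java.time.Duration;

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.TimeoutOptions;

public class TimeoutWorkaround {

    public static void main(String[] args) {
        // Placeholder endpoint; replace with your own.
        RedisClient client = RedisClient.create("redis://localhost:6379");

        // Raise the per-command timeout (value is illustrative) so that commands
        // are less likely to be cancelled while a response is still in flight.
        client.setOptions(ClientOptions.builder()
                .timeoutOptions(TimeoutOptions.enabled(Duration.ofSeconds(30)))
                .build());

        client.shutdown();
    }
}
```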
Thank you, we do not use custom codecs so this has to be triggered by timeout.
Thank you for reporting this issue. I have the same problem (e.g. key-A's value is returned as the result of a GET for key-B). Do you have any solution?
We encountered similar command-out-of-sync issues on multiple instances in our cluster. We don't have a custom codec. Looking back at the error logs, we found that most instances threw a CancellationException
at the beginning, then gradually started to throw more UnsupportedOperationException.
We had to restart those instances to recover.
Our env: Lettuce 6.2.1, Redis Cluster 6
```java
java.util.concurrent.CompletionException: java.util.concurrent.CancellationException
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1739)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.CancellationException
    at java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2396)
    at io.lettuce.core.protocol.AsyncCommand.cancel(AsyncCommand.java:146)
    at io.lettuce.core.protocol.AsyncCommand.cancel(AsyncCommand.java:180)
    at io.lettuce.core.protocol.CommandWrapper.cancel(CommandWrapper.java:111)
    at io.lettuce.core.protocol.CommandWrapper.cancel(CommandWrapper.java:111)
    at io.lettuce.core.protocol.CommandHandler.cancelCommands(CommandHandler.java:958)
    at io.lettuce.core.protocol.CommandHandler.reset(CommandHandler.java:938)
    at io.lettuce.core.protocol.CommandHandler.userEventTriggered(CommandHandler.java:257)
    at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
    at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:332)
    at io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:324)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1428)
    at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
    at io.netty.channel.AbstractChannelHandlerContext.access$500(AbstractChannelHandlerContext.java:61)
    at io.netty.channel.AbstractChannelHandlerContext$6.run(AbstractChannelHandlerContext.java:337)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:391)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 1 more
```
We have exactly the same problem: 5 or 6 days after rebooting the microservice, many exceptions suddenly blow up in the logs. We have no idea why.
Same symptom as described above, e.g. key-A's value is returned as the result of a GET for key-B.
Response mix-up can happen when custom codecs throw exceptions or when transactions are used from a shared connection by multiple threads. Are you able to reproduce the issue?
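For reference, a minimal sketch (not code from this thread) of the second point above: running a MULTI/EXEC transaction on its own dedicated connection rather than on a connection shared between threads.

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.TransactionResult;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.sync.RedisCommands;

public class DedicatedTransactionConnection {

    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost:6379");

        // Use a connection that only this thread touches for the transaction.
        // MULTI/EXEC on a connection shared across threads can interleave other
        // threads' commands into the transaction and mix up responses.
        try (StatefulRedisConnection<String, String> connection = client.connect()) {
            RedisCommands<String, String> commands = connection.sync();
            commands.multi();
            commands.set("key-A", "value-A");
            commands.incr("key-A-counter");
            TransactionResult result = commands.exec();
            System.out.println("Transaction discarded: " + result.wasDiscarded());
        }

        client.shutdown();
    }
}
```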
@mp911de thank you for the reply. It is hard to reproduce the issue; the first time we found it was in our production env. We then tried to reproduce it in our stress test env, but failed after many attempts.
Our scenario is like below:
┌────────────────────────────┐
│ Srv 1 │
│ ┌─────────────────────┐ │
│ │Thread Pool │ │ ┌─────────┐
│ │ │ │ │ Srv n │
│ │t1 t2 t3 ...│ │ ............ │ │
│ └─────────────────────┘ │ └────┬────┘
│ set set set │ │
│ get get get │ │
│ lua.. lua.. lua.. │ │
│ │ │ │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
│ ▼ ▼ ▼ │ │
│ ┌─────────────────┐ │ │
│ │Spring data redis│ │ │
│ └───────┬─────────┘ │ │
│ │ │ │
└──────────┼─────────────────┘ │
│ │
│ │
│ ┌────────────────────┐ │
│Hash(key) │ │ │
├───────────────────┤ Redis Cluster 0 │◄───┤ Hash(key)
│ │ │ │
│ └────────────────────┘ │
│ │
│ ┌────────────────────┐ │
│ │ │ │
└──────────────────►│ Redis Cluster 1 │◄───┘
│ │
└────────────────────┘
Lettuce 6.2.3
shareNativeConnection == true
pipeliningPolicy == flushOnClose
maxIdleConnection == 10
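For context, a minimal sketch of what that configuration might look like in Spring Data Redis code; the node addresses are placeholders and this is an assumption about how the settings above were applied, not the poster's actual code.

```java
import java.util.List;

import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import org.springframework.data.redis.connection.RedisClusterConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnection;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettucePoolingClientConfiguration;

public class RedisConfigSketch {

    public LettuceConnectionFactory connectionFactory() {
        // Placeholder cluster nodes.
        RedisClusterConfiguration cluster =
                new RedisClusterConfiguration(List.of("redis-node-1:6379", "redis-node-2:6379"));

        // maxIdleConnection == 10 (idle pooled connections)
        GenericObjectPoolConfig<Object> pool = new GenericObjectPoolConfig<>();
        pool.setMaxIdle(10);

        LettucePoolingClientConfiguration clientConfig =
                LettucePoolingClientConfiguration.builder().poolConfig(pool).build();

        LettuceConnectionFactory factory = new LettuceConnectionFactory(cluster, clientConfig);
        factory.setShareNativeConnection(true);                          // shareNativeConnection == true
        factory.setPipeliningFlushPolicy(
                LettuceConnection.PipeliningFlushPolicy.flushOnClose()); // pipeliningPolicy == flushOnClose
        return factory;
    }
}
```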
Our Redis setup is two 4C8G * 16 Redis clusters; our service has 160+ pods with 4C8G hardware each.
Our requests come in over TCP; the average QPS is 20K ~ 30K per pod.
The issue has bothered me for 3 months and I totally have no idea.
So far there has been no real harm to our business, just ERROR logs, but we can't be sure about our end users.
The logs are weird; there are two kinds of exceptions: the first is ValueOutput does not support set(long), the other is Lettuce throwing an ERR unknown command $12 error.
I have observed this many times: the error logs magically appear 5 or 6 days after we boot the service. First a dozen or so pods, then 20 ~ 30, and after 2 weeks 40 ~ 50.
Once we restart the service, the errors magically disappear for another 5 days.
PS: we use Spring Data Redis with pipelining and Lua scripts, but no transactions.
We have the same problem.
Lettuce 6.1.10
To be more accurate, we encountered the issue on Lettuce 6.2.2, Redis Cluster 6.0.x.
More information:
Every time, the first couple of exceptions look like this:
org.springframework.data.redis.RedisSystemException: Error in execution; nested exception is io.lettuce.core.RedisCommandExecutionException: ERR unknown command `xxxxxxxxxxxxx`, with args beginning with:
org.springframework.data.redis.RedisSystemException: Error in execution; nested exception is io.lettuce.core.RedisCommandExecutionException: ERR unknown command `$18`, with args beginning with:
org.springframework.data.redis.RedisSystemException: Error in execution; nested exception is io.lettuce.core.RedisCommandExecutionException: ERR unknown command `xxxxxxxxxxxxx`, with args beginning with:
The xxxxxxxxxxxxx is actually a value, not a command.
There are also many errors like the following, which look like the wrong data was sent where a script was expected:
org.springframework.data.redis.RedisSystemException: Error in execution; nested exception is io.lettuce.core.RedisNoScriptException: NOSCRIPT No matching script. Please use EVAL.
Hello guys, any news about this problem? I am also having this issue on Lettuce 6.0.9 and Redis Cluster 7.0.x
Hi folks, just chiming in to say we just encountered this issue with spring-boot-starter-data-redis:3.1.5 (lettuce-core:6.2.6.RELEASE) talking to an AWS ElastiCache Redis instance.
Update: last month we moved our entire cluster to a new cloud IaaS vendor, and the problem has never happened again. We still totally have no idea why.
Hello @4kingRAS! Could you mention which vendor you used when this happened?
A local Chinese company. IaaS vendors often do secondary development on top of Redis, so that is a possible weak point.
Folks, providing a minimal sample would greatly increase the chances of fixing the issue (if there is an issue with the driver itself). Right now we do not have such a scenario and we can't do anything about this problem.
Hello @tishun, the problem is that we can't do it. This issue is not reproducible! I implemented a guardrail in my code because it happens randomly. It occurred months ago the last time, but we didn't do anything to stop it. Before that, it happened two weeks prior. I don't understand what's going on.
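For anyone considering a similar mitigation, here is one possible guardrail as a minimal sketch. It is an assumption about what such a check could look like, not necessarily what was implemented here: the key is stored alongside the value and verified on every read, so a mixed-up response is rejected instead of being served.

```java
import io.lettuce.core.api.sync.RedisCommands;

public class KeyEchoGuardrail {

    private static final char SEPARATOR = '|';

    private final RedisCommands<String, String> redis;

    public KeyEchoGuardrail(RedisCommands<String, String> redis) {
        this.redis = redis;
    }

    public String set(String key, String value) {
        // Prefix the payload with its own key so reads can be validated.
        return redis.set(key, key + SEPARATOR + value);
    }

    public String get(String key) {
        String raw = redis.get(key);
        if (raw == null) {
            return null;
        }
        int sep = raw.indexOf(SEPARATOR);
        if (sep < 0 || !raw.substring(0, sep).equals(key)) {
            // The stored payload claims to belong to a different key: treat it as
            // a mixed-up response rather than returning the wrong data.
            throw new IllegalStateException("Mismatched response for key " + key);
        }
        return raw.substring(sep + 1);
    }
}
```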
More details: we moved our service to Tencent Cloud and the problem is gone. The main difference between Tencent's Redis and the old cloud vendor's is that Tencent's Redis uses a proxy mode, see https://redis.io/blog/redis-enterprise-proxy/
Hope this helps you.
Like your situation, it happens randomly and we can't reproduce it in UAT.
Maybe someone with enough resources could build a setup with dozens of clients connecting directly to multiple Redis clusters, hammering GET and HGET at high frequency, and run a long-duration load test; that might reproduce it, lol.
I ran into this issue recently and wanted to post some insights here to help contributors identify the root cause.
There were three main symptoms I saw:
- GET commands would return values from different keys.
- SET commands threw errors: java.lang.UnsupportedOperationException: io.lettuce.core.output.ValueOutput does not support set(long) (a DEL command response).
- GET commands returning 'OK' (a SET command response).
I looked through the following thread, https://github.com/redis/lettuce/issues/2012, which gives insight into where the problem occurs. It seems to be in the command output handling after some unhandled exception disrupts the client.
I was able to reproduce the issue in my environment by running a fault injection test that simulated packet loss to Redis. At the same time I ran a load test against the service to simulate traffic (mostly GET/SET commands).
Once the Lettuce client gets into a bad state it stays there until restarted. It does not seem to auto-recover.
Environment: Lettuce version 6.1.5
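For reference, the load side of such a repro could be as simple as the following sketch (an assumed setup; the packet-loss injection itself would be done with an external network fault tool and is not shown). A single bounded loop is shown for brevity, whereas the actual test presumably ran with concurrent traffic.

```java
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
import io.lettuce.core.cluster.api.sync.RedisAdvancedClusterCommands;

public class MixUpLoadTest {

    public static void main(String[] args) {
        // Placeholder cluster endpoint.
        RedisClusterClient client = RedisClusterClient.create("redis://localhost:7000");

        try (StatefulRedisClusterConnection<String, String> connection = client.connect()) {
            RedisAdvancedClusterCommands<String, String> redis = connection.sync();

            for (long i = 0; i < 10_000_000L; i++) {
                String key = "key-" + (i % 10_000);
                String expected = "value-" + i;
                redis.set(key, expected);
                String actual = redis.get(key);
                if (!expected.equals(actual)) {
                    // A mismatch here is the "GET returns another key's value" symptom.
                    System.err.printf("Mix-up on %s: expected %s but got %s%n", key, expected, actual);
                }
            }
        }

        client.shutdown();
    }
}
```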
Could you share (parts of, if not the entire) code you used to simulate the issue?
Unfortunately I am unable to post the code.
The service uses the reactive client connection, which is then wrapped in the Vert.x reactive framework.
It may partly be an issue with threading, but I'm not familiar enough with the Lettuce implementation to say that with confidence. Not all commands get mixed up when the issue occurs; it is a small percentage of overall traffic.
The repro is not 100% successful, but I only had time to repro it 3 times. It seems to start right AFTER the simulated packet loss to Redis is finished.
This is essentially how the wrapper looks for all the commands GET/SET/DEL. The service does not spin up a new thread prior to this call.
```java
// connection is a StatefulRedisClusterConnection<String, String> field held by the wrapper
public Future<String> set(String key, String value) {
    final Context context = Vertx.currentContext();
    if (context != null) {
        // Complete the Vert.x Future back on the caller's context
        return Future.fromCompletionStage(connection.async().set(key, value), context);
    } else {
        return Future.fromCompletionStage(connection.async().set(key, value));
    }
}
```
Hello, everyone. Could you please share your environment?
I encountered this issue when using JDK 8u222 and spring-boot-data-redis-reactive. Since then I have updated my stack to JDK 17 (Corretto) and spring-boot-data-redis (non-reactive), and it has now been a year without any issues.
Hi everyone. The problem still happens on the lettuce 6.1.10 release.
The problem is intermittent; however, it impacts the system.
Response mix-up can happen when custom codecs throw exceptions or when transactions are used from a shared connection by multiple threads. Are you able to reproduce the issue?
For anyone hitting this issue - please start by checking out the comment quoted above.
Bug Report
Current Behavior
We have a lot of microservices which use Lettuce as their Redis client. Sometimes (once every 1-2 weeks) Lettuce on one of the instances starts to raise exceptions like this:
Stack trace
```java
java.lang.UnsupportedOperationException: io.lettuce.core.output.ValueOutput does not support set(long)
    at io.lettuce.core.output.CommandOutput.set(CommandOutput.java:107)
    at io.lettuce.core.protocol.RedisStateMachine.safeSet(RedisStateMachine.java:778)
    at io.lettuce.core.protocol.RedisStateMachine.handleInteger(RedisStateMachine.java:404)
    at io.lettuce.core.protocol.RedisStateMachine$State$Type.handle(RedisStateMachine.java:206)
    at io.lettuce.core.protocol.RedisStateMachine.doDecode(RedisStateMachine.java:334)
    at io.lettuce.core.protocol.RedisStateMachine.decode(RedisStateMachine.java:295)
    at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:842)
    at io.lettuce.core.protocol.CommandHandler.decode0(CommandHandler.java:793)
    at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:767)
    at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:659)
    at io.lettuce.core.protocol.CommandHandler.channelRead(CommandHandler.java:599)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:487)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:385)
    at io.netty.util.concurrent.SingleThreadEve...
```
What is even worse, in this case Lettuce mixes up the values returned for keys, e.g. key-A's value is returned as the result of a GET for key-B. I have not been able to find any cause so far and there is no way to reproduce it easily; it just happens from time to time on one of the instances, and only killing the instance helps.
Input Code
Only simple reactive API: SET/GET/DEL
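For illustration, a minimal sketch of the kind of reactive SET/GET/DEL usage described; this is an assumption about the shape of the calls, since the actual input code was not posted.

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.reactive.RedisReactiveCommands;
import reactor.core.publisher.Mono;

public class ReactiveUsageSketch {

    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost:6379");
        StatefulRedisConnection<String, String> connection = client.connect();
        RedisReactiveCommands<String, String> reactive = connection.reactive();

        // Simple reactive SET / GET / DEL, nothing more exotic than this.
        Mono<String> result = reactive.set("key-A", "value-A")
                .then(reactive.get("key-A"))
                .doOnNext(value -> System.out.println("key-A -> " + value))
                .then(reactive.del("key-A"))
                .thenReturn("done");

        result.block();

        connection.close();
        client.shutdown();
    }
}
```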
Expected behavior/code
Environment