redis / lettuce

Advanced Java Redis client for thread-safe sync, async, and reactive usage. Supports Cluster, Sentinel, Pipelining, and codecs.
https://lettuce.io
MIT License

Responses getting out of sync with requests #2012

Open jkern888 opened 2 years ago

jkern888 commented 2 years ago

Current Behavior

We've recently updated lettuce to 6.1.1.RELEASE and after running it in our production environment for a week we ran into a troubling issue. At some point it looks like lettuce worked itself into a state where commands were receiving responses to different requests. This manifested as lots of deserialization errors (e.g. java.lang.UnsupportedOperationException: io.lettuce.core.output.ValueOutput does not support set(long)), but also more seriously as incorrect data being returned to the caller.

Restarting the service brought things back to a working state. Before this happened, the only other thing we observed was some RedisCommandTimeoutExceptions being thrown, but not at an anomalous level.

In the meantime we've added more aggressive checks for this condition, so if it recurs I'd be happy to gather more information, but I'm not sure what would be helpful.

Environment

mp911de commented 2 years ago

Depending on how you use the library this might happen when using e.g. a single connection with transactions from multiple threads without proper synchronization.

jkern888 commented 2 years ago

@mp911de Thanks for the comment. I don't think that's the case here as our usage is pretty uncomplicated: we just use the synchronous API to do get, set, expire, and del from a single connection. We also use the compression codec to zip the data, along with Jackson for de/serialization, but all of those claim to be thread-safe.

The fact this could happen from incorrect usage like you said is interesting though. Is there a particular part of the library you could point me to? May help in debugging since possibly there's some other path into the same state?

mp911de commented 2 years ago

Hmm, when using Jackson as a codec, encoding errors may bring the protocol out of sync. Please check whether you can see any exceptions prior to getting out of sync.

jkern888 commented 2 years ago

@mp911de Interesting. Here's the code for the codec we're using. Should we be swallowing the exception and returning null?

public class JacksonRedisCodec<T> implements RedisCodec<String, T> {

    private final ObjectWriter writer;
    private final ObjectReader reader;

    // Encoding used for the key string.
    private static final Charset UTF_8 = Charsets.UTF_8;

    public JacksonRedisCodec(ObjectWriter writer, ObjectReader reader) {
        this.writer = writer;
        this.reader = reader;
    }

    @Override
    public String decodeKey(ByteBuffer byteBuffer) {
        ByteBufferBackedInputStream inputStream = new ByteBufferBackedInputStream(byteBuffer);
        StringWriter writer = new StringWriter();
        try {
            IOUtils.copy(inputStream, writer, UTF_8.name());
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
        return writer.toString();
    }

    @Override
    public T decodeValue(ByteBuffer byteBuffer) {
        ByteBufferBackedInputStream inputStream = new ByteBufferBackedInputStream(byteBuffer);
        try {
            return reader.readValue(inputStream);
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public ByteBuffer encodeKey(String string) {
        return ByteBuffer.wrap(string.getBytes(UTF_8));
    }

    @Override
    public ByteBuffer encodeValue(T t) {
        try {
            return ByteBuffer.wrap(writer.writeValueAsBytes(t));
        }
        catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }
    }
}

mp911de commented 2 years ago

Any exceptions that happen on the codec level affect directly what is being sent to the Redis server. If you have say a command sequence of SET -> SET -> GET and the second SET throws an exception (SET -> 💥 -> GET), then the client is no longer synchronized to the protocol as the second command was never sent to the Redis server.
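
To make the SET -> 💥 -> GET scenario concrete, here is a minimal sketch of a pipelined connection that matches replies to callers purely by order. It is a simplified, hypothetical model (the class `DesyncModel` and its methods are invented for illustration) and does not reproduce Lettuce's actual internals. It only shows what follows if a command is registered as pending but never reaches the server.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified, hypothetical model of a pipelined connection; NOT Lettuce's real implementation.
final class DesyncModel {

    // Commands waiting for a reply. Replies are matched by order only.
    private final Deque<String> pending = new ArrayDeque<>();
    // Commands that actually reached the simulated server.
    private final List<String> sentToServer = new ArrayList<>();

    // Registers the command as pending; when encoding fails, nothing goes on the wire.
    void write(String command, boolean encodingFails) {
        pending.add(command);
        if (!encodingFails) {
            sentToServer.add(command);
        }
        // Modeled failure mode (an assumption): the pending entry is left behind.
    }

    // The server answers only what it received, in order; each reply goes to the oldest pending command.
    List<String> deliverReplies() {
        List<String> matches = new ArrayList<>();
        for (String sent : sentToServer) {
            String waiting = pending.poll();
            matches.add(waiting + " <- reply to " + sent);
        }
        return matches;
    }

    // Replays the sequence from the comment above: SET, failed SET, GET.
    static List<String> demo() {
        DesyncModel connection = new DesyncModel();
        connection.write("SET a", false);
        connection.write("SET b", true);   // encoding blows up, command never sent
        connection.write("GET a", false);
        return connection.deliverReplies();
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this model the failed `SET b` is handed the reply that belongs to `GET a`, and `GET a` stays pending until the next unrelated reply arrives, so every later caller is off by one.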

We try to recover as much as possible upon encoding/decoding but there's always a fair chance that exceptions during encoding/decoding may impact the protocol synchronization.

I strongly suggest moving the encoding/decoding part into a higher level component so that you can handle exceptions properly.
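
One way to read this suggestion is sketched below: serialize and deserialize in a wrapper around the connection, so a failure surfaces before any command is issued. Everything here is an assumption for illustration: `ByteStore` is a hypothetical stand-in for a connection that only moves raw bytes (for example, one opened with Lettuce's `ByteArrayCodec`), and the two `Function` arguments stand in for Jackson's writer and reader.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical stand-in for a Redis connection that only deals in raw bytes.
interface ByteStore {
    void set(String key, byte[] value);
    byte[] get(String key);
}

// Wrapper that owns serialization, so codec-level code can never throw.
final class SerializingCache<T> {

    // In-memory ByteStore, present only to make this sketch runnable.
    static final class InMemoryStore implements ByteStore {
        final Map<String, byte[]> data = new HashMap<>();
        @Override public void set(String key, byte[] value) { data.put(key, value); }
        @Override public byte[] get(String key) { return data.get(key); }
    }

    private final ByteStore store;
    private final Function<T, byte[]> serializer;
    private final Function<byte[], T> deserializer;

    SerializingCache(ByteStore store, Function<T, byte[]> serializer, Function<byte[], T> deserializer) {
        this.store = store;
        this.serializer = serializer;
        this.deserializer = deserializer;
    }

    // Serializes first; if that fails, no command is sent at all.
    boolean put(String key, T value) {
        final byte[] bytes;
        try {
            bytes = serializer.apply(value);
        } catch (RuntimeException e) {
            return false; // handled here, outside the connection
        }
        store.set(key, bytes);
        return true;
    }

    // Fetches raw bytes, then deserializes; a bad payload becomes an empty result.
    Optional<T> get(String key) {
        byte[] bytes = store.get(key);
        if (bytes == null) {
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(deserializer.apply(bytes));
        } catch (RuntimeException e) {
            return Optional.empty();
        }
    }
}
```

Returning `false` or an empty `Optional` is just one possible policy; logging or rethrowing a domain-specific exception would work equally well, as long as the failure happens outside the codec.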

jkern888 commented 2 years ago

The same issue has just re-occurred. Digging into our logging I'm seeing no exceptions around the codecs. The only thing I do see is a command timeout:

 io.lettuce.core.RedisCommandTimeoutException: Command timed out after 1 second(s)
    at io.lettuce.core.internal.ExceptionFactory.createTimeoutException(ExceptionFactory.java:53)
    at io.lettuce.core.protocol.CommandExpiryWriter.lambda$potentiallyExpire$0(CommandExpiryWriter.java:172)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
    at io.netty.util.concurrent.DefaultEventExecutor.run(DefaultEventExecutor.java:66)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:832)

Immediately after that we start getting many errors due to responses being de-synced. We see that same timeout exception at other times with no ill effect, but maybe it's occasionally triggering this bad state?

bytestreme commented 2 years ago

Possible result of being out of sync https://github.com/lettuce-io/lettuce-core/issues/2175

rndmania commented 2 years ago

@jkern888 I have a question. I'm hitting the same problem, but in my case it isn't related to the codec. You said the problem occurred after upgrading to 6.1.1, so I'm wondering: which version was the last one that worked for you?

Previously I used lettuce v5.x, and that version did not have this problem.

qinchunabng commented 1 year ago

We have the same problem, and I reproduced it with the code below:

@SpringBootTest
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
public class RedisTest {

    @Autowired
    private ReactiveRedisTemplate redisTemplate;

    private ExecutorService executorService = Executors.newFixedThreadPool(20);

    @Test
    public void testRedis() throws Exception{
        Random random = new Random();
        AtomicInteger counter = new AtomicInteger(0);
        for (int i = 0; i < 100000; i++) {
            int finalI = i;
            executorService.submit(() -> {
                redisTemplate.opsForValue().set("testkey", "testvalue" + finalI)
                        //.publishOn(Schedulers.boundedElastic())
                        .flatMap(r1 -> {
                            System.out.println(Thread.currentThread().getName() + " : result1 = " + r1);
                            return redisTemplate.opsForHash().get("test_hash", "hash_field") //, "test_value" + finalI)
                                   // .publishOn(Schedulers.boundedElastic())
                                    .defaultIfEmpty("")
                                    .flatMap(r2 -> {
                                        if(random.nextInt() % 2 == 0 && counter.get() < 100){
                                            counter.incrementAndGet();
                                            throw new OutOfMemoryError("test out of memory error");
                                        }
                                        System.out.println(Thread.currentThread().getName() + " : result2 = " + r2);
                                        return redisTemplate.opsForHash().put("test_hash", "hash_field", "hash_value" + finalI)
                                                //.publishOn(Schedulers.boundedElastic())
                                                .flatMap(r3 -> {
                                                            System.out.println(Thread.currentThread().getName() + " : result3 = " + r3);
                                                            return redisTemplate.opsForValue()
                                                                    .get("testkey")
                                                                    //.publishOn(Schedulers.boundedElastic())
                                                                    .map(r4 -> {
                                                                        System.out.println(Thread.currentThread().getName() + " : result4 = " + r4);
                                                                        return r4;
                                                                    });
                                                        });

                                    });
                        })
//                        .subscribeOn(Schedulers.boundedElastic())
                        .subscribe();
            });
            TimeUnit.MILLISECONDS.sleep(10);

        }
        System.in.read();
    }

}

My environment is:

I found that if my business code throws an error, it is caught by the lettuce pool thread, which eventually leads to the java.lang.UnsupportedOperationException: io.lettuce.core.output.ValueOutput does not support set(long) exception. If I move my business code onto a separate thread pool with publishOn, the error is no longer caught by the lettuce thread, which avoids the exception.

rndmania commented 1 year ago

@qinchunabng Thanks for sharing. I have one question. Your example code has two commented-out calls: one is 'publishOn' and the other is 'subscribeOn'. To avoid the exception, do both need to be uncommented?

qinchunabng commented 1 year ago

Just uncomment 'publishOn'. 'subscribeOn' is testing code.

KamToHung commented 1 year ago

Hi, how can this be resolved?

qinchunabng commented 1 year ago

Using publishOn makes the business code inside flatMap run on a different thread pool instead of lettuce's thread pool. In my tests, if your code runs on lettuce's thread pool and throws an exception or Error, the connection will probably end up out of sync.
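
The thread hand-off described here can be illustrated with plain JDK classes. The sketch below is only an analogy using `CompletableFuture`, not Reactor or Lettuce: `thenApply` plays the role of a callback that runs on the thread delivering the reply, and `thenApplyAsync` with an executor plays the role of `publishOn`. The thread names `io-thread` and `business-worker` and the class `CallbackThreadDemo` are made up for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Analogy only: shows which thread runs a callback, with and without a hand-off.
final class CallbackThreadDemo {

    // Returns { thread that ran the inline callback, thread that ran the handed-off callback }.
    static String[] run() {
        ExecutorService workers =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "business-worker"));
        try {
            CompletableFuture<String> reply = new CompletableFuture<>();

            // Registered before completion: runs on whichever thread completes 'reply'.
            CompletableFuture<String> inline =
                    reply.thenApply(v -> Thread.currentThread().getName());

            // Handed off: always runs on the 'workers' executor.
            CompletableFuture<String> handedOff =
                    reply.thenApplyAsync(v -> Thread.currentThread().getName(), workers);

            // Simulated I/O thread delivering the reply.
            Thread io = new Thread(() -> reply.complete("OK"), "io-thread");
            io.start();
            try {
                io.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }

            return new String[] { inline.join(), handedOff.join() };
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] threads = run();
        System.out.println("inline callback ran on:     " + threads[0]);
        System.out.println("handed-off callback ran on: " + threads[1]);
    }
}
```

In the inline case, anything the callback throws is raised on the I/O thread itself; after the hand-off, it is raised on the worker thread instead. Whether that fully explains the desync reported in this issue is not established here.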

wqq0712 commented 1 year ago

Our team also encountered this sporadic issue, which occurs approximately once every 1-2 weeks. Through our testing, we found that this problem only occurs on aarch64 servers and is not observed on X86-64 servers. We were able to resolve this issue by adjusting the version of the JDK.

Here is the environment information:

Redis Server: 6.2.5
spring-data-redis: 2.1.20.RELEASE
springboot: 2.1.17.RELEASE
Lettuce: 6.0.7 Release

Test results:

Oracle JDK 8u131 ---> Exception
Oracle JDK 8u201 ---> Exception
Oracle JDK 8u251 ---> OK
Open JDK 8u352 ---> OK
Open JDK 8u292 ---> OK

rndmania commented 1 year ago

@wqq0712 Thanks for the useful information.

I have hit this problem on an x86-64 server with Open JDK 8u222.

So I don't think aarch64 versus x86-64 is the deciding condition. I will try testing with various Open JDK versions.

Thank you.

mp911de commented 1 year ago

I do not think that the JDK version is involved. Rather, out of sync happens either if a codec fails encoding/decoding or an output fails (or any other exotic condition happens).

Happy to provide support if you can provide a reliable reproducer.

FULaBUla commented 7 months ago

@rndmania

I think maybe there is a bug in the 8u222 version, which I also use on the aarch64 platform. The problem stopped when I upgraded to 8u341.