redisson / redisson

Redisson - Valkey and Redis Java client. Complete Real-Time Data Platform. Sync/Async/RxJava/Reactive API. Over 50 Valkey and Redis based Java objects and services: Set, Multimap, SortedSet, Map, List, Queue, Deque, Semaphore, Lock, AtomicLong, Map Reduce, Bloom filter, Spring, Tomcat, Scheduler, JCache API, Hibernate, RPC, local cache.
https://redisson.pro
Apache License 2.0
23.41k stars 5.38k forks

Is RMap aware of cluster restart? #5812

Closed lukasz-weareuncapped closed 2 months ago

lukasz-weareuncapped commented 7 months ago

I have a Spring app running with an RMap created via RedissonClient.getMap and then cached. Once something happens to a few Redis nodes (we had a forced restart), that cached map does not work anymore. What is the recommended approach in that case? Calling redissonClient.getMap each time? As I saw, under the hood a new RedissonMap is created, so that could be time- and resource-consuming. Please advise. Thank you.

mrniko commented 7 months ago

that cached map does not work anymore

Can you clarify here?

lukasz-weareuncapped commented 7 months ago

While reading/writing to the map I got an exception. Unfortunately I have no clue what the exact exception is, due to a lack of logs, but we do read and write on it.

mrniko commented 7 months ago

Sorry, without details I'm unable to help. Redisson handles failover, so you don't need to call redissonClient.getMap() again after one.
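The recommended pattern, then, is to create the client and the map handle once and keep reusing both across failovers. A minimal sketch (class name, map key, and generic types are made up for illustration; this assumes a reachable server and the redisson dependency on the classpath):

```java
import org.redisson.Redisson;
import org.redisson.api.RMap;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class MapHolder {
    private final RedissonClient client;
    private final RMap<String, String> map;

    public MapHolder(Config config) {
        // Create the client once; Redisson handles reconnects and failover
        // internally, so the same RMap handle stays valid afterwards.
        this.client = Redisson.create(config); // requires a reachable server
        this.map = client.getMap("myMap");     // cheap wrapper object, safe to cache
    }

    public RMap<String, String> map() {
        return map;
    }

    public void shutdown() {
        client.shutdown();
    }
}
```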

lukasz-weareuncapped commented 7 months ago

Now I know much more. I've turned on debug logs and can see that Redisson keeps trying to reconnect (I scaled the Redis cluster down to 0 and back up to reproduce the issue). Unfortunately it keeps trying to reconnect to an IP that is not active anymore. After the reboot the IP changed, but it looks like the old address is somehow cached, and the client is not able to connect to the newly started Redis nodes. It keeps trying forever with no luck. Is there any caching mechanism?

```
2024-04-26 11:03:16.519 CEST Connection created [addr=redis://locking-redis-node-0.locking-redis-headless.default.svc.cluster.local:26379]
2024-04-26 11:03:16.618 CEST reconnecting RedisPubSubConnection@395557197 [redisClient=[addr=redis://10.1.69.64:6379], channel=[id: 0x3d2eea36, L:/10.1.75.4:52138 ! R:10.1.69.64/10.1.69.64:6379], currentCommand=null, usage=0] to 10.1.69.64/10.1.69.64:6379
2024-04-26 11:03:16.618 CEST Connection created [addr=redis://10.1.69.64:6379]
2024-04-26 11:03:17.818 CEST connection released for command (INFO REPLICATION) and params [] from slot NodeSource [slot=0, addr=null, redisClient=null, redirect=null, entry=null] using connection RedisConnection@1350729690 [redisClient=[addr=redis://10.1.69.64:6379], channel=[id: 0xc72cb405, L:/10.1.75.4:52132 ! R:10.1.69.64/10.1.69.64:6379], currentCommand=null, usage=0]
2024-04-26 11:03:17.818 CEST attempt 3 for command (INFO REPLICATION) and params [] to NodeSource [slot=0, addr=null, redisClient=null, redirect=null, entry=null]
```

mrniko commented 7 months ago

Which Redisson version?

lukasz-weareuncapped commented 7 months ago

Redisson client 3.26.1. Configuration we have:

```
"SPRING_DATA_REDIS_PASSWORD": "123",
"SPRING_DATA_REDIS_SENTINEL_MASTER": "mymaster",
"SPRING_DATA_REDIS_SENTINEL_NODES": "locking-redis-headless.default.svc.cluster.local:26379",
```

on Redis from the bitnami chart, version 17.13.2

mrniko commented 7 months ago

Can you share Redisson config?

Can you set trace logging level for org.redisson package and share the output?

lukasz-weareuncapped commented 7 months ago

Redisson configuration is injected via redisson-spring-boot-starter:

```yaml
spring:
  data:
    redis:
      password: mypass
      sentinel:
        master: mymaster
        nodes: locking-redis-headless.default.svc.cluster.local:26379
```
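For reference, the starter should translate those properties into roughly this Redisson Sentinel configuration (a sketch using Redisson's Java config API, with the values copied from the YAML above; the exact defaults the starter applies may differ):

```java
import org.redisson.config.Config;

public class SentinelConfigExample {
    // Build a Sentinel-mode config equivalent to the Spring properties above.
    public static Config build() {
        Config config = new Config();
        config.useSentinelServers()
              .setMasterName("mymaster")
              .addSentinelAddress("redis://locking-redis-headless.default.svc.cluster.local:26379")
              .setPassword("mypass");
        return config;
    }
}
```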

Traces on the way...

lukasz-weareuncapped commented 7 months ago

Explore-logs-2024-04-26 20_28_22.txt -> logs from 15 minutes of the application trying to reconnect (after restarting all slaves)

and logs from when the Redis slaves were working fine -> Explore-logs-2024-04-26 20_34_00.txt

lukasz-weareuncapped commented 7 months ago

I found one exception related to DNS and Netty, and found an issue solved by the Netty team related to DNS caching, https://github.com/netty/netty/issues/11885, which is fixed in 4.1.109 or 4.1.110 (also related: https://github.com/netty/netty/issues/12842), while the latest Redisson, 3.29.0, uses 4.1.104 under the hood.
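If that Netty DNS cache bug were the culprit, one possible workaround (assuming a Maven build; this is a sketch, not something verified against this setup) would be to pin Netty to a release containing the fix via the Netty BOM, which overrides the version Redisson pulls in transitively:

```xml
<dependencyManagement>
  <dependencies>
    <!-- Pin all Netty artifacts to a release containing the DNS-cache fix -->
    <dependency>
      <groupId>io.netty</groupId>
      <artifactId>netty-bom</artifactId>
      <version>4.1.110.Final</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
```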

One more exception I'm facing on a daily basis:

```
ERROR 2024-04-28T02:10:10.190425043Z Can't execute SENTINEL commands on 10.236.135.253/10.236.135.253:26379
java.util.concurrent.CompletionException: org.redisson.client.RedisTimeoutException: Command execution timeout for command: (SENTINEL GET-MASTER-ADDR-BY-NAME), params: [mymaster], Redis client: [addr=redis://xx:26379]
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
	at org.redisson.client.RedisConnection.lambda$async$0(RedisConnection.java:254)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
	at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
	at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
	at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.redisson.client.RedisTimeoutException: Command execution timeout for command: (SENTINEL GET-MASTER-ADDR-BY-NAME), params: [mymaster], Redis client: [addr=redis://xx:26379]
	at org.redisson.client.RedisConnection.lambda$async$0(RedisConnection.java:253)
	... 7 common frames omitted
```

and

```
org.redisson.client.RedisTimeoutException: Command execution timeout for command: (AUTH), params: (password masked), Redis client: [addr=redis://locking-redis-node-2.locking-redis-headless.default.svc.cluster.local:26379]
java.util.concurrent.CompletionException: org.redisson.client.RedisTimeoutException: Command execution timeout for command: (AUTH), params: (password masked), Redis client: [addr=redis://locking-redis-node-2.locking-redis-headless.default.svc.cluster.local:26379]
	at java.base/java.util.concurrent.CompletableFuture.encodeRelay(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture.completeRelay(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
	at org.redisson.client.RedisConnection.lambda$async$0(RedisConnection.java:254)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
	at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
	at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
	at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.redisson.client.RedisTimeoutException: Command execution timeout for command: (AUTH), params: (password masked), Redis client: [addr=redis://locking-redis-node-2.locking-redis-headless.default.svc.cluster.local:26379]
	at org.redisson.client.RedisConnection.lambda$async$0(RedisConnection.java:253)
	... 7 common frames omitted
```

mrniko commented 7 months ago

while the latest 3.29.0 redisson uses 4.1.104 under the hood

It uses 4.1.109 https://repo1.maven.org/maven2/org/redisson/redisson-parent/3.29.0/redisson-parent-3.29.0.pom

As for the timeouts, you need to read https://github.com/redisson/redisson/wiki/16.-FAQ#q-what-is-the-cause-of-redistimeoutexception. I'm sure it's a network issue.

lukasz-weareuncapped commented 7 months ago

Hi @mrniko, thank you for your advice. We went through the whole network but have not found any problem.

I ran several tests and just made a discovery that confirms it is definitely not our network. I set up a scheduled test (placed inside the app, single thread) that resolved all my hosts via InetAddress every 5 minutes. Each time we got the right IP; even after restarting all slaves I got the right IP address, while at the same time Redisson was not able to reconnect. Based on that I suspected something had changed in Redisson recently, so I tried downgrading to 3.26.0, and boom: it started working again as it should. So basically on version 3.29.0 all our applications have the same issue: the client is unable to reconnect and constantly keeps trying. Could it be that existing connections in the connection pool are reused to "reconnect" instead of new ones being created? (That is the suspicion I formed when seeing the same IP in the logs after restarts.) The logs below appear after every slave restart only on 3.26.0, not on 3.29.0. Now, after reconnect, I can see in the logs (on 3.26.0):

```
1 connections initialized for 10.1.65.24/10.1.65.24:6379
master 10.1.65.24/10.1.65.24:6379 used as slave
24 connections initialized for 10.1.65.24/10.1.65.24:6379
sentinel: redis://10.1.65.23:26379 is down
sentinel: redis://10.1.69.26:26379 added
1 connections initialized for locking-redis-node-1.locking-redis-headless.default.svc.cluster.local/10.1.69.26:6379
slave: locking-redis-node-1.locking-redis-headless.default.svc.cluster.local/10.1.69.26:6379 added
24 connections initialized for locking-redis-node-1.locking-redis-headless.default.svc.cluster.local/10.1.69.26:6379
```

What I've spotted (on 3.26.0) is that the "connections initialized for" log line is printed each time we restart the nodes, so the client does reconnect.
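The periodic DNS check described above can be sketched with plain JDK classes (the host name and schedule are illustrative; the real test resolved the Redis node hostnames every 5 minutes):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

public class DnsCheck {
    // Resolve a host and return its current IP addresses as strings.
    // Running this on a schedule showed that DNS returned fresh IPs
    // even while the client kept reconnecting to the stale address.
    public static String[] resolve(String host) throws UnknownHostException {
        return Arrays.stream(InetAddress.getAllByName(host))
                     .map(InetAddress::getHostAddress)
                     .toArray(String[]::new);
    }

    public static void main(String[] args) throws UnknownHostException {
        for (String addr : resolve("localhost")) {
            System.out.println(addr);
        }
    }
}
```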

mrniko commented 6 months ago

Can you try 3.27.0 version?

lukasz-weareuncapped commented 5 months ago

If I'm not mistaken we already ran some tests with the 3.27.0 version. Anyway, it's been a while since I did that, so let me bump it up and verify the results.

Btw, I saw one fix for a similar issue to ours here -> https://github.com/redisson/redisson/issues/5857, made by you @mrniko. Isn't it already released with 3.32.0?

mrniko commented 5 months ago

@lukasz-weareuncapped

Btw, I saw one fix for a similar issue to ours here -> https://github.com/redisson/redisson/issues/5857, made by you @mrniko. Isn't it already released with 3.32.0?

Yes