redis / redis

Redis is an in-memory database that persists on disk. The data model is key-value, but many different kinds of values are supported: Strings, Lists, Sets, Sorted Sets, Hashes, Streams, HyperLogLogs, Bitmaps.
http://redis.io

Sentinel Corruption : Redis 5.0.5 #11384

Open geekthread opened 2 years ago

geekthread commented 2 years ago

We are on Redis 5.0.5 and are seeing read-only replica errors on our downstream systems.
On further debugging, we found that the sentinel on the master node was incorrectly pointing to the slave node.

At present we are not sure about the root cause, but we see this issue when we do a blue-green upgrade of a direct dependent of Redis.

All the sentinels should report the actual master.

Output of redis-cli info from all nodes: [image]

Output of sentinel masters from node 2 (actual master): [image]

Output of sentinel masters from node 1 (slave): [image]
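Since the screenshots are not reproduced here, the comparison they illustrate can be sketched in code. The following is only a minimal Python sketch, assuming each sentinel's reply for one monitored master has already been collected as a flat field/value list (the shape `SENTINEL masters` uses for each master entry). The function names, sentinel labels, and 192.0.2.x addresses are hypothetical and are not taken from this issue.

```python
from collections import Counter


def pairs_to_dict(flat_reply):
    """Turn a flat [field, value, field, value, ...] list into a dict."""
    return dict(zip(flat_reply[0::2], flat_reply[1::2]))


def find_disagreeing_sentinels(replies):
    """Compare the master address reported by each sentinel.

    `replies` maps a (hypothetical) sentinel label to the flat field/value
    list that sentinel returned for one monitored master.
    Returns (majority_address, outliers), where outliers maps the label of
    every sentinel that reports a different address to that address.
    """
    addresses = {}
    for label, flat in replies.items():
        info = pairs_to_dict(flat)
        addresses[label] = (info["ip"], int(info["port"]))
    majority, _count = Counter(addresses.values()).most_common(1)[0]
    outliers = {label: addr for label, addr in addresses.items() if addr != majority}
    return majority, outliers


# Illustrative data only: two sentinels agree, one points somewhere else.
sample_replies = {
    "sentinel-1": ["name", "mymaster", "ip", "192.0.2.10", "port", "6379",
                   "flags", "s_down,master"],
    "sentinel-2": ["name", "mymaster", "ip", "192.0.2.11", "port", "6379",
                   "flags", "master"],
    "sentinel-3": ["name", "mymaster", "ip", "192.0.2.11", "port", "6379",
                   "flags", "master"],
}
majority, outliers = find_disagreeing_sentinels(sample_replies)
```

A non-empty `outliers` result would correspond to the situation described in this issue, where one sentinel reports a different master than the others. Taking the majority as the reference is an assumption of the sketch; the majority is not guaranteed to be the node that actually holds the master role.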

sentinel_6379.conf.txt sentinel_6381.conf.txt sentinel_6382.conf.txt redis_6379.conf.txt redis_6381.conf.txt redis_6382.conf.txt

moticless commented 2 years ago

Hi @geekthread ,

I need to know that the “actual” configuration was valid and that all sentinels really pointed to a single master at start (rather than each sentinel pointing to a different instance). That is, my guess is that the actual values of those placeholders are invalid and that each sentinel is configured with a different Redis instance instead of all of them pointing to a single master. Pointing to the same master is how the sentinels find each other at start.
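One way to check this is to compare the `sentinel monitor` lines across the sentinel configuration files. The following is a rough Python sketch under the assumption that the files use the standard `sentinel monitor <name> <ip> <port> <quorum>` form. The helper names and the sample addresses are hypothetical.

```python
def monitored_masters(conf_text):
    """Return {master_name: (host, port)} from 'sentinel monitor' lines."""
    masters = {}
    for line in conf_text.splitlines():
        parts = line.split()
        # Expected form: sentinel monitor <name> <ip> <port> <quorum>
        if len(parts) == 6 and parts[:2] == ["sentinel", "monitor"]:
            masters[parts[2]] = (parts[3], int(parts[4]))
    return masters


def masters_agree(conf_texts):
    """True when every config monitors the same masters at the same address."""
    seen = [monitored_masters(text) for text in conf_texts]
    return all(m == seen[0] for m in seen[1:])


# Illustrative configs only (not the files attached to this issue).
conf_a = "port 26379\nsentinel monitor mymaster 192.0.2.10 6379 2\n"
conf_b = "port 26381\nsentinel monitor mymaster 192.0.2.10 6379 2\n"
conf_c = "port 26382\nsentinel monitor mymaster 192.0.2.11 6381 2\n"
```

Here `masters_agree([conf_a, conf_b])` is true and `masters_agree([conf_a, conf_c])` is false. Sentinel rewrites its own configuration file after a failover, so this comparison is most meaningful on the files as they were at initial startup.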

iav20 commented 2 years ago

Hi @moticless

Please find the answers below.

  1. Regarding the error stack trace, please refer below:

"msg":"Operation Failed on flc_10134_cust01-prd04-ins01-wfm14-fnt.int.prd.mykronos.com_employeegroups Cache. Total Attempt: 1 of max Attempts 3" ,"ERR": { "org.springframework.data.redis.connection.RedisPipelineException: Pipeline contained one or more invalid commands; nested exception is org.springframework.dao.InvalidDataAccessApiUsageException: READONLY You can't write against a read only replica.; nested exception is redis.clients.jedis.exceptions.JedisDataException: READONLY You can't write against a read only replica." : [["JedisConnection.java",398,"org.springframework.data.redis.connection.jedis.JedisConnection.convertPipelineResults"],["JedisConnection.java",363,"org.springframework.data.redis.connection.jedis.JedisConnection.closePipeline"],["",-1,"sun.reflect.GeneratedMethodAccessor929.invoke"],["DelegatingMethodAccessorImpl.java",43,"sun.reflect.DelegatingMethodAccessorImpl.invoke"],["Method.java",498,"java.lang.reflect.Method.invoke"],["CloseSuppressingInvocationHandler.java",61,"org.springframework.data.redis.core.CloseSuppressingInvocationHandler.invoke"],["",-1,"com.sun.proxy.$Proxy1186.closePipeline"],["KronosRedisTemplate.java",62,"com.kronos.cache.impl.configuration.KronosRedisTemplate.lambda$executePipelined$1"],["RedisTemplate.java",228,"org.springframework.data.redis.core.RedisTemplate.execute"],["RedisTemplate.java",188,"org.springframework.data.redis.core.RedisTemplate.execute"],["RedisTemplate.java",175,"org.springframework.data.redis.core.RedisTemplate.execute"]

  2. Regarding the sequence of steps, please refer to:

  9) "flags"    10) "s_down,master"

  3. Regarding which version we upgraded to:

We were doing an express upgrade of our client nodes, in which new nodes are added to the cluster and old nodes are removed from it.

  4. Regarding whether this is reproducible without the upgrade flow:

Normally this error occurs only during an express upgrade, but we have also seen a few occurrences where the Redis dumps were not cleared and a rolling restart of the Redis cluster caused this error.

  5. Regarding checking the status before the flow:

We did not check the status before the flow, but health was fine, as we did not receive any error alert.

  6. Regarding the configuration files:

Sentinel.conf from node 1

port 26379
logfile "/data/redis-sentinel/logs/redis-sentinel.log"
dir "/tmp"
protected-mode no
sentinel myid 229f84abda93278a00ae8406e95e4227e96fa881
sentinel deny-scripts-reconfig yes
sentinel monitor cust01-prd04-ins01-wfm14-dmc-1.int.prd.mykronos.com 10.249.97.147 6379 2
sentinel down-after-milliseconds cust01-prd04-ins01-wfm14-dmc-1.int.prd.mykronos.com 10000
sentinel failover-timeout cust01-prd04-ins01-wfm14-dmc-1.int.prd.mykronos.com 30000
sentinel config-epoch cust01-prd04-ins01-wfm14-dmc-1.int.prd.mykronos.com 9
sentinel leader-epoch cust01-prd04-ins01-wfm14-dmc-1.int.prd.mykronos.com 9
sentinel known-replica cust01-prd04-ins01-wfm14-dmc-1.int.prd.mykronos.com 10.249.97.61 6381
sentinel known-replica cust01-prd04-ins01-wfm14-dmc-1.int.prd.mykronos.com 10.249.97.149 6382
sentinel known-sentinel cust01-prd04-ins01-wfm14-dmc-1.int.prd.mykronos.com 10.249.97.149 26382 fe8176bfbdd1eef28135d5b58337f60c7b28df66
sentinel known-sentinel cust01-prd04-ins01-wfm14-dmc-1.int.prd.mykronos.com 10.249.97.61 26381 c1e683dea659370a0df83fdd3226432466eae4d6
sentinel rename-command cust01-prd04-ins01-wfm14-dmc-1.int.prd.mykronos.com CONFIG 2d1e1630f50e11eb9a030242ac130003
sentinel current-epoch 9
sentinel announce-ip "10.249.97.147"
sentinel announce-port 26379

Redis.conf from node 1

daemonize no
pidfile "/var/run/redis.pid"
port 6379
tcp-backlog 511
timeout 1440
tcp-keepalive 0
loglevel notice
logfile "/data/redis/logs/redis-server.log"
databases 16
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
replica-serve-stale-data yes
replica-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
repl-timeout 600
repl-backlog-size 100mb
replica-priority 100
maxmemory 31gb
maxmemory-policy volatile-lru
appendonly no
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 2147483684 1gb 120
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
protected-mode no
activedefrag yes
active-defrag-cycle-min 1
active-defrag-cycle-max 25
rename-command CONFIG 2d1e1630f50e11eb9a030242ac130003
dir "/"

Do let us know if more info is required

Regards, Apoorv

moticless commented 2 years ago

Hi @geekthread,

iav20 commented 2 years ago

cust01-prd04-ins01-wfm14-dmc-1632814069-3.txt

cust01-prd04-ins01-wfm14-dmc-1632814069-1.txt

cust01-prd04-ins01-wfm14-dmc-1632814069-2.txt

Please find attached the redis and sentinel.conf files from all three nodes.

Best Regards, Apoorv

moticless commented 2 years ago

Hi @iav20, like I said, I think you should first understand why you get "You can't write against a read only replica".

I see that you are running in Docker. I guess that you have 3 containers, each running one sentinel and one replica. Based on your replica configuration, it looks like you are running without any NAT (otherwise you would have replica-announce-* set). Yet you configured the sentinels as if they were in a NAT environment, with announce-ip. If I understand your setup correctly, I think you should simplify your configuration by removing the redundant sentinel announce-* options.
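The mismatch described above (announce options on the sentinel side but not on the replica side) can be detected from the two configuration files. This is only a sketch of that heuristic in Python, not an official validation rule. The function names are hypothetical, and the sample values are the ones quoted from node 1 earlier in this thread.

```python
def directives(conf_text):
    """Collect directive names from a redis.conf or sentinel.conf text."""
    names = set()
    for line in conf_text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if parts[0] == "sentinel" and len(parts) > 1:
            # Sentinel options are two words, e.g. "sentinel announce-ip".
            names.add("sentinel " + parts[1])
        else:
            names.add(parts[0])
    return names


def announce_mismatch(redis_conf, sentinel_conf):
    """True when only one of the two files sets announce-* options."""
    replica_side = any(
        name.startswith("replica-announce-") for name in directives(redis_conf)
    )
    sentinel_side = any(
        name.startswith("sentinel announce-") for name in directives(sentinel_conf)
    )
    return replica_side != sentinel_side


# Excerpts of the node 1 files posted above.
redis_conf = 'port 6379\nreplica-read-only yes\nreplica-priority 100\n'
sentinel_conf = (
    'port 26379\n'
    'sentinel announce-ip "10.249.97.147"\n'
    'sentinel announce-port 26379\n'
)
mismatch = announce_mismatch(redis_conf, sentinel_conf)
```

For these excerpts `mismatch` is true. A mismatch is not necessarily an error. It only flags the inconsistency that the suggestion above proposes to remove.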

geekthread commented 1 year ago

Hello @iav20, yes, we are on Docker. As you suggested, we will remove the sentinel announce-* config and do some tests. We will keep this thread updated.