tair-opensource / RedisShake

RedisShake is a Redis data processing and migration tool.
https://tair-opensource.github.io/RedisShake/
MIT License

Split brain issue when migrating from sentinel to sentinel #846

Open tasszz2k opened 3 months ago

tasszz2k commented 3 months ago

Issue Description

Environment

Logs

source extra config
2024-07-29 08:56:54 INF GOOS: linux, GOARCH: amd64
2024-07-29 08:56:54 INF Ncpu: 2, GOMAXPROCS: 2
2024-07-29 08:56:54 INF pid: 136
2024-07-29 08:56:54 INF pprof_port: 6060
2024-07-29 08:56:54 INF metrics url: http://localhost:8080
2024-07-29 08:56:54 INF auth successful. address=[rs-host:6379]
2024-07-29 08:56:54 INF redisWriter connected to redis successful. address=[rs-host:6379]
2024-07-29 08:56:54 INF no password. address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF psyncReader connected to redis successful. address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF start save RDB. address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF send [replconf listening-port 10007]
2024-07-29 08:56:54 INF send [PSYNC ? -1]
2024-07-29 08:56:54 INF receive [FULLRESYNC f239bc3e8e9082b1987b4829b8f0d65658f890af 1440733]
2024-07-29 08:56:54 INF source db is doing bgsave. address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF source db bgsave finished. timeUsed=[0.05]s, address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF received rdb length. length=[178]
2024-07-29 08:56:54 INF create dump.rdb file. filename_path=[dump.rdb]
2024-07-29 08:56:54 INF save RDB finished. address=[rs-temp4.redis-sentinel-dev:6379], total_bytes=[178]
2024-07-29 08:56:54 INF start send RDB. address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF start save AOF. address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF RDB version: 9
2024-07-29 08:56:54 INF AOFWriter open file. filename=[1440733.aof]
2024-07-29 08:56:54 INF RDB AUX fields. key=[redis-ver], value=[6.2.7]
2024-07-29 08:56:54 INF RDB AUX fields. key=[redis-bits], value=[64]
2024-07-29 08:56:54 INF RDB AUX fields. key=[ctime], value=[1722243414]
2024-07-29 08:56:54 INF RDB AUX fields. key=[used-mem], value=[1923136]
2024-07-29 08:56:54 INF RDB repl-stream-db: 0
2024-07-29 08:56:54 INF RDB AUX fields. key=[repl-id], value=[f239bc3e8e9082b1987b4829b8f0d65658f890af]
2024-07-29 08:56:54 INF RDB AUX fields. key=[repl-offset], value=[1440733]
2024-07-29 08:56:54 INF RDB AUX fields. key=[aof-preamble], value=[0]
2024-07-29 08:56:54 INF send RDB finished. address=[rs-temp4.redis-sentinel-dev:6379], repl-stream-db=[0]
2024-07-29 08:56:55 INF AOFReader open file. aof_filename=[1440733.aof]
2024-07-29 08:57:04 INF Detect data sent by reader, stop pinging
2024-07-29 08:57:04 INF goroutine 21 [running]:  [runtime/debug.Stack()]<-runtime/debug/stack.go:24 +0x65  [github.com/alibaba/RedisShake/internal/log.Panicf({0x7b87fc, 0x45}, {0xc00008b778, 0x4, 0x4})]<-github.com/alibaba/RedisShake/internal/log/func.go:27 +0x36  [github.com/alibaba/RedisShake/internal/writer.(*redisWriter).flushInterval(0xc000267480)]<-github.com/alibaba/RedisShake/internal/writer/redis.go:88 +0x369  [created by github.com/alibaba/RedisShake/internal/writer.NewRedisWriter]<-github.com/alibaba/RedisShake/internal/writer/redis.go:37 +0x19c  [
2024-07-29 08:57:04 PNC redisWriter received error. error=[EOF], argv=[ping], slots=], reply=[<nil>]
panic: redisWriter received error. error=[EOF], argv=[ping], slots=], reply=[<nil>]

goroutine 21 [running]:
github.com/rs/zerolog.(*Logger).Panic.func1({0xc0001b00a0, 0x0})
    github.com/rs/zerolog@v1.28.0/log.go:375 +0x2d
github.com/rs/zerolog.(*Event).msg(0xc000112300, {0xc0001b00a0, 0x4d})
    github.com/rs/zerolog@v1.28.0/event.go:156 +0x2b8
github.com/rs/zerolog.(*Event).Msgf(0xc000112300, {0x7b87fc, 0x21d}, {0xc0000d1f78, 0x7a03e6, 0x3})
    github.com/rs/zerolog@v1.28.0/event.go:129 +0x4e
github.com/alibaba/RedisShake/internal/log.Panicf({0x7b87fc, 0x45}, {0xc0000d1f78, 0x4, 0x4})
    github.com/alibaba/RedisShake/internal/log/func.go:32 +0xef
github.com/alibaba/RedisShake/internal/writer.(*redisWriter).flushInterval(0xc000267480)
    github.com/alibaba/RedisShake/internal/writer/redis.go:88 +0x369
created by github.com/alibaba/RedisShake/internal/writer.NewRedisWriter
    github.com/alibaba/RedisShake/internal/writer/redis.go:37 +0x19c
Stream closed EOF for zlpsaas-dev/redis-migration-532a16bf-903b-4e25-97ee-8c7793c0e095-0-zpnkf (redis-shake)

Additional Information

Redis Source Cluster (1 master, 1 slave and 3 sentinel servers)

I have no name!@rfs-source5-75659ff8b7-bp7gw:/data$ redis-cli -p 26379
127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "172.16.175.242"
    5) "port"
    6) "6379"
    7) "runid"
    8) "94001b2edb7702056817fa3374bda2a3eaee47fe"
    9) "flags"
   10) "master"
127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "172.16.80.247:6379"
    3) "ip"
    4) "172.16.80.247"
    5) "port"
    6) "6379"
    7) "runid"
    8) "54a241b5cb25790ff1970fdeba04bd01685da3ca"
    9) "flags"
   10) "slave"

#<== Failover here

127.0.0.1:26379> sentinel failover mymaster 
OK
127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "172.16.80.247"
    5) "port"
    6) "6379"
    7) "runid"
    8) "54a241b5cb25790ff1970fdeba04bd01685da3ca"
    9) "flags"
   10) "master"

Redis Destination Cluster (1 master, 0 slaves and 3 sentinel servers)

I have no name!@rfs-dest5-5b697974fd-kh6fp:/data$ redis-cli -p 26379
127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "172.16.175.244"
    5) "port"
    6) "6379"
    7) "runid"
    8) "afd550514d067234f8b6a5cebc9809201ef67014"
    9) "flags"
   10) "master"
   11) "link-pending-commands"
   12) "0"
127.0.0.1:26379> sentinel slaves mymaster
(empty array)

#<== Failover here

127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "172.16.80.247" # <== the master node of the source cluster
    5) "port"
    6) "6379"
    7) "runid"
    8) ""
    9) "flags"
   10) "master,disconnected"
127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "172.16.175.244:6379" # <== the "old" master node of the destination cluster
    3) "ip"
    4) "172.16.175.244"
    5) "port"
    6) "6379"
    7) "runid"
    8) "afd550514d067234f8b6a5cebc9809201ef67014"
    9) "flags"
   10) "slave"
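
The symptom above (the destination Sentinels reporting the source master as their own) can be checked mechanically. Below is a minimal sketch in Python that works on the flat field/value replies shown in the sessions above. The helper names `reply_to_dict` and `foreign_masters` and the `DEST_IPS` set are made up for illustration, and the replies are pasted in by hand rather than fetched from a live Sentinel.

```python
from typing import Dict, List, Set


def reply_to_dict(flat: List[str]) -> Dict[str, str]:
    """Turn a flat [field, value, field, value, ...] Sentinel reply into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


def foreign_masters(masters: List[List[str]], own_ips: Set[str]) -> List[Dict[str, str]]:
    """Return the master entries whose IP is not one of this cluster's own nodes."""
    return [m for m in map(reply_to_dict, masters) if m["ip"] not in own_ips]


# Destination cluster nodes, taken from the session above.
DEST_IPS = {"172.16.175.244"}

# `sentinel masters` on the destination before the source failover.
before = [[
    "name", "mymaster", "ip", "172.16.175.244", "port", "6379",
    "runid", "afd550514d067234f8b6a5cebc9809201ef67014", "flags", "master",
]]

# `sentinel masters` on the destination after the source failover.
after = [[
    "name", "mymaster", "ip", "172.16.80.247", "port", "6379",
    "runid", "", "flags", "master,disconnected",
]]

print(foreign_masters(before, DEST_IPS))  # []
bad = foreign_masters(after, DEST_IPS)
print(bad[0]["ip"], bad[0]["flags"])      # 172.16.80.247 master,disconnected
```

A check like this could run periodically against the destination Sentinels during a migration, so a foreign master address is noticed before clients follow it.
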
suxb201 commented 3 months ago

Which RedisShake version are you running? Does it include the fix for this issue: https://github.com/tair-opensource/RedisShake/issues/656 (513fc62a)?

tasszz2k commented 3 months ago

The version we are currently using is https://github.com/tair-opensource/RedisShake/commit/2937df8a3ad839efd129882a8c1f986fd0bc1eab

suxb201 commented 3 months ago

Try the latest version, or modify the code to filter out __sentinel__:hello, just like what 513fc62a did.
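
Some background on why this channel matters, as a hedged sketch rather than a description of RedisShake's code: per the Sentinel source, each Sentinel periodically publishes a hello message on `__sentinel__:hello` with eight comma-separated fields (sentinel ip, port, runid, current epoch, master name, master ip, master port, master config epoch). If those PUBLISH commands are replayed on the destination, the destination Sentinels see hellos for a master with the same name (`mymaster`), and a higher config epoch would plausibly make them adopt the advertised address. The Python below parses such a payload and applies a simplified version of that rule. The sentinel address, runid and epoch values in the sample are invented, and `would_override` is a hypothetical helper that ignores the other checks Sentinel performs.

```python
from typing import NamedTuple


class Hello(NamedTuple):
    sentinel_ip: str
    sentinel_port: int
    sentinel_runid: str
    current_epoch: int
    master_name: str
    master_ip: str
    master_port: int
    master_config_epoch: int


def parse_hello(payload: str) -> Hello:
    """Split a __sentinel__:hello payload into its eight fields."""
    parts = payload.split(",")
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}")
    ip, port, runid, epoch, name, m_ip, m_port, m_epoch = parts
    return Hello(ip, int(port), runid, int(epoch), name, m_ip, int(m_port), int(m_epoch))


def would_override(hello: Hello, local_master_name: str, local_config_epoch: int) -> bool:
    """Simplified rule: same master name and a newer config epoch wins."""
    return (hello.master_name == local_master_name
            and hello.master_config_epoch > local_config_epoch)


# Invented hello as a source Sentinel might send it after the failover above.
sample = "10.0.0.5,26379,0123456789abcdef0123456789abcdef01234567,1,mymaster,172.16.80.247,6379,1"

hello = parse_hello(sample)
print(hello.master_name, hello.master_ip)    # mymaster 172.16.80.247
print(would_override(hello, "mymaster", 0))  # True
```

Under these assumptions, filtering the hello messages out of the replicated stream keeps the two Sentinel groups from learning about each other.
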

tasszz2k commented 3 months ago

> Try the latest version, or modify the code to filter out __sentinel__:hello, just like what 513fc62a did.

let me try it. thx

tasszz2k commented 3 months ago

thank you @suxb201

it saves the day

tasszz2k commented 3 months ago

However, filtering only on the condition cmd_name == "PUBLISH" and keys[1] == "__sentinel__:hello" does not work. After I widened the condition to cmd_name == "PUBLISH" and (keys[1] == nil or keys[1] == '' or keys[1] == "__sentinel__:hello"), it works normally.

The final filter.lua is:

-- Return 0 to allow a command, 1 to disallow it.
function filter(id, is_base, group, cmd_name, keys, slots, db_id, timestamp_ms)
    -- Drop replication housekeeping commands.
    if cmd_name == "PING" then
        return 1, db_id -- disallow
    end
    if cmd_name == "REPLCONF" then
        return 1, db_id -- disallow
    end
    if cmd_name == "OPINFO" then
        return 1, db_id -- disallow
    end
    -- Drop Sentinel hello messages. Matching the channel name alone did not
    -- work, so a nil or empty keys[1] is treated the same way.
    if cmd_name == "PUBLISH" and (keys[1] == nil or keys[1] == '' or keys[1] == "__sentinel__:hello") then
        return 1, db_id -- disallow
    end

    return 0, db_id -- allow and keep the same db_id
end
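
To see which commands the filter above lets through, here is a small Python model of the same decision logic that can be run offline. It is only a sketch: RedisShake itself takes the Lua version, `filter_cmd` is a made-up name, and Python's `keys[0]` stands in for Lua's `keys[1]`. One possible reason the narrow check failed (an assumption, not confirmed in this thread) is that the channel of a PUBLISH is not reported as a key, so `keys` arrives empty.

```python
from typing import List, Tuple

ALLOW, DISALLOW = 0, 1
SKIPPED = {"PING", "REPLCONF", "OPINFO"}
HELLO_CHANNEL = "__sentinel__:hello"


def filter_cmd(cmd_name: str, keys: List[str], db_id: int) -> Tuple[int, int]:
    """Mirror of filter.lua: return (code, db_id), where code 1 drops the command."""
    if cmd_name in SKIPPED:
        return DISALLOW, db_id
    first = keys[0] if keys else None  # Lua: keys[1], nil when the table is empty
    if cmd_name == "PUBLISH" and first in (None, "", HELLO_CHANNEL):
        return DISALLOW, db_id
    return ALLOW, db_id


cases = [
    ("SET", ["foo"]),
    ("PING", []),
    ("PUBLISH", [HELLO_CHANNEL]),
    ("PUBLISH", []),
    ("PUBLISH", ["news"]),
]
for cmd, keys in cases:
    code, _ = filter_cmd(cmd, keys, 0)
    print(cmd, keys, "disallow" if code == DISALLOW else "allow")
```

Note that under this model every PUBLISH whose key list is empty is dropped, not just Sentinel hellos, so application pub/sub traffic may not be migrated either if it arrives without keys.
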