Redis 4.0.8 cluster, slaver crash during bgsave

qingyuan18 commented 5 years ago

We have a Redis 4.0.8 cluster with 3 masters and 3 slaves. One or two of the slave nodes always crash during bgsave. The crash report is below:

Suspect RAM error? Use redis-server --test-memory to verify it.

17841:S 28 Aug 02:19:51.394 # Background saving terminated by signal 11
17841:S 28 Aug 02:19:52.021 10000 changes in 120 seconds. Saving...
17841:S 28 Aug 02:20:01.355 Background saving started by pid 21707

=== REDIS BUG REPORT START: Cut & paste starting from here ===
21707:C 28 Aug 02:25:41.946 # ------------------------------------------------
21707:C 28 Aug 02:25:41.947 # !!! Software Failure. Press left mouse button to continue
21707:C 28 Aug 02:25:41.947 # Guru Meditation: Unknown object type #rdb.c:628
21707:C 28 Aug 02:25:41.947 # (forcing SIGSEGV in order to print the stack trace)
21707:C 28 Aug 02:25:41.947 # ------------------------------------------------
21707:C 28 Aug 02:25:41.947 # Redis 4.0.8 crashed by signal: 11
21707:C 28 Aug 02:25:41.947 # Crashed running the instruction at: 0x466fd3
21707:C 28 Aug 02:25:41.947 # Accessing address: 0xffffffffffffffff
21707:C 28 Aug 02:25:41.947 # Failed assertion: (:0)

------ STACK TRACE ------
EIP:
redis-rdb-bgsave 10.146.14.17:8003 cluster[0x466fd3]

Backtrace:
redis-rdb-bgsave 10.146.14.17:8003 cluster[0x466dfc]
redis-rdb-bgsave 10.146.14.17:8003 cluster[0x468033]
/lib64/libpthread.so.0[0x30a3e0f790]
redis-rdb-bgsave 10.146.14.17:8003 cluster[0x466fd3]
redis-rdb-bgsave 10.146.14.17:8003 cluster[0x446ed7]
redis-rdb-bgsave 10.146.14.17:8003 cluster[0x4488d7]
redis-rdb-bgsave 10.146.14.17:8003 cluster[0x448bc6]
redis-rdb-bgsave 10.146.14.17:8003 cluster[0x448d95]
redis-rdb-bgsave 10.146.14.17:8003 cluster[0x449060]
redis-rdb-bgsave 10.146.14.17:8003 cluster[0x42d4d0]
redis-rdb-bgsave 10.146.14.17:8003 cluster[0x424d7d]
redis-rdb-bgsave 10.146.14.17:8003 cluster[0x424f2b]
redis-rdb-bgsave 10.146.14.17:8003 cluster[0x42db42]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x30a3a1ed5d]
redis-rdb-bgsave 10.146.14.17:8003 [cluster][0x422319]

The code at rdb.c:628 is the default branch of the type switch, which panics when the object's type is not one of the known types:

case OBJ_ZSET:
    if (o->encoding == OBJ_ENCODING_ZIPLIST)
        return rdbSaveType(rdb,RDB_TYPE_ZSET_ZIPLIST);
    else if (o->encoding == OBJ_ENCODING_SKIPLIST)
        return rdbSaveType(rdb,RDB_TYPE_ZSET_2);
    else
        serverPanic("Unknown sorted set encoding");
case OBJ_HASH:
    if (o->encoding == OBJ_ENCODING_ZIPLIST)
        return rdbSaveType(rdb,RDB_TYPE_HASH_ZIPLIST);
    else if (o->encoding == OBJ_ENCODING_HT)
        return rdbSaveType(rdb,RDB_TYPE_HASH);
    else
        serverPanic("Unknown hash encoding");
case OBJ_MODULE:
    return rdbSaveType(rdb,RDB_TYPE_MODULE_2);
default:
    serverPanic("Unknown object type");
}
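One way to read the excerpt above: the `default` branch is reached only when the object's type field holds a value that matches none of the known type codes. That usually points to a corrupted object header (bad RAM or a memory-corruption bug) rather than to the contents of the zsets themselves, which would fit the "Suspect RAM error?" hint in the log. The following is a minimal Python model of that check, not Redis code. It assumes the Redis 4.0 type codes (`OBJ_STRING` = 0 through `OBJ_MODULE` = 5) and a 4-bit type field; the function names are hypothetical.

```python
# Python model of the type dispatch shown in the rdb.c excerpt.
# Assumption: type codes as in Redis 4.0 (OBJ_STRING=0 ... OBJ_MODULE=5).
OBJ_STRING, OBJ_LIST, OBJ_SET, OBJ_ZSET, OBJ_HASH, OBJ_MODULE = range(6)
KNOWN_TYPES = {OBJ_STRING, OBJ_LIST, OBJ_SET, OBJ_ZSET, OBJ_HASH, OBJ_MODULE}

# Assumption: the object header stores the type in a 4-bit field,
# so it can physically hold any value from 0 to 15.
TYPE_FIELD_BITS = 4


def is_known_type(type_field: int) -> bool:
    """Return True if the switch would take one of the named cases."""
    return type_field in KNOWN_TYPES


def panicking_values() -> list:
    """Every value of the type field that falls through to `default`."""
    return [v for v in range(2 ** TYPE_FIELD_BITS) if not is_known_type(v)]


if __name__ == "__main__":
    # A zset is a known type, so valid zset data cannot trigger the panic.
    print(is_known_type(OBJ_ZSET))
    # Only out-of-range values reach serverPanic("Unknown object type").
    print(panicking_values())
```

Under these assumptions, no value that Redis itself writes into the type field can reach the panic, which is why running `redis-server --test-memory` on the affected hosts, as the crash report suggests, is a reasonable first step.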

Our data consists of zsets whose members are 18-character strings, plus plain string keys:

[00.00%] Biggest string found so far 'music_user:a2d4ee50-c0e5-4712-b059-8d79c4e387f8' with 11 bytes
[00.00%] Biggest zset found so far 'video:long:15252799960' with 100 members
[00.00%] Biggest zset found so far 'music:fm:18701531951' with 279 members
[00.00%] Biggest zset found so far 'music:fm:355981050728643' with 289 members
[00.00%] Biggest zset found so far 'music:fm:867483029348786' with 290 members

10.146.14.15:7001> zrange music:fm:867483029348786 0 -1
1) "600902000007331809"
2) "600907000007852796"
3) "600907000004182649"
4) "600902000009205261"
5) "600907000008439207"

Can anyone help identify the cause and how to fix it? So far I haven't seen any errors in the client code. Do I need to upgrade Redis to 4.0.9?

Any suggestions would be appreciated! Thanks

qingyuan18 commented 5 years ago

Any suggestions?

BTW: we found that master-slave synchronization transfers 40 GB+ every time (screenshot _20180829103834). It looks like the master and slave perform a full synchronization of the node's data each time, not an incremental one.

itamarhaber commented 5 years ago

Hello @qingyuan18

Upgrading to the latest version is generally recommended. Also, please include the full crash report in the issue.

BTW: the following is from Redis' documentation

However if there is not enough backlog in the master buffers, or if the slave is referring to an history (replication ID) which is no longer known, than a full resynchronization happens: in this case the slave will get a full copy of the dataset, from scratch.
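The condition quoted above can be sketched as a small decision function. This is a simplified Python model, not the Redis implementation: it assumes the master keeps a single replication ID and a contiguous backlog window, it ignores the secondary replication ID that Redis 4.0 keeps after a failover, and the names `MasterState`, `can_partial_resync`, and `min_backlog_bytes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MasterState:
    replid: str          # current replication ID of the master
    backlog_start: int   # replication offset of the first byte still in the backlog
    backlog_len: int     # number of bytes currently held in the backlog


def can_partial_resync(master: MasterState, slave_replid: str, slave_offset: int) -> bool:
    """True if the slave can continue from its offset; False means full resync."""
    if slave_replid != master.replid:
        # The slave refers to a history the master no longer knows.
        return False
    backlog_end = master.backlog_start + master.backlog_len
    # The next byte the slave needs must still be inside the backlog window.
    return master.backlog_start <= slave_offset <= backlog_end


def min_backlog_bytes(write_bytes_per_sec: int, max_disconnect_sec: int) -> int:
    """Rough backlog size needed to survive a disconnect of the given length."""
    return write_bytes_per_sec * max_disconnect_sec


if __name__ == "__main__":
    master = MasterState(replid="abc", backlog_start=1000, backlog_len=500)
    print(can_partial_resync(master, "abc", 1200))  # offset inside the window
    print(can_partial_resync(master, "abc", 999))   # offset already overwritten
    print(min_backlog_bytes(2_000_000, 60))         # illustrative write rate only
```

The write rate and disconnect time in the last call are made-up numbers for illustration. The point is that the backlog, sized by the `repl-backlog-size` setting (1mb by default), has to cover everything written while the slave was away, otherwise every reconnect falls back to a full copy of the dataset.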

redis/redis, issue #5287