[BUG]In big cluster Redis (Valkey) replica node is not able to sync since 6.2.11

hwware commented 2 months ago

Describe the bug

Reference: Redis issue https://github.com/redis/redis/issues/12001. Since Redis PR https://github.com/redis/redis/pull/11785 was introduced, once a single node holds more than 7 GB of data, a replica node cannot sync with its primary node.

To reproduce

Steps taken from Redis issue https://github.com/redis/redis/issues/12001 and this comment: https://github.com/redis/redis/issues/12001#issuecomment-1743066121

1. Create a cluster with 6 nodes (3 masters, 3 replicas).
2. Fill each node with 7 GB or more of data.
3. Restart a replica.
4. With high probability the replica fails to sync, with an error like: == CRITICAL == This replica is sending an error to its master: 'Protocol error: too big inline request' after processing the command ''. Sometimes, though, it finishes successfully (~10% of cases).

Expected behavior

The replica node should be able to sync with the primary node.


zuiderkwast commented 2 months ago

Is it starvation of the cluster bus?

Maybe something like dual channel sync can help? The RDB would be handled in the fork process, so the main process can still talk to the cluster bus...

shanipribadi commented 2 months ago

The referenced issue also has an open PR that attempts to fix it: https://github.com/redis/redis/pull/13308

Basically, the TCP keepalive idle time of the Redis inbound cluster connection is set to 2 * cluster node timeout, which people typically set to quite aggressive values (seconds). The TCP keepalive interval is currently set to 1/3 of the idle time.

https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L1412
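To make the arithmetic concrete, here is a rough Python sketch of how those keepalive values fall out of the cluster node timeout. The function names are hypothetical, and the 1-second floor on the interval and the probe count of 3 are assumptions of this sketch, not something stated above.

```python
def cluster_keepalive_params(cluster_node_timeout_ms):
    """Return (idle_s, interval_s, probes) for an inbound cluster bus link."""
    # Idle time: 2 * cluster node timeout, converted from ms to whole seconds.
    idle_s = cluster_node_timeout_ms // 1000 * 2
    # Keepalive interval: 1/3 of the idle time.
    # The floor of 1 second is an assumption of this sketch.
    interval_s = max(idle_s // 3, 1)
    # A probe count of 3 is also an assumption of this sketch.
    probes = 3
    return idle_s, interval_s, probes


def seconds_until_drop(idle_s, interval_s, probes):
    """Approximate time before the kernel gives up on a silent peer."""
    return idle_s + interval_s * probes
```

With a cluster node timeout of 5000 ms this gives an idle time of 10 s and an interval of 3 s, so under the assumed probe count a peer that stops answering is dropped after roughly 19 s.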

So the PR makes the TCP keepalive settings of the Redis inbound cluster connection configurable through the existing config variable server.tcpkeepalive (the same as for other Redis server connections).
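As an illustration only (this is not the patch from the PR), the idea of driving a connection's keepalive from a single configured value could look like the following Python sketch. `apply_keepalive` is a hypothetical helper; the 1/3 interval mirrors the ratio mentioned above, the probe count of 3 is an assumption, and the TCP_KEEP* options are only set where the platform exposes them.

```python
import socket


def apply_keepalive(sock, idle_s):
    """Enable TCP keepalive on sock using one configured idle time (seconds).

    Returns True if keepalive was enabled, False if idle_s disables it.
    """
    if idle_s <= 0:
        # Treat a non-positive value as "keepalive disabled" (assumption).
        return False
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained options below are not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Interval is 1/3 of the idle time, at least 1 second.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL,
                        max(idle_s // 3, 1))
    if hasattr(socket, "TCP_KEEPCNT"):
        # Probe count of 3 is an assumption of this sketch.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
    return True
```

For example, calling `apply_keepalive(sock, 300)` on a TCP socket would use a 300 s idle time instead of one derived from the cluster node timeout.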

valkey-io/valkey issue #825