rhubarbgroup / redis-cache

A persistent object cache backend for WordPress powered by Redis. Supports Predis, PhpRedis, Relay, replication, sentinels, clustering and WP-CLI.
https://wordpress.org/plugins/redis-cache/
GNU General Public License v3.0

Flushing a site in "keys" or "scan" mode in a Sentinel setup causes failover due to long script runtime #420

Closed - JanThiel closed this 1 year ago

JanThiel commented 1 year ago

We use a clustered setup with n web servers and 2 DB servers. All of them run Redis as well as Redis Sentinel. The DB servers are configured to be the preferred masters (using the Redis slave-priority setting). We currently have a minimum quorum of 5 sentinels in our setup (= a minimum of 6 active servers).

The web servers run a very minimal WordPress setup: php-fpm, nginx, Redis, Redis Sentinel. It's a WordPress multisite.

Using the Redis flush command - for example via WP-CLI - reproducibly triggers Sentinel failovers: the slaves believe the master is down while it flushes.

It happens nearly every time we use wp redis flush <id> or wp redis flush. We also tried --async, but that didn't work any better. Even worse, the flush does not succeed when a failover occurs: the master gets flushed, but since it is no longer the master, the flush is not propagated to the slaves. So we end up with a failover and the data still isn't flushed; we have to flush several times to actually succeed.

I believe the Lua script you use for flushing blocks the master for longer than the configured timeout, which triggers the failover procedure. But it's hard to verify, as we are not aware of a good way to debug why a failover happened.
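One way we could try to confirm that theory - a rough sketch, assuming redis-cli access to the master and a local sentinel; the password is a placeholder - is to check the master's slow log and its busy-reply threshold (the errorstat_BUSY counter below suggests that threshold is being hit), and to ask a sentinel for its view of the master:

# On the master: recent commands/scripts that exceeded the slowlog threshold
redis-cli -a '<password>' SLOWLOG GET 10

# Threshold after which Redis starts answering other clients with BUSY
# while a script runs (Redis 7 name; formerly lua-time-limit)
redis-cli -a '<password>' CONFIG GET busy-reply-threshold

# On a sentinel: current view of the master; the flags field shows s_down/o_down
redis-cli -p 26379 SENTINEL master mymaster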

Some Redis stats from the master - selected INFO output:

# Server
redis_version:7.0.8
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:5a5034b4187b7759
redis_mode:standalone
os:Linux 4.18.0-425.3.1.el8.x86_64 x86_64
arch_bits:64
monotonic_clock:POSIX clock_gettime
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:8.5.0
process_id:1960434
process_supervised:systemd
run_id:e789fcf43eff8bbdc082978beb1bfe8b7807d9bc
tcp_port:6379
server_time_usec:1676235457681791
uptime_in_seconds:2118169
uptime_in_days:24
hz:10
configured_hz:10
lru_clock:15291073
executable:/usr/bin/redis-server
config_file:/etc/redis/redis.conf
io_threads_active:0

# Replication
role:master
connected_slaves:5
slave0:ip=10.0.0.14,port=6379,state=online,offset=170887507795,lag=0
slave1:ip=10.0.0.19,port=6379,state=online,offset=170887506177,lag=1
slave2:ip=10.0.0.11,port=6379,state=online,offset=170887511135,lag=0
slave3:ip=10.0.0.18,port=6379,state=online,offset=170887511894,lag=0
slave4:ip=10.0.0.24,port=6379,state=online,offset=170887511135,lag=0
master_failover_state:no-failover
master_replid:a0ab0a705406134b40ea0795ae7a6cfd114d41c2
master_replid2:36f68399547565cf79bbc4392855c54495ef5e6c
master_repl_offset:170887512054
second_repl_offset:159715309504
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:170886458968
repl_backlog_histlen:1053087

# Errorstats
errorstat_BUSY:count=2405
errorstat_ERR:count=1
errorstat_LOADING:count=4965
errorstat_NOAUTH:count=1527
errorstat_READONLY:count=1
errorstat_UNKILLABLE:count=8
errorstat_WRONGPASS:count=1

Here are the relevant configs:

redis.conf

...
maxmemory-policy allkeys-lru
maxmemory 12gb
...

sentinel.conf

sentinel monitor mymaster 10.0.0.23 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 10000
sentinel deny-scripts-reconfig yes
...

With these wp-config.php settings:

define('WP_REDIS_CONFIG', [
    'token' => '...',
    'sentinels' => [
        'tcp://127.0.0.1:26379',
        'tcp://10.0.0.23:26379',
        'tcp://10.0.0.24:26379',
    ],
    'service' => 'mymaster',
    'password' => '...',
    'persistent' => true,
    'database' => 1, // change for each site
    'maxttl' => 86400 * 7,
    'timeout' => 0.5,
    'read_timeout' => 0.5,
    'retry_interval' => 50,
    'analytics' => [
        'enabled' => false,
    ],
    'shared' => true,
    'split_alloptions' => true,
    'async_flush' => true,
    'flush_network' => 'site',
    'serializer' => 'igbinary',
    'compression' => 'zstd',
    'debug' => false,
]);

And here is the sentinel log for one of these failovers while flushing:

1867759:X 08 Feb 2023 20:23:18.175 # +sdown master mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:18.237 # +odown master mymaster 10.0.0.23 6379 #quorum 2/2
1867759:X 08 Feb 2023 20:23:18.237 # +new-epoch 239
1867759:X 08 Feb 2023 20:23:18.237 # +try-failover master mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:18.238 * Sentinel new configuration saved on disk
1867759:X 08 Feb 2023 20:23:18.238 # +vote-for-leader 58d0864b5eaaa3ac15f56b503fb470984674b692 239
1867759:X 08 Feb 2023 20:23:18.240 # fdcd32f37794255b5df4bc4b3aa34c0693be2fd3 voted for 58d0864b5eaaa3ac15f56b503fb470984674b692 239
1867759:X 08 Feb 2023 20:23:18.243 # aipoh0foh6uosh9jeineeteedoh5Bies2Ohsaemi voted for 58d0864b5eaaa3ac15f56b503fb470984674b692 239
1867759:X 08 Feb 2023 20:23:18.245 # oog9theitoo6lei9oumi9ivai2Peeh3oree2quee voted for 58d0864b5eaaa3ac15f56b503fb470984674b692 239
1867759:X 08 Feb 2023 20:23:18.246 # hixai7wohpei6EeYiewahgh1joo1kung3em3taN3 voted for 58d0864b5eaaa3ac15f56b503fb470984674b692 239
1867759:X 08 Feb 2023 20:23:18.250 # tha6eeti4aiquei9Naethohreaxaengo0Ou5uzei voted for 58d0864b5eaaa3ac15f56b503fb470984674b692 239
1867759:X 08 Feb 2023 20:23:18.291 # +elected-leader master mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:18.291 # +failover-state-select-slave master mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:18.358 # +selected-slave slave 10.0.0.19:6379 10.0.0.19 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:18.358 * +failover-state-send-slaveof-noone slave 10.0.0.19:6379 10.0.0.19 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:18.435 * +failover-state-wait-promotion slave 10.0.0.19:6379 10.0.0.19 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:19.113 * Sentinel new configuration saved on disk
1867759:X 08 Feb 2023 20:23:19.113 # +promoted-slave slave 10.0.0.19:6379 10.0.0.19 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:19.113 # +failover-state-reconf-slaves master mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:19.202 * +slave-reconf-sent slave 10.0.0.14:6379 10.0.0.14 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:19.286 * +slave-reconf-inprog slave 10.0.0.14:6379 10.0.0.14 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:19.286 * +slave-reconf-done slave 10.0.0.14:6379 10.0.0.14 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:19.369 # -odown master mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:19.369 * +slave-reconf-sent slave 10.0.0.11:6379 10.0.0.11 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:19.387 * +slave-reconf-inprog slave 10.0.0.11:6379 10.0.0.11 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:20.402 * +slave-reconf-done slave 10.0.0.11:6379 10.0.0.11 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:20.492 * +slave-reconf-sent slave 10.0.0.24:6379 10.0.0.24 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:21.274 # -sdown master mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:21.409 * +slave-reconf-inprog slave 10.0.0.24:6379 10.0.0.24 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:29.154 # +failover-end-for-timeout master mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:29.154 # +failover-end master mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:29.154 * +slave-reconf-sent-be slave 10.0.0.24:6379 10.0.0.24 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:29.154 * +slave-reconf-sent-be slave 10.0.0.18:6379 10.0.0.18 6379 @ mymaster 10.0.0.23 6379
1867759:X 08 Feb 2023 20:23:29.154 # +switch-master mymaster 10.0.0.23 6379 10.0.0.19 6379
1867759:X 08 Feb 2023 20:23:29.154 * +slave slave 10.0.0.14:6379 10.0.0.14 6379 @ mymaster 10.0.0.19 6379
1867759:X 08 Feb 2023 20:23:29.154 * +slave slave 10.0.0.11:6379 10.0.0.11 6379 @ mymaster 10.0.0.19 6379
1867759:X 08 Feb 2023 20:23:29.154 * +slave slave 10.0.0.24:6379 10.0.0.24 6379 @ mymaster 10.0.0.19 6379
1867759:X 08 Feb 2023 20:23:29.154 * +slave slave 10.0.0.18:6379 10.0.0.18 6379 @ mymaster 10.0.0.19 6379
1867759:X 08 Feb 2023 20:23:29.154 * +slave slave 10.0.0.23:6379 10.0.0.23 6379 @ mymaster 10.0.0.19 6379
1867759:X 08 Feb 2023 20:23:29.155 * Sentinel new configuration saved on disk

After each failover, OCP reports that there are no available Sentinels for some time (a few minutes) before operation continues as expected. I believe that is also the time the slaves need to finish their part of the failover.

I'd appreciate any suggestions on how to debug or fix this :-)

Thanks,

Jan

JanThiel commented 1 year ago

Got some more intel pointing to one of the scripts - note that the 6733 ms runtime below is well above our 5000 ms down-after-milliseconds:

1960434:M 10 Feb 2023 10:47:23.065 # Slow script detected: still in execution after 6733 milliseconds. You can try killing the script using the SCRIPT KILL command. Script name is: 1c200b98cbcdecb3f02c5af4f22ded1f954b62f0.
1960434:M 10 Feb 2023 10:47:23.072 # Connection with replica 10.0.0.19:6379 lost.
1960434:M 10 Feb 2023 10:47:24.672 # Connection with replica client id #9982046 lost.
1960434:M 10 Feb 2023 10:47:29.769 # Connection with replica client id #9982015 lost.

tillkruss commented 1 year ago

Can you try a different flushing approach? The default keys mode chokes on very large datasets, hence the Slow script detected warning.

define('WP_REDIS_CONFIG', [
    'group_flush' => 'scan',
    // 'group_flush' => 'incremental',
]);
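For context, the keys mode runs as a single server-side Lua script (hence the Slow script detected warning above), which blocks the server for the whole run; an incremental approach matches and deletes keys from the client in small batches, so the master stays responsive in between. A rough PhpRedis sketch of that idea - not our actual implementation; the prefix and batch size are made up for illustration:

// Rough idea of an incremental flush: match and delete keys in small
// batches from the client instead of one long-running server-side script.
$redis = new Redis();
$redis->connect('10.0.0.23', 6379);
$redis->auth('...');      // placeholder
$redis->select(1);

// Let PhpRedis keep calling SCAN until it has results or the scan is done.
$redis->setOption(Redis::OPT_SCAN, Redis::SCAN_RETRY);

$iterator = null;
while ($keys = $redis->scan($iterator, 'site-1:*', 1000)) { // hypothetical prefix
    $redis->unlink($keys); // UNLINK reclaims memory asynchronously
}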

JanThiel commented 1 year ago

scan - triggered a failover again instantly

incremental: First try:

$ wp redis flush <ID>
objectcache.error: socket error on read socket
Error: Object cache of site [<ID>] could not be flushed.

Second try: Took about 10 minutes for a small site but did not trigger a failover.

$ wp redis flush <ID>
Success: Object cache of site [<ID>] was flushed.

tillkruss commented 1 year ago

How many keys do you have? DBSIZE

JanThiel commented 1 year ago

How many keys do you have? DBSIZE

db1:keys=34335789,expires=34335673,avg_ttl=270290842
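(That line is the INFO keyspace format; the per-database count can also be read directly - assuming local redis-cli access:)

redis-cli -a '<password>' INFO keyspace
redis-cli -a '<password>' -n 1 DBSIZE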

tillkruss commented 1 year ago

I don't think this is an issue with Sentinel.

35M keys is a lot to scan; I don't think the flush_network option is realistic to use in this case.

The only option I see here is to use per-site prefixes for invalidation, which would take a while to implement.

This approach also comes with the downside that those keys need to expire via TTL, which means a lot more data in Redis, because essentially no full flush ever happens.
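Roughly, the idea behind prefix-based invalidation - not how OCP works today, and the names are made up for illustration - is to put a per-site generation counter into every key and bump it to "flush", leaving the old keys to expire via TTL or be evicted:

// Hypothetical sketch of prefix/generation-based invalidation (PhpRedis).
function site_cache_key(Redis $redis, int $siteId, string $group, string $key): string
{
    // The current generation is baked into every key for this site.
    $generation = (int) $redis->get("site:{$siteId}:generation");
    return "site:{$siteId}:gen:{$generation}:{$group}:{$key}";
}

function flush_site(Redis $redis, int $siteId): void
{
    // O(1) "flush": old keys become unreachable and expire via their TTL.
    $redis->incr("site:{$siteId}:generation");
}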

JanThiel commented 1 year ago

Hmmm... that's not so great news.

We know we work with large amounts of data. That was the primary reason we moved to OCP - we expected it to be capable of handling large setups, based on the shiny reference list ;-).

Anyway, you are the expert. If you can offer us a solution in the near future, or any other config suggestion, we can work with the incremental mode as a temporary workaround in the meantime.

tillkruss commented 1 year ago

We added the incremental option recently for exactly this scenario, but waiting 10 minutes for a flush is unacceptable IMO.

Quick question: using prefixes to invalidate groups would be instant and is on the roadmap. Would you be okay with a short TTL (<24h) and just letting dead data pile up in Redis instead of flushing it?

The best option right now would be to disable flush_network. It's not ideal, but it will be fast and reliable.
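For example - sketching against the config posted above, and assuming "disabling" simply means not setting the option, so flushes are no longer site-scoped:

define('WP_REDIS_CONFIG', [
    // ...
    // 'flush_network' => 'site', // removed/commented out
]);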

I'll dig deeper into this.

JanThiel commented 1 year ago

Totally agree, a 10-minute flush is a no-go. But it's still better than a guaranteed failover event.

I currently cannot estimate the consequences of low TTLs. We do need the ability to flush data for single sites, though. If it's one or the other, I'll take the single-site flush.

Piling up data sadly would not work in our setup, because we run Redis slaves on all our web servers through Sentinel, and these do not have the vast amount of RAM our DB servers have.

But again, we regularly need to flush the object cache for single sites when we set up new ones, which is a routine task. Flushing all of Redis has a massive - if temporary - negative impact on all of our sites and would be too expensive to run during daily business unless absolutely unavoidable.

I am fine with (nearly) anything you suggest as a temporary solution, as long as single-site flushing keeps working and the DB size does not double.

tillkruss commented 1 year ago

Piling up data sadly would not work in our setup, because we run Redis slaves on all our web servers through Sentinel, and these do not have the vast amount of RAM our DB servers have.

In that case, using prefix-based invalidation wouldn't do you any favours. That's too bad.

This is the best approach then:

define('WP_REDIS_CONFIG', [
    'group_flush' => 'incremental',
]);

tillkruss commented 1 year ago

We could potentially leverage Redis Cluster (instead of Sentinel) for this. Cluster shards groups across individual instances, so if you flush site 1 you'd only need to flush the cluster node that holds that site's data.

Currently Object Cache Pro doesn't shard by site, but by group; however, this would be trivial to adjust for multisite installations.
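For background, Redis Cluster places keys into hash slots, and a {hash tag} in the key name forces all keys sharing that tag onto the same slot (and therefore the same node) - which is the property per-site sharding would rely on. A sketch with made-up key names:

SET {site:1}:post:42 "..."    -> hashed on "site:1", lands on one node
SET {site:1}:user:7 "..."     -> same tag, same slot and node
SET {site:2}:post:42 "..."    -> different tag, (usually) a different node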

JanThiel commented 1 year ago

Discussed it internally. "Piling up" depends on the amount of data. Eviction would continue to work, so it shouldn't be an issue, just a limitation. If we're talking about a 24h TTL, the data volume should be much smaller anyway, so that is something we could try without risk. Worst case: a performance impact due to the much shorter TTL (1 day instead of the current 7 days). But even that is questionable.
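In wp-config terms that would just mean lowering the maxttl we already set, e.g.:

define('WP_REDIS_CONFIG', [
    // ...
    'maxttl' => 86400, // 1 day, down from the current 86400 * 7
]);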

Sharding: We discarded Redis Cluster for exactly this reason. We set up Redis Sentinel for HA; Redis Cluster with sharding does not offer HA at all.

Side info: Did another site flush with incremental today. It took "only" about 1 minute, which again reduced the pain on our end. Just FYI.

tillkruss commented 1 year ago

Moving to email.