JanThiel closed this issue 1 year ago.
Got some more intel pointing in the direction of one of the scripts:
1960434:M 10 Feb 2023 10:47:23.065 # Slow script detected: still in execution after 6733 milliseconds. You can try killing the script using the SCRIPT KILL command. Script name is: 1c200b98cbcdecb3f02c5af4f22ded1f954b62f0.
1960434:M 10 Feb 2023 10:47:23.072 # Connection with replica 10.0.0.19:6379 lost.
1960434:M 10 Feb 2023 10:47:24.672 # Connection with replica client id #9982046 lost.
1960434:M 10 Feb 2023 10:47:29.769 # Connection with replica client id #9982015 lost.
Can you try using a different flushing approach? The default `keys` approach chokes on very large datasets, hence the `Slow script detected` warning.
define('WP_REDIS_CONFIG', [
'group_flush' => 'scan',
// 'group_flush' => 'incremental',
]);
`scan` - triggered a failover again instantly.
`incremental`:
First try:
$ wp redis flush <ID>
objectcache.error: socket error on read socket
Error: Object cache of site [<ID>] could not be flushed.
Second try: Took about 10 minutes for a small site but did not trigger a failover.
$ wp redis flush <ID>
Success: Object cache of site [<ID>] was flushed.
How many keys do you have? DBSIZE
> How many keys do you have? `DBSIZE`
db1:keys=34335789,expires=34335673,avg_ttl=270290842
I don't think this is an issue with Sentinel.
35M keys is a lot to scan; I don't think the `flush_network` option is realistic to use in this case.
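For context, a scan-based flush boils down to walking the keyspace in small batches, roughly like the sketch below (phpredis, with a made-up `site:123:*` pattern; this is not the plugin's actual code):

```php
<?php
// Rough sketch of a SCAN-based flush (phpredis; the key pattern is made up,
// Object Cache Pro's real key layout differs).
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->setOption(Redis::OPT_SCAN, Redis::SCAN_RETRY);

$iterator = null;
$deleted  = 0;

// Each SCAN call only touches a small batch of keys, so the server never
// blocks for long, but covering ~35M keys takes a huge number of round trips.
while (($keys = $redis->scan($iterator, 'site:123:*', 1000)) !== false) {
    if ($keys) {
        $deleted += $redis->unlink($keys); // UNLINK frees memory asynchronously
    }
}

echo "Removed {$deleted} keys\n";
```

That's why it doesn't block the master the way a single Lua/`KEYS` pass does, but at this size it can easily take minutes.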
The only option I see here is to use per-site prefixes for invalidation, which would take a while to implement.
This approach also comes with the downside that those keys need to expire via TTL, which means a lot more data in Redis, because essentially no full flush ever happens.
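To illustrate what I mean by per-site prefixes (key names here are made up, not OCP's actual scheme), the idea is an O(1) version bump instead of deleting keys:

```php
<?php
// Conceptual sketch of prefix-based invalidation, not Object Cache Pro's API.
// Instead of deleting a site's keys, bump a version number that is baked into
// every key; old keys become unreachable and simply expire via their TTL.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

function cacheKey(Redis $redis, int $siteId, string $key): string
{
    // The current "generation" of this site's cache (0 if never flushed).
    $version = (int) $redis->get("site:{$siteId}:version");

    return "site:{$siteId}:v{$version}:{$key}";
}

function flushSite(Redis $redis, int $siteId): void
{
    // O(1) "flush": every previously written key is now orphaned.
    $redis->incr("site:{$siteId}:version");
}

$redis->setex(cacheKey($redis, 123, 'post:42'), 86400, 'payload'); // 24h TTL
flushSite($redis, 123); // instant, no SCAN, no blocking
```

The trade-off is exactly the one described above: orphaned keys linger until their TTL runs out, so memory usage grows.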
Hmmm... that's not so great news.
We know we work with large data. That was the primary reason for us to move to OCP as we kind of expected it to be capable of handling large setups based on the shiny reference list ;-).
Anyway, you are the expert. If you can offer us any solution in the near future or any other config suggestion, we could work with the `incremental` mode as a temporary workaround.
We added the `incremental` option recently for exactly this scenario, but waiting 10 minutes for a flush is unacceptable IMO.
Quick question: using prefixes to invalidate groups would be instant and is on the roadmap. Would you be okay with a short TTL (<24h) and just letting dead data pile up in Redis instead of flushing it?
The best option right now would be to disable `flush_network`. It's not ideal, but will be fast and reliable.
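A rough wp-config sketch of that suggestion; the exact value for turning network flushing off is an assumption here, so please verify it against the Object Cache Pro documentation:

```php
define('WP_REDIS_CONFIG', [
    // Assumption: the exact value that disables network-wide flushing may
    // differ in your OCP version; check the plugin documentation.
    'flush_network' => false,
]);
```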
I'll dig deep into this.
Totally agree, a 10 min flush is a no-go. But better than a certain failover event.
I currently cannot estimate the consequences of low TTLs. We do need the ability to flush data for single sites though. If it would be one or the other, I will take the single site flush.
Piling up data sadly would not work in our setup due to the fact that we run redis slaves on all our web servers through sentinel. These do not have the vast amount of RAM our DB servers have.
But again, we regularly need to flush the object cache for single sites when we set up new ones, which is a regular task. Flushing all of redis has a massive - temporary - negative impact on all of our sites and would be too expensive to run during daily business unless absolutely unavoidable.
I am fine with (nearly) everything you suggest as a temporary solution, as long as single-site flushing keeps working and the DB size does not double.
> Piling up data sadly would not work in our setup due to the fact that we run redis slaves on all our web servers through sentinel. These do not have the vast amount of RAM our DB servers have.
In that case, using prefix-based invalidation wouldn't do you any favours. That's too bad.
This is the best approach then:
define('WP_REDIS_CONFIG', [
'group_flush' => 'incremental',
]);
We could potentially leverage Redis Cluster (instead of Sentinel) for this. Cluster shards groups into individual instances, so that if you flush `site:1` you'd only need to flush the cluster node that holds that site's data.
Currently Object Cache Pro doesn't shard by site, but by group; however, this would be trivial to adjust for Multisite installations.
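For illustration, this relies on Redis Cluster hash tags: keys that share the same `{...}` tag always land on the same shard. A rough phpredis sketch (node addresses are placeholders, and this is not how OCP names its keys):

```php
<?php
// Hash-tag illustration for Redis Cluster (phpredis RedisCluster class).
// The part of a key inside {braces} determines the hash slot, so every key
// tagged {site:1} is stored on the same shard.
$cluster = new RedisCluster(null, ['10.0.0.10:7000', '10.0.0.11:7000']); // placeholder seed nodes

$cluster->set('{site:1}:post:42', 'payload');
$cluster->set('{site:1}:option:home', 'https://example.com');

// Both keys map to the same slot, so invalidating "site 1" only ever touches
// the single node that owns that slot.
```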
Discussed it internally. "Piling up" depends on the amount of data. Eviction would continue to work, so it shouldn't be an issue, just a limitation. If we talk about a 24h TTL, the data volume should be much smaller anyway, so that is something we could try without much risk. Worst case: a performance impact due to the much shorter TTL (1 day instead of the current 7 days). But that's questionable.
Sharding: We discarded "Redis Cluster" exactly because of this. We set up Redis Sentinel for HA. Redis Cluster with sharding does not offer HA at all.
Side info: Ran another site flush today with `incremental`. It took "only" about 1 minute. That again reduced the pain on our site. Just FYI.
Moving to email.
We use a clustered setup with n web servers and 2 DB servers. All of them run redis as well as redis sentinel. The DB servers are configured to be the primary master (using redis `slave-priority`). We have a minimum quorum of 5 sentinels within our setup at the moment (= a minimum of 6 active servers). The web servers run a very minimalistic WordPress setup: php-fpm, nginx, redis, redis sentinel. WordPress multisite.
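For readers following along, the priority and quorum mechanics referred to here are plain Redis/Sentinel directives; a generic illustration with placeholder values, not our actual configs (those are attached further down):

```
# redis.conf on a DB server: lower replica-priority is preferred during failover
# (replica-priority is the modern alias for slave-priority)
replica-priority 10

# sentinel.conf: 5 sentinels must agree before the master is considered down
sentinel monitor mymaster 10.0.0.10 6379 5
sentinel down-after-milliseconds mymaster 30000
```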
Using the redis flush command - for example via WP-CLI - triggers reproducible Sentinel failovers: the slaves believe that the master is down while it flushes.
It happens most of the time (nearly all of the time) we use `wp redis flush <id>` or `wp redis flush`. We also tried `--async`, but it didn't work any better. Even worse, the flush does not succeed when it fails over: the master gets flushed, but as it isn't the master anymore, the flush itself is not propagated throughout the cluster slaves. So we end up with a failover and the data is not flushed; we have to flush several times to actually be successful.
I believe it might be that the LUA script you use to flush blocks the master for a longer period than the configured timeout, thus triggering the failover procedure. But it's hard to debug, as we are not aware of any way to find out why the failover happened.
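One way to sanity-check that theory is to compare Sentinel's down-after-milliseconds with how long the flush script actually ran, e.g. via the slowlog. A rough phpredis sketch ("mymaster" and the addresses are placeholders for the real master group):

```php
<?php
// Compare Sentinel's down-after threshold with the duration of recent slow
// commands on the master (phpredis; names and addresses are placeholders).
$sentinel = new Redis();
$sentinel->connect('127.0.0.1', 26379);

// SENTINEL MASTER returns a flat list of field/value pairs.
$fields = $sentinel->rawCommand('SENTINEL', 'master', 'mymaster');
$master = [];
for ($i = 0; $i < count($fields); $i += 2) {
    $master[$fields[$i]] = $fields[$i + 1];
}

$redis = new Redis();
$redis->connect('10.0.0.10', 6379);

// SLOWLOG GET lists recent slow commands with their duration in microseconds;
// the flush script should show up here right after a failed flush.
$slowlog = $redis->rawCommand('SLOWLOG', 'GET', 5);

echo "down-after-milliseconds: {$master['down-after-milliseconds']}\n";

foreach ($slowlog as $entry) {
    // Each entry: [id, unix timestamp, duration in microseconds, [command], ...]
    printf("slow command took %d ms: %s\n", $entry[2] / 1000, implode(' ', $entry[3]));
}
// If the script runtime exceeds down-after-milliseconds, Sentinel will start a
// failover while the flush is still running, which matches what we are seeing.
```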
Some redis stats (on master):
Some selected INFO returns
Here are the relevant configs:
redis.conf
sentinel.conf
With these wp-config settings:
And here is the sentinel log for one of these failovers while flushing:
After each failover, OCP responds that there are no available Sentinels for some time (a few minutes) before operation continues as expected. That is also roughly the time the slaves need to finish their failover, I believe.
I'd appreciate any suggestions on how to debug or get this fixed :-)
Thanks,
Jan