Redis Object Cache throws exception during BG save operation

xblabs commented 2 weeks ago

Not sure if category "bug" is adequate, so excuse the choice.

Description

For the last month or so, I am stumbling over the problem that Wordpress suddenly throws an erro stemming from Redis Cache's access attempt of Redis.

I have restarted Redis multiple times but the problem "builds up" again during the course of a few days.

As per Redis slow log ( pasted below ) ZRANGEBYSCORE and ZREMRANGEBYSCORE Commands take a very long time (12-22 seconds). They are likely operating on a large sorted set.

Possible Fix

The only ideas I have now :

1) to check lua-time-limit to ensure this is set to a reasonable value (e.g., 1000 ms) to prevent long-running scripts from blocking the server.

2) Set an eviction policy if not already set to handle memory efficiently (maxmemory-policy).

3) implement a connection wrapper in the form ( crude first idea)

function wp_cache_get_with_retry($key, $group = '', $force = false, &$found = null) {
    $retries = 3;
    while ($retries > 0) {
        try {
            return wp_cache_get($key, $group, $force, $found);
        } catch (RedisException $e) {
            if (strpos($e->getMessage(), 'BUSY') !== false) {
                $retries--;
                sleep(1);  # delay before retrying
            } else {
                throw $e;
            }
        }
    }
    return false;
}

4) investigate how ZRANGEBYSCORE and ZREMRANGEBYSCORE Commands work in the context of the plugin

Steps to Reproduce

Sporadically occurring, usually when Redis' BG dump script is running

Environment

Plugin version: 2.5.2
PHP version: 7.4
WordPress version: 6.5.4
Client: PhpRedis (v5.3.7)

Additional context

Plugin Config

define('WP_REDIS_DATABASE', 2 );
define( 'WP_REDIS_USER_SESSION_DB ', 10 );
define( 'WP_REDIS_GLOBAL_GROUPS ', ['blog-details', 'blog-id-cache', 'blog-lookup', 'global-posts', 'networks', 'rss', 'sites', 'site-details', 'site-lookup', 'site-options', 'site-transient', 'users', 'useremail', 'userlogins', 'userslugs'] );
define( 'WP_REDIS_IGNORED_GROUPS', ['counts', 'plugins', 'user_meta', 'usermeta'] );

WP Error

[08-Jul-2024 15:46:48 UTC] PHP Fatal error:  Uncaught Error: Call to a member function get() on null in /home/sites/public_html/wp-content/object-cache.php:193
Stack trace:
#0 /home/sites/public_html/wp-includes/class-wp-textdomain-registry.php(187): wp_cache_get()
#1 /home/sites/public_html/wp-includes/class-wp-textdomain-registry.php(302): WP_Textdomain_Registry->get_language_files_from_path()
#2 /home/sites/public_html/wp-includes/class-wp-textdomain-registry.php(98): WP_Textdomain_Registry->get_path_from_lang_dir()
#3 /home/sites/public_html/wp-includes/l10n.php(1370): WP_Textdomain_Registry->get()
#4 /home/sites/public_html/wp-includes/l10n.php(1400): _load_textdomain_just_in_time()
#5 /home/sites/public_html/wp-includes/l10n.php(194): get_translations_for_domain()
#6 /home/sites/public_html/wp-includes/l10n.php(306): translate()
#7 /home/sites/public_html/wp-con in /home/sites/public_html/wp-content/object-cache.php on line 193
[08-Jul-2024 15:46:48 UTC] RedisException: BUSY Redis is busy running a script. You can only call SCRIPT KILL or SHUTDOWN NOSAVE.   in /home/sites/public_html/wp-content/object-cache.php:743
Stack trace:
#0 /home/sites/public_html/wp-content/object-cache.php(743): Redis->select()
#1 /home/sites/public_html/wp-content/object-cache.php(536): WP_Object_Cache->connect_using_phpredis()
#2 /home/sites/public_html/wp-content/object-cache.php(256): WP_Object_Cache->__construct()
#3 /home/sites/public_html/wp-includes/load.php(860): wp_cache_init()
#4 /home/sites/public_html/wp-settings.php(144): wp_start_object_cache()
#5 /home/sites/public_html/wp-config.php(155): require_once('/home/sit...')
#6 /home/sites/public_html/wp-load.php(50): require_once('/home/sit...')
#7 /home/sites/public_html/wp-blog-header.php(13): require_once('/home/sit...')
#8 /home/sites/public_html/index.php(17): require('/home/sit...')
#9 {main}

Redis Stats

### Memory

- **Used Memory**: 2.70 GB (`used_memory_human`)
- **Peak Memory Usage**: 2.70 GB (`used_memory_peak_human`)
- **Memory Overhead**: 560 MB (`used_memory_overhead`)
- **Dataset Memory**: 2.34 GB (`used_memory_dataset`), 80.70% of total used memory.

### Stats

- **Total Commands Processed**: 11,669,282,979
- **Instantaneous Operations per Second**: 581
- **Expired Keys**: 18,650,441
- **Keyspace Hits**: 15,589,916,825
- **Keyspace Misses**: 1,503,877,722
- **Latest Fork Time**: 86,497 microseconds (~86.5 ms)

### Commandstats

- **GET Calls**: 6,991,708,927, Avg Time: 5.55 µs
- **SET Calls**: 2,080,750,661, Avg Time: 2.88 µs
- **MGET Calls**: 2,281,110,221, Avg Time: 8.61 µs
- **EVAL Calls**: 189, Avg Time: 3,044,714.75 µs

Plugin Context

WP_REDIS_IGNORED_GROUPS: [
    "counts",
    "plugins",
    "user_meta",
    "usermeta"
]
Global Groups: [
    "blog-details",
    "blog-id-cache",
    "blog-lookup",
    "global-posts",
    "networks",
    "rss",
    "sites",
    "site-details",
    "site-lookup",
    "site-options",
    "site-transient",
    "users",
    "useremail",
    "userlogins",
    "usermeta",
    "user_meta",
    "userslugs",
    "redis-cache",
    "blog_meta",
    "network-queries",
    "site-queries",
    "theme_files",
    "translation_files",
    "user-queries"
]
Ignored Groups: [
    "counts",
    "plugins",
    "user_meta",
    "usermeta",
    "theme_json",
    "wordfence",
    "wordfence-ls",
    "themes"
]
Unflushable Groups: []
Groups Types: {
    "blog-details": "global",
    "blog-id-cache": "global",
    "blog-lookup": "global",
    "global-posts": "global",
    "networks": "global",
    "rss": "global",
    "sites": "global",
    "site-details": "global",
    "site-lookup": "global",
    "site-options": "global",
    "site-transient": "global",
    "users": "global",
    "useremail": "global",
    "userlogins": "global",
    "usermeta": "ignored",
    "user_meta": "ignored",
    "userslugs": "global",
    "redis-cache": "global",
    "counts": "ignored",
    "plugins": "ignored",
    "blog_meta": "global",
    "network-queries": "global",
    "site-queries": "global",
    "theme_files": "global",
    "translation_files": "global",
    "user-queries": "global",
    "theme_json": "ignored",
    "wordfence": "ignored",
    "wordfence-ls": "ignored",
    "themes": "ignored"
}

Result of redis-cli slowlog get

1) 1) (integer) 5358
    2) (integer) 1717420089
    3) (integer) 14734
    4) 1) "ZREMRANGEBYSCORE"
       2) "pon21:redis-cache:metrics"
       3) "0"
       4) "1717416489"
    5) "127.0.0.1:34772"
    6) ""
 2) 1) (integer) 5357
    2) (integer) 1717416481
    3) (integer) 14878
    4) 1) "ZREMRANGEBYSCORE"
       2) "pon21:redis-cache:metrics"
       3) "0"
       4) "1717412881"
    5) "127.0.0.1:39122"
    6) ""
 3) 1) (integer) 5356
    2) (integer) 1717415923
    3) (integer) 10074
    4) 1) "GET"
       2) "pon21:options:mcp_video_sitemap_enabled"
    5) "127.0.0.1:38734"
    6) ""
 4) 1) (integer) 5355
    2) (integer) 1717412881
    3) (integer) 13912
    4) 1) "ZREMRANGEBYSCORE"
       2) "pon21:redis-cache:metrics"
       3) "0"
       4) "1717409281"
    5) "127.0.0.1:44402"
    6) ""
 5) 1) (integer) 5354
    2) (integer) 1717412390
    3) (integer) 15412
    4) 1) "ZRANGEBYSCORE"
       2) "pon21:redis-cache:metrics"
       3) "1717408790"
       4) "1717412330"
       5) "WITHSCORES"
    5) "127.0.0.1:50840"
    6) ""
 6) 1) (integer) 5353
    2) (integer) 1717412138
    3) (integer) 19850
    4)  1) "MGET"
        2) "pon21:term_meta:10283"
        3) "pon21:term_meta:10286"
        4) "pon21:term_meta:872"
        5) "pon21:term_meta:11232"
        6) "pon21:term_meta:11231"
        7) "pon21:term_meta:11175"
        8) "pon21:term_meta:11176"
        9) "pon21:term_meta:1273"
       10) "pon21:term_meta:10287"
       11) "pon21:term_meta:845"
       12) "pon21:term_meta:10290"
       13) "pon21:term_meta:204"
       14) "pon21:term_meta:11244"
       15) "pon21:term_meta:11237"
       16) "pon21:term_meta:11238"
       17) "pon21:term_meta:11239"
       18) "pon21:term_meta:11240"
       19) "pon21:term_meta:11241"
       20) "pon21:term_meta:11242"
       21) "pon21:term_meta:11243"
       22) "pon21:term_meta:10291"
       23) "pon21:term_meta:205"
       24) "pon21:term_meta:207"
       25) "pon21:term_meta:11156"
       26) "pon21:term_meta:871"
       27) "pon21:term_meta:1117"
       28) "pon21:term_meta:305"
       29) "pon21:term_meta:309"
       30) "pon21:term_meta:486"
       31) "pon21:term_meta:303"
       32) "... (233 more arguments)"
    5) "127.0.0.1:52178"
    6) ""
 7) 1) (integer) 5352
    2) (integer) 1717411649
    3) (integer) 10838
    4) 1) "GET"
       2) "pon21:options:notoptions"
    5) "127.0.0.1:43900"
    6) ""
 8) 1) (integer) 5351
    2) (integer) 1717411121
    3) (integer) 11669
    4)  1) "MGET"
        2) "pon21:term_meta:11232"
        3) "pon21:term_meta:11230"
        4) "pon21:term_meta:11231"
        5) "pon21:term_meta:1273"
        6) "pon21:term_meta:11632"
        7) "pon21:term_meta:197"
        8) "pon21:term_meta:11207"
        9) "pon21:term_meta:10287"
       10) "pon21:term_meta:204"
       11) "pon21:term_meta:912"
       12) "pon21:term_meta:11244"
       13) "pon21:term_meta:11238"
       14) "pon21:term_meta:11239"
       15) "pon21:term_meta:11240"
       16) "pon21:term_meta:11241"
       17) "pon21:term_meta:11242"
       18) "pon21:term_meta:11243"
       19) "pon21:term_meta:205"
       20) "pon21:term_meta:6846"
       21) "pon21:term_meta:11213"
       22) "pon21:term_meta:10281"
       23) "pon21:term_meta:11186"
       24) "pon21:term_meta:11324"
       25) "pon21:term_meta:1117"
       26) "pon21:term_meta:407"
       27) "pon21:term_meta:486"
       28) "pon21:term_meta:254"
       29) "pon21:term_meta:410"
       30) "pon21:term_meta:396"
       31) "pon21:term_meta:1791"
       32) "... (231 more arguments)"
    5) "127.0.0.1:52790"
    6) ""
 9) 1) (integer) 5350
    2) (integer) 1717411077
    3) (integer) 16472431
    4) 1) "EVAL"
       2) "                local cur = 0\n                local i = 0\n                local tmp\n                repeat\n                    t... (391 more bytes)"
       3) "0"
    5) "127.0.0.1:50358"
    6) ""
10) 1) (integer) 5349
    2) (integer) 1717410893
    3) (integer) 16354903
    4) 1) "EVAL"
       2) "                local cur = 0\n                local i = 0\n                local tmp\n                repeat\n                    t... (391 more bytes)"
       3) "0"
    5) "127.0.0.1:36900"
    6) ""

tillkruss commented 2 weeks ago

Can you post your config, aside from host/port/password?

xblabs commented 2 weeks ago

define('WP_REDIS_DATABASE', 2 );
define( 'WP_REDIS_USER_SESSION_DB ', 10 );
define( 'WP_REDIS_GLOBAL_GROUPS ', ['blog-details', 'blog-id-cache', 'blog-lookup', 'global-posts', 'networks', 'rss', 'sites', 'site-details', 'site-lookup', 'site-options', 'site-transient', 'users', 'useremail', 'userlogins', 'userslugs'] );
define( 'WP_REDIS_IGNORED_GROUPS', ['counts', 'plugins', 'user_meta', 'usermeta'] );

Updated in the description above as well

xblabs commented 2 weeks ago

@tillkruss , do you think trimming the groups in a certain way could help?

yatsukhnenko commented 2 weeks ago

@xblabs you may try to set WP_REDIS_FLUSH_TIMEOUT to limit execution time of LUA script which using for flushing groups cache

xblabs commented 2 weeks ago

@yatsukhnenko thanks! Do you spot any groups that could be more affected by ZRANGEBYSCORE and ZREMRANGEBYSCORE command use as those seem, so I can try to exclude them?

tillkruss commented 2 weeks ago

The Z* commands are only used by the analytic feature, you can turn that off too if you like.

xblabs commented 2 weeks ago

@tillkruss that seems to have been part of the problem. Checking Redis slow log the metrics key appears a lot. Disabling this via config seems to have alleviated the problem slightly,

I also reorganised the excluded groups and added all user related groups in there. Just to much individualzed cache data in conjunction with ACF. Note, we're talking about roughly 300K user rows plus several millions user meta rows.

tillkruss commented 2 weeks ago

@xblabs have you tried giving Redis more resources or using some kind of replication?

xblabs commented 2 weeks ago

@tillkruss overall server resources had enough room, (at least I thought) even during times when the problem arose. That said, no, I haven't looked into replication (yet). I changed the Redsi conf now as well to force wider bg savve intervals.

rhubarbgroup / redis-cache