
Could OOM Killer cause APCU deadlock ? #486

Open TomZhuPlanetart opened 1 year ago

TomZhuPlanetart commented 1 year ago

APCu version: 5.1.22, PHP version: 8.1.9

Today two of our production servers stopped responding. Nginx kept returning 502, and the error log had this error:

2023/03/16 06:39:50 [error] 10953#0: *783933 connect() to unix:/var/run/php-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream

The number of php-fpm processes had reached max_children. I spot-checked the call stacks of several of the FPM processes, and they all looked like this:

#0  0x0000ffffbc6f8a1c in pthread_rwlock_wrlock () from target:/lib64/libpthread.so.0
#1  0x0000ffffb1e79188 in apc_lock_wlock_impl (lock=<optimized out>) at /var/tmp/apcu/apc_lock.c:102
#2  apc_lock_wlock (lock=<optimized out>) at /var/tmp/apcu/apc_lock.c:322
#3  0x0000ffffb1e7c834 in apc_cache_wlock (cache=0x0) at /var/tmp/apcu/apc_cache.h:293
#4  apc_cache_store (cache=0xaaaab888a0a0, key=0xffffbc1bc000, val=0xffffbc015050, ttl=<optimized out>, exclusive=39, exclusive@entry=false) at /var/tmp/apcu/apc_cache.c:511
#5  0x0000ffffb1e7a9a4 in apc_store_helper (return_value=0xffffc49db968, exclusive=false, execute_data=<optimized out>) at /var/tmp/apcu/php_apc.c:490
#6  0x0000aaaaab0fffdc in execute_ex ()
#7  0x0000aaaaab103748 in zend_execute ()
#8  0x0000aaaaab083fc4 in zend_execute_scripts ()
#9  0x0000aaaaab020528 in php_execute_script ()
#10 0x0000aaaaaae70d90 in main ()

I then checked the system log with the following command, and it appears several FPM processes had been killed by the OOM killer:

dmesg | grep ' Killed process'
[273397.418006] Killed process 2470 (php-fpm) total-vm:610136kB, anon-rss:85900kB, file-rss:0kB, shmem-rss:57476kB
[273400.232269] Killed process 2485 (php-fpm) total-vm:610136kB, anon-rss:84876kB, file-rss:0kB, shmem-rss:58548kB
[273403.803063] Killed process 2499 (php-fpm) total-vm:676524kB, anon-rss:84760kB, file-rss:0kB, shmem-rss:58256kB
[273406.776361] Killed process 2474 (php-fpm) total-vm:608088kB, anon-rss:83756kB, file-rss:0kB, shmem-rss:58500kB
[273409.829074] Killed process 2498 (php-fpm) total-vm:608092kB, anon-rss:84096kB, file-rss:0kB, shmem-rss:57756kB
[273411.916416] Killed process 2488 (php-fpm) total-vm:609684kB, anon-rss:84940kB, file-rss:0kB, shmem-rss:57780kB
[273414.262209] Killed process 2494 (php-fpm) total-vm:609688kB, anon-rss:85128kB, file-rss:0kB, shmem-rss:57244kB
[273416.574143] Killed process 2492 (php-fpm) total-vm:608092kB, anon-rss:83796kB, file-rss:0kB, shmem-rss:58284kB
[273417.772178] Killed process 2462 (php-fpm) total-vm:610136kB, anon-rss:85428kB, file-rss:0kB, shmem-rss:56720kB
[273419.842312] Killed process 2487 (php-fpm) total-vm:610140kB, anon-rss:84536kB, file-rss:0kB, shmem-rss:58008kB
[273422.121932] Killed process 2408 (php-fpm) total-vm:606472kB, anon-rss:80948kB, file-rss:8kB, shmem-rss:64068kB
[273424.428721] Killed process 2455 (php-fpm) total-vm:608276kB, anon-rss:83124kB, file-rss:0kB, shmem-rss:61248kB
[273427.680694] Killed process 2429 (php-fpm) total-vm:608248kB, anon-rss:82688kB, file-rss:0kB, shmem-rss:61796kB
[273431.686782] Killed process 2496 (php-fpm) total-vm:610140kB, anon-rss:84928kB, file-rss:0kB, shmem-rss:58700kB
[273434.162302] Killed process 2464 (php-fpm) total-vm:608012kB, anon-rss:83436kB, file-rss:0kB, shmem-rss:60548kB
[273436.500678] Killed process 2451 (php-fpm) total-vm:610180kB, anon-rss:84500kB, file-rss:0kB, shmem-rss:59340kB
[273439.167658] Killed process 2491 (php-fpm) total-vm:607964kB, anon-rss:83644kB, file-rss:0kB, shmem-rss:60104kB
[273443.495059] Killed process 2503 (php-fpm) total-vm:610136kB, anon-rss:85568kB, file-rss:0kB, shmem-rss:58444kB
[273446.585592] Killed process 2497 (php-fpm) total-vm:610184kB, anon-rss:84620kB, file-rss:0kB, shmem-rss:59540kB
[273449.305805] Killed process 2484 (php-fpm) total-vm:610140kB, anon-rss:85020kB, file-rss:0kB, shmem-rss:59148kB
[273452.450242] Killed process 2495 (php-fpm) total-vm:610340kB, anon-rss:86108kB, file-rss:0kB, shmem-rss:58356kB
[273453.889468] Killed process 2160 (php-fpm) total-vm:600828kB, anon-rss:75428kB, file-rss:0kB, shmem-rss:69152kB
[273456.191259] Killed process 2489 (php-fpm) total-vm:610136kB, anon-rss:86084kB, file-rss:0kB, shmem-rss:58552kB
[273459.088086] Killed process 2476 (php-fpm) total-vm:610140kB, anon-rss:86216kB, file-rss:12kB, shmem-rss:58480kB
[273462.998725] Killed process 2452 (php-fpm) total-vm:610188kB, anon-rss:84512kB, file-rss:0kB, shmem-rss:60088kB
[273684.069670] Killed process 2552 (php-fpm) total-vm:610140kB, anon-rss:84532kB, file-rss:8kB, shmem-rss:59740kB
[274345.304179] Killed process 2538 (php-fpm) total-vm:610136kB, anon-rss:84524kB, file-rss:0kB, shmem-rss:59704kB
[300480.909337] Killed process 2520 (php-fpm) total-vm:610140kB, anon-rss:85508kB, file-rss:0kB, shmem-rss:58668kB
[300484.820437] Killed process 20249 (yum) total-vm:324688kB, anon-rss:215068kB, file-rss:0kB, shmem-rss:0kB
[306699.394469] Killed process 30288 (python2.7) total-vm:388616kB, anon-rss:168776kB, file-rss:0kB, shmem-rss:0kB
[328553.368812] Killed process 18773 (yum) total-vm:238440kB, anon-rss:204668kB, file-rss:0kB, shmem-rss:0kB

My PHP code uses APCu like this:

    public function load($id, $doNotTestCacheValidity = false)
    {
        // APCu not available: fall back to the parent cache backend.
        if (!$this->apcuLoaded) {
            return parent::load($id, $doNotTestCacheValidity);
        }

        $v = apcu_fetch(self::APCU_PREFIX . $id, $success);

        if (!$success) {
            // Miss: load from the parent backend and keep it in APCu for 10 seconds.
            $v = parent::load($id, $doNotTestCacheValidity);
            if ($v) {
                apcu_store(self::APCU_PREFIX . $id, $v, 10);
            }
        }

        return $v;
    }

I'm wondering whether it's possible that one of the PHP processes acquired the lock and was then killed by the OOM killer, so the lock it held was never released, eventually leaving all processes blocked.
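
For what it's worth, here is a minimal sketch of the scenario I have in mind (plain pthreads, not APCu's actual code; the file name and the one-second sleep are just for the demo): a process-shared rwlock is placed in shared memory, a child write-locks it and is SIGKILLed, and every later wrlock on it blocks forever, which matches the pthread_rwlock_wrlock frames in the backtrace above.

    /* rwlock_orphan.c: a process-shared rwlock orphaned by a killed holder.
     * Build: gcc rwlock_orphan.c -o rwlock_orphan -pthread
     */
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Put the lock in shared memory, as APCu does for its cache lock. */
        pthread_rwlock_t *lock = mmap(NULL, sizeof(*lock), PROT_READ | PROT_WRITE,
                                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (lock == MAP_FAILED) { perror("mmap"); return 1; }

        pthread_rwlockattr_t attr;
        pthread_rwlockattr_init(&attr);
        pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_rwlock_init(lock, &attr);

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: take the write lock and never release it. */
            pthread_rwlock_wrlock(lock);
            pause();
            _exit(0);
        }

        sleep(1);               /* give the child time to grab the lock */
        kill(pid, SIGKILL);     /* stand-in for the OOM killer */
        waitpid(pid, NULL, 0);

        printf("acquiring the orphaned write lock...\n");
        pthread_rwlock_wrlock(lock);   /* blocks forever: the dead child never unlocked */
        printf("never reached\n");
        return 0;
    }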

nikic commented 1 year ago

I'm wondering whether it's possible that one of the PHP processes acquired the lock and was then killed by the OOM killer, so the lock it held was never released, eventually leaving all processes blocked.

Yes, unfortunately this is possible. APCu uses POSIX rwlocks, which do not support robustness. POSIX only supports robust mutexes.
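
For illustration, a minimal sketch of that difference (plain pthreads, not APCu code): a process-shared mutex can be marked PTHREAD_MUTEX_ROBUST, so when its owner dies the next locker gets EOWNERDEAD and can repair the state with pthread_mutex_consistent(); there is no equivalent attribute for pthread rwlocks.

    /* robust_mutex.c: the next locker of a robust mutex learns its owner died.
     * Build: gcc robust_mutex.c -o robust_mutex -pthread
     */
    #include <errno.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (m == MAP_FAILED) { perror("mmap"); return 1; }

        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
        pthread_mutex_init(m, &attr);

        pid_t pid = fork();
        if (pid == 0) {
            pthread_mutex_lock(m);   /* child takes the lock... */
            pause();                 /* ...and is killed while holding it */
            _exit(0);
        }

        sleep(1);
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);

        int rc = pthread_mutex_lock(m);     /* returns EOWNERDEAD instead of hanging */
        if (rc == EOWNERDEAD) {
            printf("previous owner died; repairing shared state\n");
            pthread_mutex_consistent(m);    /* mark the mutex usable again */
        }
        pthread_mutex_unlock(m);
        return 0;
    }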