rcanavan opened this issue 7 years ago
Please use the latest version?
The production servers will get APCu 5.1.8 next week, our test servers have it already. Have any deadlock bugs been fixed between 5.1.7 and 5.1.8?
A general memory corruption bug was fixed; it's difficult to say whether it can or can't lead to deadlocks ... so testing would be best ;)
OK, so there's at least some chance that the root cause may be fixed.
I also neglected to grab a core dump of the hung processes, but I'm not sure if shared memory is included in those at all. Assuming I took a core dump next time, is there anything useful I can do analysing the APC shared memory?
Also, do you have any idea how I can identify which process currently owns the lock, and whether it legitimately owns the lock or has failed to release it?
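On the core dump question: a minimal sketch for capturing a core that includes the shared memory segment (assuming Linux and the gcore script from the gdb package; the filter value and paths are only examples):

# Rough sketch: shared mappings are excluded from cores by default,
# so widen coredump_filter before dumping the stuck worker.
echo 0x3f > /proc/<worker-pid>/coredump_filter
gcore -o /tmp/php-fpm-core <worker-pid>
gdb /usr/sbin/php-fpm /tmp/php-fpm-core.<worker-pid>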
We've encountered another of those deadlocks, this time with apcu 5.1.8. Same backtrace and zbacktrace.
Any idea about this: how can I identify which process currently owns the lock, and whether it legitimately owns the lock or has failed to release it?
APCu 5.1.9 with PHP 7.1.13: deadlock on 3 servers at approximately the same time. All processes (on one server) have essentially the same backtrace, with identical key=key@entry=... but slightly different values passed to apc_cache_insert. According to zbacktrace, it's always the same apcu_store() call saving an array for various string keys. No iterator involved; however, there are apparently -3 readers, similar to #266:
85 subq $WRITERS_WAKEUP, %rdi
(gdb) x/20dw $rdi - 0x0c
0x7fbbb0e77088: 0 -3 412 374
0x7fbbb0e77098: 0 80 0 128
0x7fbbb0e770a8: 0 0 0 0
0x7fbbb0e770b8: 0 0 0 0
0x7fbbb0e770c8: 81 0 1 0
(gdb) frame 3
#3 0x00007fbbc764a7e5 in apc_cache_store (cache=0x29963b0, strkey=<optimized out>, val=0x7fbbce013890, ttl=0, exclusive=exclusive@entry=0 '\000')
at /root/build/build/src/external/apcu-5.1.9/apc_cache.c:552
552 /root/build/build/src/external/apcu-5.1.9/apc_cache.c: No such file or directory.
(gdb) print *(cache->header)
$2 = {lock = {__data = {__lock = 0, __nr_readers = 4294967293, __readers_wakeup = 412, __writer_wakeup = 374, __nr_readers_queued = 0, __nr_writers_queued = 80,
__writer = 0, __shared = 128, __pad1 = 0, __pad2 = 0, __flags = 0},
__size = "\000\000\000\000\375\377\377\377\234\001\000\000v\001\000\000\000\000\000\000P\000\000\000\000\000\000\000\200", '\000' <repeats 26 times>,
__align = -12884901888}, nhits = 0, nmisses = 81, ninserts = 1, nexpunges = 1, nentries = 1, mem_size = 1200, stime = 1519200448, state = 0, lastkey = {str = 0x0,
mtime = 0, owner = 0}, gc = 0x0}
pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:85
85 subq $WRITERS_WAKEUP, %rdi
Thread 1 (Thread 0x7fbbd1efb840 (LWP 7151)):
#0 pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:85
#1 0x00007fbbc7646959 in apc_lock_wlock (lock=<optimized out>) at /root/build/build/src/external/apcu-5.1.9/apc_lock.c:245
#2 0x00007fbbc764a2ab in apc_cache_insert (cache=cache@entry=0x29963b0, key=key@entry=0x7fffd50d9de0, value=0x7fbbb1437c68, ctxt=ctxt@entry=0x7fffd50d9e00, t=t@entry=1519200448, exclusive=exclusive@entry=0 '\000') at /root/build/build/src/external/apcu-5.1.9/apc_cache.c:948
#3 0x00007fbbc764a7e5 in apc_cache_store (cache=0x29963b0, strkey=<optimized out>, val=0x7fbbce013890, ttl=0, exclusive=exclusive@entry=0 '\000') at /root/build/build/src/external/apcu-5.1.9/apc_cache.c:552
#4 0x00007fbbc7647cf4 in apc_store_helper (execute_data=<optimized out>, exclusive=0 '\000', return_value=0x7fbbce013820) at /root/build/build/src/external/apcu-5.1.9/php_apc.c:495
#5 zif_apcu_store (execute_data=<optimized out>, return_value=0x7fbbce013820) at /root/build/build/src/external/apcu-5.1.9/php_apc.c:518
Just to check, did you see any relevant log output around that time? Crashed process, opcache reset, or something like that?
No messages, aside from a "seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning ..." warning each second, presumably after the deadlock occurred, despite error_reporting = E_ALL.
Are you using apcu_entry anywhere? Wondering if this might be a side-effect from #246 that only manifests on a later apcu_store call.
There's not even a single call to apcu_entry() in our entire codebase. Unlike #246, our problem is very rare (i.e. once every other month or so, with ~10 servers processing millions of requests per day), but it usually results in multiple servers getting stuck within minutes at the most.
I think it's quite likely that this has the same root cause as #259, though your symptoms are somewhat different.
I agree with @nikic. I got all my servers stuck the same way (after a flush of APCu and a Varnish ban), resulting in many concurrent requests. 99% of the time it's okay, but sometimes one process blocks the others.
I've seen it with strace, when you strace some of your PHP processes. For me it was PHP-FPM with responses taking more than 30 seconds instead of milliseconds, so I had time to note the PID (say 1234), run strace -p 1234, and find some information about it. For example:
futex(0xafcc1050, FUTEX_WAIT, 2, NULL <unfinished ...>
Then it means the 2nd file descriptor is blocking, which you can view by issuing a:
ls -al /proc/1234/fd/2
which points to the blocking file.
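Putting those steps together, a rough sketch (1234 is a placeholder PID):

# Hedged sketch of the inspection steps described above.
strace -p 1234                # stuck workers sit in futex(..., FUTEX_WAIT, ...)
ls -al /proc/1234/fd          # list everything the worker has open
cat /proc/1234/status         # State, PPid, etc. of the stuck worker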
Is it possible to identify the process that is holding the lock (as opposed to waiting for it) with futexes, like it is via /proc/locks for flock() locks?
Good question :-) I don't know, but if you have a file descriptor, you can see what it's doing. At least that was how I detected the issue was in APCu and not in MySQL or something like that.
Hello there,
We seem to be affected by this issue in production. Using APCu version 5.1.8 on Debian 8.10 (Jessie), packaged by Dotdeb (5.1.8-1~dotdeb+8.1).
The traces from the slowlog show all FPM workers hung when calling apcu_store().
Running strace -p <worker-pid> always shows the last syscall as futex(<uaddr>, FUTEX_WAIT, ...), where uaddr is always the same for all processes.
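For reference, sampling all workers at once could look something like this (a rough sketch; the pgrep pattern and the 2-second timeout are assumptions to adjust for your pool):

# Hypothetical sketch: briefly strace every php-fpm worker to compare futex addresses.
for pid in $(pgrep -f 'php-fpm: pool'); do
    echo "== $pid =="
    timeout 2 strace -p "$pid" 2>&1 | grep futex | head -n 1
done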
The work load of our web frontends in production is about 10 requests per second. Unfortunately it is hard for me to find a way to pull Nikita's patch into production, and I failed to reproduce the deadlock in our staging environment. Anyway, I will follow up on this issue and eventually report how it turned out.
PS: This was originally posted as a comment of #259 but I re-posted it here because it's not crashing any process, it's just hanging them all in a deadlock.
@mhouriet the fact is that along with the crash of #259, which was happening during deletions (iterating over many keys and deleting a lot of them), my second issue was the lock and the servers going down too. See https://github.com/krakjoe/apcu/issues/259#issuecomment-363730090 for instance.
But all those bugs may be related, one way or another. If you can compile without RW locks, you may see if it's better. I'll try @nikic's patch soon, but one way or another, you'll have to compile it again (or perhaps Dotdeb will do it), although all new releases (5.1.9, 5.1.10) are labelled «PHP 7 bugfixes».
I just compiled the module from master (i.e. with @nikic patch from #259) and experimentally deployed it to one of our frontend servers in production. The deadlock occurred at least once every few days, so I'll be monitoring it carefully and report if it solved it for us or not.
If that fails, I will also try --disable-apcu-rwlocks as @dugwood suggested.
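For anyone else wanting to try that, a rough sketch of the rebuild (paths and steps are assumptions; a distro/Dotdeb package would need its own rebuild):

# Hedged sketch: build APCu from source without rwlocks.
git clone https://github.com/krakjoe/apcu.git && cd apcu
phpize                               # from the php dev package matching your PHP build
./configure --disable-apcu-rwlocks
make && make test
sudo make install                    # then restart php-fpm so the new apcu.so is loaded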
@rcanavan Apcu 5.1.11 is now released with a probable fix for this issue. Could you please check whether it works for you as well?
I'll try 5.1.11 as soon as possible. However, can you fix the headline of the release on https://github.com/krakjoe/apcu/releases ? It still says 5.1.10.
However, can you fix the headline of the release on https://github.com/krakjoe/apcu/releases ? It still says 5.1.10.
I see 5.1.11 there
@remicollet yeah, I changed it already
Unfortunately, I have to report a new deadlock occurrence with the patched module. All processes end up in the slowlog, doing nothing:
[0x00007f7d7be16550] apcu_store() /var/www/.../...php:494
[0x00007f7d7be16440] fetchFromApi() /var/www/.../...php:209
[0x00007f7d7be163c0] getSite() /var/www/.../...php:397
Using 5.1.11-dev with 7.0.27-1~dotdeb+8.1
root@...# strace -p 7186
Process 7186 attached
futex(0x7f7d4ee8b094, FUTEX_WAIT, 53650, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGSTOP {si_signo=SIGSTOP, si_code=SI_USER, si_pid=22665, si_uid=0} ---
--- stopped by SIGSTOP ---
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=22665, si_uid=0} ---
futex(0x7f7d4ee8b094, FUTEX_WAIT, 53650, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=22665, si_uid=0} ---
+++ killed by SIGTERM +++
Output from make test has been pasted here. Compiled from master with head at e9876acd49a537dc5a7e4b5ceef96f7dea3d3eae.
I am upgrading to the latest PHP 7.2 on this server now and will post updates.
What was the process 53650 doing? As you can see, it's the blocking one:
futex(0x7f7d4ee8b094, FUTEX_WAIT, 53650, NULL)
But the other big issue to me is that you only have 4% of free memory for APCu; that's not a good test, I think. Can you increase the memory? APCu nukes all entries when there's not enough memory left, so you should have at least 10 or 20% of free memory in the worst case.
Did you try without RW locks too? Note that I didn't have the time to test the new release in production, so I can't say if it's really better than 5.1.9 or not.
I will check the blocking process next time it occurs. I can increase the shm size, but we use APCu purely for volatile, local caching. The normal uptime cycle for this 512 MB segment is about 30 minutes, i.e. APCu nuking all entries when the cache is full is actually expected behavior, and that is what naturally occurred on the other servers over the weekend. Only this one has entered this state, and we see an abnormal uptime; perhaps this is not a coincidence and the deadlock was triggered by the write(s) that required flushing the cache.
Regarding disabling RW locks, I will look into this as well next time I recompile the module.
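As a rough way to keep an eye on the segment, something like this might do (note that with apc.enable_cli the CLI attaches to its own fresh segment, so on a live pool it's better to look at apc.php or call apcu_sma_info() from a web request):

# Rough sketch: print shared-memory segment size and available memory.
php -d apc.enable_cli=1 -r 'var_export(apcu_sma_info(true));'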
Maybe the fragmentation did this. As I've detailed in https://github.com/krakjoe/apcu/issues/127#issuecomment-244362339, fragmentation is not an issue in general use cases, but if you're near the memory limit, I guess that APCu will need to find the little pieces left that are at least as big as the data you want to write. With a lot of fragmentation and very little space, this is a difficult job, and it may end up in race conditions. This will happen before the «big nuke», which then makes things really easy. But to trigger the nuke, APCu must first return an error on a set that doesn't find enough space (again, it's a guess, maybe not the real behavior).
Even if a nuke is expected behavior, you should avoid it, as it's equivalent to a cold cache.
The issue from #259 was also related to cache expunge on full cache, so looks like we're still dealing with the same problem area... As @dugwood mentioned, it would be good to have a backtrace from the process holding the lock, maybe that will give some clues.
Looking through the code again, I see that apcu_entry still has a case that could deadlock if the cache runs full. @mhouriet Do you happen to use the apcu_entry API anywhere? Don't see any other problematic cases right now.
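A quick way to check for that (the docroot path is only an example):

# Search the deployed code for any apcu_entry() call sites.
grep -rn --include='*.php' 'apcu_entry' /var/www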
What was the process 53650 doing? As you can see, it's the blocking one:
futex(0x7f7d4ee8b094, FUTEX_WAIT, 53650, NULL)
Is that actually the PID? I'm getting 184 with all processes according to strace, but that PID has never been used by php-fpm since the master process was started. On another system, val is 180, and no php-fpm process has ever had the PID 180. The systems in question are still running 5.1.9:
strace: Process 3347 attached
futex(0x7f2a27877094, FUTEX_WAIT, 184, NULL
The information given by apc.php seems to be rather confusing. The php-fpm master process was started on February 28th, and all php-fpm workers actually have it as their parent; however, according to apc.php, everything started pretty much exactly when the processes started to hang, and the cache is pristine:
| APCu Version | 5.1.9 |
|---|---|
| PHP Version | 7.1.13 |
| APCu Host | localhost (fit-asp14-sto04) (::1) |
| Server Software | Apache |
| Shared Memory | 1 Segment(s) with 256.0 MBytes (mmap memory) |
| Start Time | 2018/03/13 16:10:37 |
| Uptime | 56 minutes |

| Cached Variables | 1 ( 1.2 KBytes) |
|---|---|
| Hits | 0 |
| Misses | 294 |
| Request Rate (hits, misses) | 0.00 cache requests/second |
| Hit Rate | 0.00 cache requests/second |
| Miss Rate | 0.09 cache requests/second |
| Insert Rate | 0.00 cache requests/second |
| Cache full count | 1 |

| Memory Usage (multiple slices indicate fragments) | Hits & Misses |
|---|---|
| Free: 255.8 MBytes (99.9%) | Hits: 0 (0.0%) |
| Used: 229.6 KBytes (0.1%) | Misses: 294 (100.0%) |

Fragmentation: 0.00% (0.0 Bytes out of 255.8 MBytes in 2 fragments)
@rcanavan the PID may change if it's another lock issue. For example you may have 1234 locking every other thread, then when that lock is released you'll have 2345 locking all the threads.
As PHP-FPM children die (after a certain number of hits most of the time, so it's a natural death), they are created again. Or if you have too many idle threads (according to your PHP-FPM configuration, you can't have more than XX idle threads, or fewer than YY), the master process will create or destroy threads. So new PIDs, but the PHP-FPM master process will keep the same PID.
Perhaps APCu is reporting the creation date of your PHP-FPM thread, not its parent's. If you refresh the page and see various dates, that must be it. So it's not important, I think.
We're logging the PID for all requests in our php-fpm-access.log, and the numbers given by strace in the futex() call aren't in any of the logs going back 14 days, but most of the PIDs of the running php-fpm processes are in the current log (except for the unlucky processes that were started and immediately ran into the deadlock). With pm.max_requests = 500 we're currently getting process lifetimes of about 5 minutes.
The 56 minutes Uptime is suspicious; that should be 14 days. It can't be the start time of the current process, since all processes were deadlocked and I had to kill a few to get my HTTP request processed - otherwise I'd run into an Apache timeout.
@rcanavan The start time is reset whenever the cache is (explicitly or automatically) cleared.
@nikic thanks, I was mistaken on this one. I have 1 week, 3 days, 11 hours and 8 minutes for a PHP-FPM started 10d 11h 9m 34s ago, so that's okay (as I never hit the clear state).
@rcanavan this is the PID at the exact time you have a lock. If the thread disappears and releases the lock, you won't see it. But if it lasts a few seconds, you should be able to see every file used by the thread by issuing:
lrwx------ 1 www-data www-data 64 mars 13 19:21 /proc/12362/fd/0 -> socket:[108105298]
lrwx------ 1 www-data www-data 64 mars 13 19:21 /proc/12362/fd/1 -> /dev/null
lrwx------ 1 www-data www-data 64 mars 13 19:21 /proc/12362/fd/2 -> /dev/null
lrwx------ 1 www-data www-data 64 mars 13 19:21 /proc/12362/fd/3 -> /tmp/.ZendSem.nJyjfo (deleted)
or view the status:
Name: php-fpm7.0
Umask: 0022
State: S (sleeping)
Tgid: 12362
Ngid: 0
Pid: 12362
PPid: 31110
TracerPid: 0
or run a strace on the PID to find out what it's doing.
But I agree that without a crash, it's harder to find out what's going on.
@rcanavan The start time is reset whenever the cache is (explicitly or automatically) cleared.
OK, so the "seems busy (you may need to increase pm.start_servers ...)" messages start 4 seconds after the reset and repeat every second until the server reached pm.max_children.
@rcanavan this is the PID at the exact time you have a lock. If the thread disappears and releases the lock, you won't see it. But if it lasts a few seconds, you should be able to see every file used by the thread by issuing:
That would be somewhat surprising. If a PHP-FPM process had ever successfully processed a request, it should have left a line in our phpfpm-access.log with a corresponding field pid:"6399". If the process was still running, well, I should be able to find it with ps and inspect it with strace and gdb. If it crashed, I'd expect something like child 16937 exited on signal ... in the php_error.log. I see no such trace of the alleged PID, or for that matter, anything even close in any of the logs. Given how quickly we churn through PHP-FPM children (unlike any other processes on those servers), some nearby PIDs should be present somewhere close in time in any of the logs, but they are not.
@dugwood @rcanavan I've just checked the glibc rwlock implementation to find out what exactly it is FUTEX_WAITing for. As @rcanavan already suspected, this is not the pid. In fact, the value is pretty much meaningless: it is a counter that is incremented every time a FUTEX_WAKEUP is performed. So the only thing this number tells you is how many times there was an attempt to acquire a write lock that was not immediately satisfied. The tid of the lock holder (in case of a write lock, that is) is stored in __data.__writer.
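So, next time a worker is stuck, the holder's TID could be read with something like this (a sketch, assuming the same frame and field names as in the gdb output above; <pid> is a placeholder):

# Hedged sketch: read the rwlock's writer TID and reader count from a stuck worker.
gdb -p <pid> -batch \
    -ex 'frame 2' \
    -ex 'print cache->header->lock.__data.__writer' \
    -ex 'print cache->header->lock.__data.__nr_readers'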
And __data.__writer = 0 in my gdb backtrace above, so that won't help us find the cause. However, __nr_readers = 4294967293 still smells fishy.
@rcanavan Assuming that this is really the same issue as in #259, those values look sensible. The problem sequence is a wlock wlock wunlock wunlock on one thread. The second wlock will EDEADLK, thus the first wunlock will unlock the write lock (setting writer=0). The second wunlock will then decrease nr_readers, making it go negative. So at least writer=0, nr_readers=-1 would make sense in the situation. Not totally sure about the -3 though.
The reason for the -3 readers would be that the rwlock implementation has overflow protection for the number of readers. So when writer=0, nr_readers=-1 trying to acquire an rlock will EAGAIN, so that the conjugated wunlock ends up decrementing __nr_readers further.
@nikic you're right, that's not a PID! It's the file descriptor of the PID you're strace'ing. So if you're running a strace on PID=1234, and the futex says 5678, you should look at /proc/1234/fd/5678. Issuing a cat on this file will reveal the blocking thread/process.
@rcanavan note that ps aux doesn't always reveal all threads. For example I use Varnish, and ps auxf will show at most 3 or 4 master processes. But if I run pstree, then I'll see 800 cache-worker threads (each with a dedicated PID), and many others. But you're right, for the PHP-FPM processes, I can see all of them in ps aux.
I've added checks for failed wlock operations in https://github.com/krakjoe/apcu/commit/885ea551d7ee548189c868e58873b2610d483043. If a wlock fails we print a warning and abort the operation. That will at least prevent us from corrupting the lock state if a deadlock is detected.
@nikic So should we use the master branch to test for this error?
We had this issue; the behavior is: the latest N lines of php-fpm.slow.log are all about apcu_store, and the workers cannot handle any requests (nginx 499).
And php-fpm.log shows that the max execution time was reached and the workers were terminated, but no slow log entries are printed anymore.
@lichunqiang Which version are you currently using? I'd suggest trying the 5.1.11 release first (which definitely fixes an apcu_store deadlock). The additional changes on master may help with debugging and limiting damage, but they don't fix any specific issue.
@nikic APCu 5.1.9 and PHP 7.1.10
Thanks for your opinion, we will try the latest 5.1.11 release.
Does anybody here still experience deadlock issues with APCu 5.1.11? If so, I'd suggest trying the new 5.1.12 release as well.
@nikic I've just installed 5.1.12 (without the disable-rw-locks argument for compilation), and still have big issues with APCUIterator. If I run cache invalidation requests concurrently, everything goes down in a few seconds. For example:
$result = 0;
$apc = new APCUIterator('/cache-server-[0-9]+-timed-key-[0-9]+/', APC_ITER_KEY, 1000, APC_LIST_ACTIVE);
while (($entry = $apc->current()))
{
    $apc->next();
    if (apcu_delete($entry['key']))
    {
        $result++;
    }
}
will crash my server (with 5.1.9 or 5.1.12) when 3 threads are running on this code (perhaps only 2 will suffice, I don't know exactly).
It's not a crash per se, but requests are piling up. So far my options are:
- use apcu_delete instead of the iterator (as I can find out what the timestamp is, since it's stored in APCu)

Any clue? Or do I have to reproduce it in development, hoping to get a crash and a backtrace?
An old virtual machine was still using 5.1.9 with the disable-rw-locks option, and crashed the same way. So RW locks shouldn't be the culprit.
Does anybody here still experience deadlock issues with APCu 5.1.11? If so, I'd suggest trying the new 5.1.12 release as well.
Yes, just today we had an apc_store deadlock in production with 5.1.11 and PHP 7.2.7-1+ubuntu16.04. Will try 5.1.12 today.
Hi @mightydok, did upgrading to 5.1.12 solve your issue? We upgraded to 5.1.12 but are still having the deadlock issue.
For anyone still seeing a deadlock, would it be possible to provide a stack trace from a deadlocked process?
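A rough way to grab one non-interactively (<pid> is a placeholder; debug symbols for php-fpm and apcu.so are needed for readable frames):

# Hedged sketch: capture a full backtrace from a deadlocked worker.
gdb -p <pid> -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt full' > /tmp/apcu-deadlock-<pid>.txt 2>&1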
Hi @nikic, attaching gdb output from two stuck processes:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f3cfdd4d450 in futex_wait (private=
[New LWP 22170]
[New LWP 22171]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f3cfdd4d450 in futex_wait (private=
This ultimately leads to Apache hitting its MaxRequestWorkers limit, and it resolves only after restarting Apache.
PHP Version: 7.0.30-1~dotdeb+8.1 APCu Version: 5.1.12 Kernel: 3.16.36-1+deb8u1
Hi all:
For anyone running into deadlocks:
Please update to at least glibc 2.26 or use --disable-apcu-rwlocks.
Reason:
I debugged with glibc 2.23 and was able to consistently reproduce the deadlocks.
However, in my case the writer sits in front of the non-taken lock and simply does not grab it, or does not release it.
This might be because it never receives the signal needed to be scheduled again: attaching gdb to the process and then calling continue usually fixes the problem, even though the processes had made zero progress before.
However, in glibc 2.26 the rwlock implementation has been completely rewritten using CAS semantics and looks more robust overall.
Therefore testing with newer glibc is strongly recommended.
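To check which glibc a box is running (the exact output format varies by distro):

# Print the glibc version in use.
getconf GNU_LIBC_VERSION
ldd --version | head -n 1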
Yesterday, 4 (out of 12) of our production servers stopped responding within about 20 minutes of each other. On all four, all php-fpm processes that I have inspected were stuck in apcu_store(). All systems use APCu 5.1.7 with PHP 7.0.13 on CentOS 7.3. I'm looking for hints on what additional information I should collect to help isolate the root cause in case this problem ever occurs again. For example, how can I determine which process is currently holding the write lock?