
Use mutex for SMA lock #338

Closed. nikic closed this 5 years ago.

nikic commented 5 years ago

Currently the lock for the shared memory allocator also uses an rwlock. However, read locks are only taken when fetching allocator statistics, which is not a common operation. As such, this lock is used almost exclusively for write locking. It would be more efficient to use a mutex directly.

The main part of making this work would be to change the lock abstraction to support both rwlocks and mutexes at the same time, rather than just one of them.
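
For illustration, a minimal sketch (hypothetical helper names, not APCu's actual API) of the two process-shared lock kinds such an abstraction would need to create side by side:

#include <pthread.h>

/* Sketch: both lock kinds created as process-shared, so e.g. the cache
 * can keep an rwlock while the SMA uses a plain mutex. */
static int create_shared_rwlock(pthread_rwlock_t *l) {
    pthread_rwlockattr_t a;
    pthread_rwlockattr_init(&a);
    pthread_rwlockattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
    return pthread_rwlock_init(l, &a) == 0;
}

static int create_shared_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t a;
    pthread_mutexattr_init(&a);
    pthread_mutexattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
    return pthread_mutex_init(m, &a) == 0;
}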

LionsAd commented 5 years ago

@nikic Can we also remove all the cruft from apc_lock.c while we're at it? It feels very hard to work with right now ...

I would suggest we have separate, minimal implementations for mutex locks vs. rwlocks.

LionsAd commented 5 years ago

I have that implemented; benchmarks for ab -n80 -c40 with a 50000-iteration write loop over 21 different keys:

Before: 8.795 seconds
After: 7.336 seconds

So ~20% faster out of the box (8.795 / 7.336 ≈ 1.20).

Right now the main write lock is the bottleneck, but with that removed the speed gets close to lockless.

Well worth it.

LionsAd commented 5 years ago

It may well be that changing the write lock for the SMA to a mutex will fix the remaining deadlocks.

Currently I am able to consistently make APCu slow down / deadlock (?) when removing all locks from the write / read hashtable paths and using rwlocks for the SMA ...

LionsAd commented 5 years ago

But with a mutex that never happens.

nikic commented 5 years ago

It may well be that changing the write lock for the SMA to a mutex will fix the remaining deadlocks.

Currently I am able to consistently make APCu slow down / deadlock (?) when removing all locks from the write / read hashtable paths and using rwlocks for the SMA ...

That seems very weird to me. If the rwlock is used exclusively for write-locks I would expect that rwlock and mutex (apart from perf characteristics) will work the same way. The only difference here I can think of is that the mutex is constructed as recursive, while rwlocks are not write-recursive (only read-recursive). However I don't see any place in the SMA code where we could lock recursively.
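
A minimal standalone illustration of that recursion difference (not APCu code):

#include <pthread.h>
#include <stdio.h>

int main(void) {
    /* A recursive mutex tolerates re-locking by the owning thread ... */
    pthread_mutexattr_t ma;
    pthread_mutex_t m;
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_settype(&ma, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&m, &ma);
    pthread_mutex_lock(&m);
    printf("recursive relock: %d\n", pthread_mutex_lock(&m));      /* 0 */
    pthread_mutex_unlock(&m);
    pthread_mutex_unlock(&m);

    /* ... while an rwlock is not write-recursive: a second write lock
     * from the holder deadlocks or fails, as trywrlock shows. */
    pthread_rwlock_t rw;
    pthread_rwlock_init(&rw, NULL);
    pthread_rwlock_wrlock(&rw);
    printf("re-wrlock (try): %d\n", pthread_rwlock_trywrlock(&rw)); /* EBUSY */
    pthread_rwlock_unlock(&rw);
    return 0;
}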

LionsAd commented 5 years ago

That seems very weird to me. If the rwlock is used exclusively for write-locks I would expect that rwlock and mutex (apart from perf characteristics) will work the same way. The only difference here I can think of is that the mutex is constructed as recursive, while rwlocks are not write-recursive (only read-recursive). However I don't see any place in the SMA code where we could lock recursively.

I will try it again now that tests are passing.

LionsAd commented 5 years ago

@nikic Yes, I can reproduce it.

I applied the following diff:

diff --git a/apc_lock.c b/apc_lock.c
index 562d173..62f1324 100644
--- a/apc_lock.c
+++ b/apc_lock.c
@@ -206,6 +206,7 @@ PHP_APCU_API zend_bool apc_lock_create(apc_lock_t *lock) {
 }

 PHP_APCU_API zend_bool apc_lock_rlock(apc_lock_t *lock) {
+return 1;
 #ifndef PHP_WIN32
 #ifndef APC_SPIN_LOCK
 # ifndef APC_FCNTL_LOCK
@@ -261,6 +262,7 @@ static inline zend_bool apc_lock_wlock_impl(apc_lock_t *lock) {

 PHP_APCU_API zend_bool apc_lock_wlock(apc_lock_t *lock) {
    HANDLE_BLOCK_INTERRUPTIONS();
+return 1;
    if (apc_lock_wlock_impl(lock)) {
        return 1;
    }
@@ -271,6 +273,7 @@ PHP_APCU_API zend_bool apc_lock_wlock(apc_lock_t *lock) {
 }

 PHP_APCU_API zend_bool apc_lock_wunlock(apc_lock_t *lock) {
+return 1;
 #ifndef PHP_WIN32
 #ifndef APC_SPIN_LOCK
 # ifndef APC_FCNTL_LOCK
@@ -298,6 +301,7 @@ PHP_APCU_API zend_bool apc_lock_wunlock(apc_lock_t *lock) {
 }

 PHP_APCU_API zend_bool apc_lock_runlock(apc_lock_t *lock) {
+return 1;
 #ifndef PHP_WIN32
 #ifndef APC_SPIN_LOCK
 # ifndef APC_FCNTL_LOCK
diff --git a/apc_mutex.c b/apc_mutex.c
index 1f7a25d..1bec6d4 100644
--- a/apc_mutex.c
+++ b/apc_mutex.c
@@ -20,7 +20,7 @@
 #ifdef APC_HAS_PTHREAD_MUTEX

 static zend_bool apc_mutex_ready = 0;
-static pthread_mutexattr_t apc_mutex_attr;
+static pthread_rwlockattr_t apc_mutex_attr;

 PHP_APCU_API zend_bool apc_mutex_init() {
    if (apc_mutex_ready) {
@@ -28,11 +28,11 @@ PHP_APCU_API zend_bool apc_mutex_init() {
    }
    apc_mutex_ready = 1;

-   if (pthread_mutexattr_init(&apc_mutex_attr) != SUCCESS) {
+   if (pthread_rwlockattr_init(&apc_mutex_attr) != SUCCESS) {
        return 0;
    }

-   if (pthread_mutexattr_setpshared(&apc_mutex_attr, PTHREAD_PROCESS_SHARED) != SUCCESS) {
+   if (pthread_rwlockattr_setpshared(&apc_mutex_attr, PTHREAD_PROCESS_SHARED) != SUCCESS) {
        return 0;
    }

@@ -45,17 +45,17 @@ PHP_APCU_API void apc_mutex_cleanup() {
    }
    apc_mutex_ready = 0;

-   pthread_mutexattr_destroy(&apc_mutex_attr);
+   pthread_rwlockattr_destroy(&apc_mutex_attr);
 }

 PHP_APCU_API zend_bool apc_mutex_create(apc_mutex_t *lock) {
-   pthread_mutex_init(lock, &apc_mutex_attr);
+   pthread_rwlock_init(lock, &apc_mutex_attr);
    return 1;
 }

 PHP_APCU_API zend_bool apc_mutex_lock(apc_mutex_t *lock) {
    HANDLE_BLOCK_INTERRUPTIONS();
-   if (pthread_mutex_lock(lock) == 0) {
+   if (pthread_rwlock_wrlock(lock) == 0) {
        return 1;
    }

@@ -65,13 +65,13 @@ PHP_APCU_API zend_bool apc_mutex_lock(apc_mutex_t *lock) {
 }

 PHP_APCU_API zend_bool apc_mutex_unlock(apc_mutex_t *lock) {
-   pthread_mutex_unlock(lock);
+   pthread_rwlock_unlock(lock);
    HANDLE_UNBLOCK_INTERRUPTIONS();
    return 1;
 }

 PHP_APCU_API void apc_mutex_destroy(apc_mutex_t *lock) {
-   pthread_mutex_destroy(lock);
+   pthread_rwlock_destroy(lock);
 }

 #endif
diff --git a/apc_mutex.h b/apc_mutex.h
index 1f3ee87..6a63779 100644
--- a/apc_mutex.h
+++ b/apc_mutex.h
@@ -25,7 +25,7 @@

 #include "pthread.h"

-typedef pthread_mutex_t apc_mutex_t;
+typedef pthread_rwlock_t apc_mutex_t;

 PHP_APCU_API zend_bool apc_mutex_init();
 PHP_APCU_API void apc_mutex_cleanup();

based on my branch, and it consistently locks up php-fpm.

LionsAd commented 5 years ago

I think I see a problem:

                        SMA_UNLOCK(sma, i);
                        sma->last = i;

This means that if someone takes a lock on sma->last afterwards, they would later release a different lock than the one they acquired ...

Before taking a lock we need to save sma->last, so that the same lock is consistently acquired and released.
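
A reduced sketch of the hazard (standalone, hypothetical names; not the actual SMA code):

#include <pthread.h>

/* An array of segment locks and a shared "last used" index, as in the SMA. */
#define NSEG 4
static pthread_mutex_t seg_lock[NSEG] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
static volatile unsigned last;  /* shared between processes, like sma->last */

void racy(void) {
    pthread_mutex_lock(&seg_lock[last]);    /* may lock segment i ...     */
    /* ... another process changes last ... */
    pthread_mutex_unlock(&seg_lock[last]);  /* ... but unlock segment j   */
}

void safe(void) {
    unsigned i = last;                      /* pin the index in a local   */
    pthread_mutex_lock(&seg_lock[i]);
    /* last may still move, but lock and unlock target the same element   */
    pthread_mutex_unlock(&seg_lock[i]);
}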

LionsAd commented 5 years ago

Hmmm, that fixed it for a while, but now it's back to 100% CPU again.

LionsAd commented 5 years ago

No, it's just locked, not busy waiting.

Backtrace 1:

#0  0x00007f972577d730 in futex_wait (private=<optimized out>, expected=7141602, futex_word=0x7f9700ca900c) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=7141602, futex_word=0x7f9700ca900c) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_rwlock_wrlock_slow (rwlock=0x7f9700ca9000) at pthread_rwlock_wrlock.c:67
#3  0x00007f972577d918 in __GI___pthread_rwlock_wrlock (rwlock=<optimized out>) at pthread_rwlock_wrlock.c:124
#4  0x00007f9721ad1920 in apc_mutex_lock (lock=<optimized out>) at /root/ext/foo/apcu/apc_mutex.c:58
#5  0x00007f9721ad65ef in apc_sma_malloc_ex (sma=sma@entry=0x7f9721cddcc0 <apc_sma>, n=136, fragment=fragment@entry=40, allocated=allocated@entry=0x7ffd722d3d10)
    at /root/ext/foo/apcu/apc_sma.c:390
#6  0x00007f9721ad6811 in apc_sma_malloc (sma=sma@entry=0x7f9721cddcc0 <apc_sma>, n=<optimized out>) at /root/ext/foo/apcu/apc_sma.c:465
#7  0x00007f9721ad961d in apc_persist (sma=0x7f9721cddcc0 <apc_sma>, serializer=0x0, orig_entry=orig_entry@entry=0x7ffd722d3e30) at /root/ext/foo/apcu/apc_persist.c:424
#8  0x00007f9721ad525c in apc_cache_store (cache=0x555fe39e15b0, key=0x7f972285eb90, val=0x7f972281b150, ttl=<optimized out>, exclusive=exclusive@entry=0 '\000')

Backtrace 2:

#0  0x00007f972577d730 in futex_wait (private=<optimized out>, expected=42254, futex_word=0x7f9700ca900c) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=42254, futex_word=0x7f9700ca900c) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_rwlock_wrlock_slow (rwlock=0x7f9700ca9000) at pthread_rwlock_wrlock.c:67
#3  0x00007f972577d918 in __GI___pthread_rwlock_wrlock (rwlock=<optimized out>) at pthread_rwlock_wrlock.c:124
#4  0x00007f9721ad1920 in apc_mutex_lock (lock=<optimized out>) at /root/ext/foo/apcu/apc_mutex.c:58
#5  0x00007f9721ad65ef in apc_sma_malloc_ex (sma=sma@entry=0x7f9721cddcc0 <apc_sma>, n=136, fragment=fragment@entry=40, allocated=allocated@entry=0x7ffd722d3d10)
    at /root/ext/foo/apcu/apc_sma.c:390
#6  0x00007f9721ad6811 in apc_sma_malloc (sma=sma@entry=0x7f9721cddcc0 <apc_sma>, n=<optimized out>) at /root/ext/foo/apcu/apc_sma.c:465
#7  0x00007f9721ad961d in apc_persist (sma=0x7f9721cddcc0 <apc_sma>, serializer=0x0, orig_entry=orig_entry@entry=0x7ffd722d3e30) at /root/ext/foo/apcu/apc_persist.c:424
#8  0x00007f9721ad525c in apc_cache_store (cache=0x555fe39e15b0, key=0x7f9722861bb8, val=0x7f972281e150, ttl=<optimized out>, exclusive=exclusive@entry=0 '\000')

LionsAd commented 5 years ago

This decreases the chance of a deadlock, and eventually the lock resolves:

diff --git a/apc_sma.c b/apc_sma.c
index 24c05d1..8bd7e8e 100644
--- a/apc_sma.c
+++ b/apc_sma.c
@@ -381,15 +381,20 @@ PHP_APCU_API void apc_sma_cleanup(apc_sma_t* sma) {

 PHP_APCU_API void* apc_sma_malloc_ex(apc_sma_t* sma, zend_ulong n, zend_ulong fragment, zend_ulong* allocated) {
    size_t off;
-   uint i;
+   uint i, last;
    int nuked = 0;

 restart:
    assert(sma->initialized);

-   if (!SMA_LOCK(sma, sma->last)) {
+       last = sma->last;
+   if (!SMA_LOCK(sma, last)) {
        return NULL;
    }
+   // WE HAVE THE LOCK, WE DECIDE WHAT IS LAST.
+   if (last != sma->last) {
+       sma->last = last;
+   }

    off = sma_allocate(SMA_HDR(sma, sma->last), n, fragment, allocated);

@@ -398,9 +403,16 @@ restart:
        SMA_UNLOCK(sma, sma->last);
        sma->expunge(
            *(sma->data), (n+fragment));
-       if (!SMA_LOCK(sma, sma->last)) {
+
+       last = sma->last;
+       if (!SMA_LOCK(sma, last)) {
            return NULL;
        }
+       // WE HAVE THE LOCK, WE DECIDE WHAT IS LAST.
+       if (last != sma->last) {
+           sma->last = last;
+       }
+
        off = sma_allocate(SMA_HDR(sma, sma->last), n, fragment, allocated);
    }

@@ -416,7 +428,7 @@ restart:
    SMA_UNLOCK(sma, sma->last);

    for (i = 0; i < sma->num; i++) {
-       if (i == sma->last) {
+       if (i == last) {
            continue;
        }

@@ -437,8 +449,8 @@ restart:
        }
        if (off != -1) {
            void* p = (void *)(SMA_ADDR(sma, i) + off);
-           SMA_UNLOCK(sma, i);
            sma->last = i;
+           SMA_UNLOCK(sma, i);
 #ifdef VALGRIND_MALLOCLIKE_BLOCK
            VALGRIND_MALLOCLIKE_BLOCK(p, n, 0, 0);
 #endif

But it also shows that using ->last is not safe.

nikic commented 5 years ago

This means that if someone takes a lock on sma->last afterwards, they would later release a different lock than the one they acquired ...

Before taking a lock we need to save sma->last, so that the same lock is consistently acquired and released.

That's a very good point. This is definitely a bug. However I think it should be sufficient to move the sma->last = i assignment one line up, before the lock is released. That makes sure that last is never going to change while you hold the lock (unless you change it yourself).

However, unless you are on Windows this shouldn't make a difference in the end, because you'll be using mmap and there will only be one SMA segment (so last is always 0).

LionsAd commented 5 years ago

@nikic That is really weird. I see that there should only be one segment, but playing with sma->last definitely makes a difference, which it should not if last is always 0.

LionsAd commented 5 years ago

Okay, I tested this some more and what happens is:

[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[pid 28888] futex(0x7fe765ea900c, FUTEX_WAKE, 1) = 0
[...]

So it seems it signals again and again that the futex is available, but no one grabs it ...

nikic commented 5 years ago

I've pushed https://github.com/krakjoe/apcu/commit/1c30ee3395f8bb2bc7796c90adc994223975f6ae to fix the last issue.

nikic commented 5 years ago

After thinking about this more carefully I followed your suggestion and used a local variable for last: https://github.com/krakjoe/apcu/commit/e93ee5c4af00809c4b275495e18c6df1edebde3e Otherwise there would still be a race between the time sma->last is read and the time the lock is actually acquired.

LionsAd commented 5 years ago

Thanks - I found out that the rwlock on at least my glibc (2.23) is buggy:

3175
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> up 1; print rwlock.__data 
$1 = {__lock = 0, __nr_readers = 0, __readers_wakeup = 0, __writer_wakeup = 14249440, __nr_readers_queued = 0, __nr_writers_queued = 7, __writer = 0, __shared = 0, 
  __rwelision = 0 '\000', __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0}

Still the process does not take it.

Later:

3175
(gdb) #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fcc9858b841 in __GI___pthread_rwlock_wrlock (rwlock=0x7fcc73aa8000)
#2  0x00007fcc948d0935 in apc_mutex_lock (lock=<optimized out>)
#3  0x00007fcc948d56cf in apc_sma_malloc_ex (
3176
(gdb) #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fcc9858b841 in __GI___pthread_rwlock_wrlock (rwlock=0x7fcc73aa8000)
#2  0x00007fcc948d0935 in apc_mutex_lock (lock=<optimized out>)
#3  0x00007fcc948d5772 in apc_sma_malloc_ex (

Two processes have entered the "fast track" and are competing for the lock; only when it eventually times out do things make progress again ...

But a writer sitting in front of a free lock and not taking it is interesting ...

I wonder if that is because there are no readers.

--

I won't dig into this further, because we have mutexes for the SMA now, but it was still a really weird issue.

LionsAd commented 5 years ago

Okay, it seems we should suggest that anyone using rwlocks update to at least glibc 2.26: anything at the 2.23 level will have the deadlock problem, and between 2.24 and 2.26 there are a handover bug and a mutex bug.
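
If one wanted to enforce that at runtime, a sketch along these lines would work on glibc (the 2.26 cutoff comes from this thread, not from an official APCu check; gnu_get_libc_version() is glibc-specific):

#include <gnu/libc-version.h>   /* glibc-specific */
#include <stdio.h>

/* Warn when the running glibc predates the rwlock fixes mentioned above. */
static int rwlocks_look_safe(void) {
    int major = 0, minor = 0;
    sscanf(gnu_get_libc_version(), "%d.%d", &major, &minor);
    return major > 2 || (major == 2 && minor >= 26);
}

int main(void) {
    if (!rwlocks_look_safe())
        fprintf(stderr, "glibc %s: prefer mutexes over rwlocks\n",
                gnu_get_libc_version());
    return 0;
}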

LionsAd commented 5 years ago

@nikic I found the BUG:

[18-Nov-2018 23:11:16 UTC] PHP Warning:  PHP Shutdown: DESTROY LOCK in Unknown on line 0

We de-initialize and re-initialize the shared memory region again and again, and in the course of this change we made mutex_destroy() an actual destroy instead of a no-op.

That means the mutex is invalid afterwards ...

But what's worse, the whole shared memory region is also invalid for a short moment ...

I think because it uses MMAP (in my version) it just re-maps the same memory region, so the effect is not directly visible -- except when locks get destroyed ...

I think the way to solve this is to add a global counter around initialized and increment / decrement it, so that the shared memory region is not prematurely removed ...

LionsAd commented 5 years ago

And the difference here is that pthread_rwlock just spins indefinitely (deadlocks) when used on an invalid lock, while pthread_mutex can detect that and returns EINVAL.
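
So on the mutex path the failure is at least detectable. A sketch of checking for it (illustrative error handling, not APCu's; note that locking a destroyed mutex is formally undefined behavior, so the EINVAL report is best-effort):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* glibc tends to report EINVAL for a destroyed/uninitialized mutex, which
 * is still far better than the silent spin of the rwlock path. */
static int checked_lock(pthread_mutex_t *m) {
    int rc = pthread_mutex_lock(m);
    if (rc != 0) {
        fprintf(stderr, "mutex lock failed: %s\n", strerror(rc));
        return 0;
    }
    return 1;
}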

nikic commented 5 years ago

Oh. My. God. That explains a lot.

Looking at various destroy/cleanup code, there's even this comment: https://github.com/krakjoe/apcu/blob/master/apc_cache.c#L662 But this unmap is still performed, and it's also not a no-op like the lock destroy: https://github.com/krakjoe/apcu/blob/e93ee5c4af00809c4b275495e18c6df1edebde3e/apc_sma.c#L367

I'm not sure if just adding a counter will be enough -- if a new process is forked during a request and that process finishes, it's probably still going to go through MSHUTDOWN (or not?)

LionsAd commented 5 years ago

Yeah, indeed. I am right now trying to understand how extensions should do cleanup ...

OHHHH and you are right, php-fpm can fork() a new child while it's in progress ...

Do we need to keep a pid list of registered users of the SHM?

LionsAd commented 5 years ago

But we do know that PHP will call our INIT and SHUTDOWN functions, so a shared counter in the shared memory should work.
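
A sketch of what that shared counter could look like (hypothetical names, C11 atomics in the segment header; not APCu code):

#include <stdatomic.h>

/* Each process increments on init and decrements on shutdown; only the
 * last one out tears the segment down. */
typedef struct {
    atomic_int refcount;
    /* ... rest of the shared header ... */
} shm_header_t;

void shm_attach(shm_header_t *hdr) {
    atomic_fetch_add(&hdr->refcount, 1);
}

int shm_detach(shm_header_t *hdr) {
    /* returns 1 only for the last detacher, which may then destroy the
     * locks and unmap for real */
    return atomic_fetch_sub(&hdr->refcount, 1) == 1;
}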

LionsAd commented 5 years ago

But yeah, this means that e.g. the WLOCK could have gotten out-of-sync if it was re-created, etc.

It's kinda a wonder this worked at all ;).

nikic commented 5 years ago

Hrm, I'm somewhat confused now. If we take the implementation as it is right now (no freeing of locks), is there an issue or not? I thought the munmap call would be problematic, but I think it actually only unmaps the segment from the address space of the current process, which is fine.

The initialization would be a problem, but only if it's actually rerun (there's the APCG initialized flag). If processes are only ever forked post-initialization (is that true?) this would not be a problem (?)

LionsAd commented 5 years ago

Okay, so just for my own sanity documenting what I know so far:

But so far everything works.

Even if a local extension calls SHUTDOWN, it will uninit only its own process memory (which is fine) EXCEPT that:

LionsAd commented 5 years ago

So yeah, only applying to the new implementation:

What about:

diff --git a/apc_sma.c b/apc_sma.c
index cc3c19a..ab44cc1 100644
--- a/apc_sma.c
+++ b/apc_sma.c
@@ -49,7 +49,6 @@ enum {

 typedef struct sma_header_t sma_header_t;
 struct sma_header_t {
-   apc_mutex_t sma_lock;    /* segment lock */
    size_t segsize;         /* size of entire segment */
    size_t avail;           /* bytes available (not necessarily contiguous) */
 };
@@ -57,7 +56,7 @@ struct sma_header_t {
 #define SMA_HDR(sma, i)  ((sma_header_t*)((sma->segs[i]).shmaddr))
 #define SMA_ADDR(sma, i) ((char*)(SMA_HDR(sma, i)))
 #define SMA_RO(sma, i)   ((char*)(sma->segs[i]).roaddr)
-#define SMA_LCK(sma, i)  ((SMA_HDR(sma, i))->sma_lock)
+#define SMA_LCK(sma, i)  (sma->sma_lock)

 #define SMA_CREATE_LOCK  APC_CREATE_MUTEX
 #define SMA_DESTROY_LOCK APC_DESTROY_MUTEX
@@ -301,6 +300,7 @@ PHP_APCU_API void apc_sma_init(apc_sma_t* sma, void** data, apc_sma_expunge_f ex
    sma->size = size > 0 ? size : DEFAULT_SEGSIZE;

    sma->segs = (apc_segment_t*) pemalloc(sma->num * sizeof(apc_segment_t), 1);
+   SMA_CREATE_LOCK(&sma->sma_lock);

    for (i = 0; i < sma->num; i++) {
        sma_header_t*   header;
@@ -327,7 +327,6 @@ PHP_APCU_API void apc_sma_init(apc_sma_t* sma, void** data, apc_sma_expunge_f ex
        shmaddr = sma->segs[i].shmaddr;

        header = (sma_header_t*) shmaddr;
-       SMA_CREATE_LOCK(&header->sma_lock);
        header->segsize = sma->size;
        header->avail = sma->size - ALIGNWORD(sizeof(sma_header_t)) - ALIGNWORD(sizeof(block_t)) - ALIGNWORD(sizeof(block_t));

@@ -365,9 +364,9 @@ PHP_APCU_API void apc_sma_cleanup(apc_sma_t* sma) {
    uint i;

    assert(sma->initialized);
+   SMA_DESTROY_LOCK(&sma->sma_lock);

    for (i = 0; i < sma->num; i++) {
-       SMA_DESTROY_LOCK(&SMA_LCK(sma, i));
 #if APC_MMAP
        apc_unmap(&sma->segs[i]);
 #else
diff --git a/apc_sma_api.h b/apc_sma_api.h
index b04faf2..6f50fce 100644
--- a/apc_sma_api.h
+++ b/apc_sma_api.h
@@ -26,6 +26,9 @@

 #ifndef HAVE_APC_SMA_API_H
 #define HAVE_APC_SMA_API_H
+
+#include "apc_mutex.h"
+
 /* {{{ SMA API
    APC SMA API provides support for shared memory allocators to external libraries ( and to APC )
    Skip to the bottom macros for error free usage of the SMA API
@@ -75,6 +78,7 @@ typedef struct _apc_sma_t {

    /* segments */
    apc_segment_t* segs;           /* segments */
+        apc_mutex_t sma_lock;          /* lock */
 } apc_sma_t; /* }}} */

 /*

nikic commented 5 years ago

The mutex has to be part of shared memory, otherwise every process will just have its own independent mutex for which it can always acquire a lock.
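
A minimal standalone demo of that point: the mutex lives in MAP_SHARED memory and is marked PTHREAD_PROCESS_SHARED before fork(), so parent and child really contend on the same lock (compile with -pthread):

#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    pthread_mutexattr_t a;
    if (m == MAP_FAILED)
        return 1;
    pthread_mutexattr_init(&a);
    pthread_mutexattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(m, &a);           /* initialize once, before fork() */

    if (fork() == 0) {                   /* child and parent share *m */
        pthread_mutex_lock(m);
        /* critical section is exclusive across both processes */
        pthread_mutex_unlock(m);
        _exit(0);
    }
    pthread_mutex_lock(m);
    pthread_mutex_unlock(m);
    return 0;
}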

LionsAd commented 5 years ago

I thought so, too - but that would have meant the normal mutex would be broken, too. I think the mutex works fine if you create it once as PSHARED and then fork(), which is what we do in apc_cache.c.

LionsAd commented 5 years ago

This was implemented and the discussion is finished, so I suggest we close this now.