liuqian1990 / memcached

Automatically exported from code.google.com/p/memcached
0 stars 0 forks source link

assoc_maintenance_thread unlocks mutex without locking first #331

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Checked in memcached-1.4.15.

assoc_maintenance_thread() (assoc.c) has a piece of code with:

slabs_rebalancer_resume();
....
slabs_rebalancer_pause();

The problem is that slabs_rebalancer_resume() is a mutex unlock() operation - 
and that is not allowed (the pthread declares its behavior as "undefined") 
before the mutex is ever locked...

I guess we need to call slabs_rebalancer_pause() in the beginning of 
assoc_maintenance_thread()?

Original issue reported on code.google.com by nyha...@gmail.com on 3 Jul 2013 at 8:27

GoogleCodeExporter commented 9 years ago
Did this cause a bug or is this from an audit?

That little bit of code in the assoc thing was really awkward and feels wrong 
to me, but that's where it ended.

The unlock is a bit of a bootstrapping issue that might go away if it's 
restructured a little to use a "once on startup" flag. It unlocks right before 
the thread sits on a condition, because when it wakes up from the condition the 
first thing it does is lock the slab mover.

So if you add a pause before that resume it'll just deadlock when the expander 
actually runs the first time.

If it's causing some bug we'll see about fixing it sooner than later, though?

Original comment by dorma...@rydia.net on 3 Jul 2013 at 8:36

GoogleCodeExporter commented 9 years ago
It caused a bug, but not on Linux (where the spurious pthread_mutex_unlock() 
seems to be silently ignored) but rather a new operating system, whose 
pthread_mutex_unlock() croaks if called on an already-unlocked mutex.

What if we add a call to pause in the *beginning* of the 
assoc_maintenance_thread() function (not right before the resume) - wouldn't 
this be the right fix? Then, any time the assoc_maintenance_thread() function 
is working, the rebalancer is paused - except during the wait where the 
rebalancer is explicitly resumed.

Original comment by nyha...@gmail.com on 3 Jul 2013 at 8:57

GoogleCodeExporter commented 9 years ago
For the record, here is a one-line patch I've been using to fix this bug, and 
it seems to be working well:

--- src/assoc.c 2013-11-18 13:00:26.080026394 +0200
+++ src-save/assoc.c    2013-07-03 11:27:41.113465014 +0300
@@ -203,6 +203,7 @@

 static void *assoc_maintenance_thread(void *arg) {

+    slabs_rebalancer_pause();
     while (do_run_maintenance_thread) {
         int ii = 0;

Original comment by n...@cloudius-systems.com on 13 May 2014 at 5:50

GoogleCodeExporter commented 9 years ago
fixed in 'next'.. no longer calls resume without pausing first.

Original comment by dorma...@rydia.net on 28 Dec 2014 at 3:18