juju / mutex

Provides a named, machine-level mutex shareable between processes.

Blocking mutexes #4

Closed axw closed 6 years ago

axw commented 6 years ago

Update mutex implementations to make blocking syscalls, rather than polling. This should ensure fairer scheduling.

On Windows, we now use named mutexes.

On Linux and MacOS, we use blocking flock. Due to the way signals work in Go we cannot interrupt a flock syscall, so instead we allow the syscall to complete and then release the lock if nothing is waiting for it. To avoid an unbounded growth of goroutines making flock syscalls, we ensure that only one goroutine is making a call at a time.
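As a rough illustration of that approach (a minimal sketch, not the actual juju/mutex code; the package and function names are mine, and it omits the per-process bookkeeping that limits flock to a single goroutine at a time), a timed acquire built on syscall.Flock might look like this:

package flocklock

import (
        "fmt"
        "os"
        "syscall"
        "time"
)

// acquireFlock blocks in flock(2) on a separate goroutine and delivers the
// result on a channel, so the caller can stop waiting without interrupting
// the syscall itself.
func acquireFlock(path string, timeout time.Duration) (*os.File, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_RDONLY, 0600)
        if err != nil {
                return nil, err
        }
        done := make(chan error, 1)
        go func() {
                // Blocking call: returns only once the lock is held (or on error).
                done <- syscall.Flock(int(f.Fd()), syscall.LOCK_EX)
        }()
        select {
        case err := <-done:
                if err != nil {
                        f.Close()
                        return nil, err
                }
                return f, nil
        case <-time.After(timeout):
                // The flock syscall cannot be interrupted, so let it finish in
                // the background and release the lock immediately, since the
                // caller no longer wants it.
                go func() {
                        if err := <-done; err == nil {
                                syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
                        }
                        f.Close()
                }()
                return nil, fmt.Errorf("timed out acquiring lock %q", path)
        }
}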

For compatibility with older implementations, the Linux implementation also acquires an abstract domain socket. On Windows, we do the same with a named semaphore.
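The legacy Linux lock can be sketched like this (illustrative only; the socket naming here is not what juju/mutex actually uses). Holding the listener open holds the lock, and closing it, or process exit, releases it:

package legacylock

import "net"

// acquireAbstractSocket takes the legacy lock by listening on an abstract
// unix domain socket (the leading "@" places it in the abstract namespace,
// so nothing is created on the filesystem).
func acquireAbstractSocket(name string) (net.Listener, error) {
        return net.Listen("unix", "@"+name)
}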

Tested on Linux and Wine (Windows).

axw commented 6 years ago

Blocking juju/mutex

axw commented 6 years ago

To test the fairness, I wrote this little program:

package main

import (
        "fmt"
        "sync/atomic"
        "time"

        "github.com/juju/mutex"
        "github.com/juju/utils/clock"
)

func main() {
        spec := mutex.Spec{
                Name:  "foo",
                Clock: clock.WallClock,
                Delay: 250 * time.Millisecond,
        }

        // Report how many acquisitions happened in each 5-second window.
        var count uint64
        go func() {
                ticker := time.NewTicker(5 * time.Second)
                last := time.Now()
                for now := range ticker.C {
                        count := atomic.SwapUint64(&count, 0)
                        fmt.Println(count, "acquisitions over", now.Sub(last))
                        last = now
                }
        }()

        // Repeatedly acquire the mutex, hold it for 100ms, then release it.
        for {
                m, err := mutex.Acquire(spec)
                if err != nil {
                        panic(err)
                }
                atomic.AddUint64(&count, 1)
                time.Sleep(100 * time.Millisecond)
                m.Release()
        }
}

I ran two instances side by side. With the old mutex implementation, one process would mostly starve out the other; with the new implementation, they are roughly equal.

jameinel commented 6 years ago

I can confirm the fairness on Mac OS X. With the old code and 2 processes it was 0 acquisitions over 5s vs 50 acquisitions over 5s pretty much the entire time. Occasionally it would switch to the other process, but then that one would stay at 50 and the original would be starved.

With the new implementation it says 24-26 for both sides.

jameinel commented 6 years ago

I confirmed that the updated code was not fair on Windows, and the test suite does not pass:

C:\dev\go\src\github.com\juju\mutex [mutex-blocking +1 ~0 -0 !]> go test --check.v
PASS: mutex_test.go:152: mutexSuite.TestDifferentNamesDontBlock 0.001s
PASS: mutex_test.go:105: mutexSuite.TestLockAfterTimeout 0.001s

FAIL: mutex_test.go:166: mutexSuite.TestLockBlocks
mutex_test.go:176: if c.Check(err, gc.IsNil) { ... r.Release() } ... value errors.Err = &errors.Err{message:"", cause:(errors.Err)(0xc04209c000), previous:(*errors.Err)(0xc04209c000), file:"github.com/juju/mutex/legacy_mutex_windows.go", line:37} ("timeout acquiring mutex")

PASS: mutex_test.go:123: mutexSuite.TestLockContentionWithinProcessCancel 0.000s
PASS: mutex_test.go:92: mutexSuite.TestLockContentionWithinProcessTimeout 0.001s
PASS: mutex_test.go:86: mutexSuite.TestLockNoContention 0.000s

FAIL: mutex_test.go:195: mutexSuite.TestProcessReleasesWhenDead
mutex_test.go:206: if c.Check(err, gc.IsNil) { ... r.Release() } ... value errors.Err = &errors.Err{message:"", cause:(errors.Err)(0xc04209c000), previous:(*errors.Err)(0xc04209c000), file:"github.com/juju/mutex/legacy_mutex_windows.go", line:37} ("timeout acquiring mutex")
mutex_test.go:219: c.Fatalf("timout waiting for mutex to be acquired") ... Error: timout waiting for mutex to be acquired

PASS: mutex_test.go:145: mutexSuite.TestSecondReleaseFine 0.001s
PASS: mutex_test.go:33: mutexSuite.TestSpecValidity 0.000s
PASS: mutex_test.go:225: mutexSuite.TestStress 0.043s
OOPS: 8 passed, 2 FAILED
--- FAIL: Test (5.10s)
FAIL
exit status 1
FAIL github.com/juju/mutex 5.134s

And it seems to have a bug, because both sides ended up with:

C:\dev\go\src\github.com\juju\mutex\fair [mutex-blocking +1 ~0 -0 !]> go run .\fair.go
13 acquisitions over 5.0014538s
0 acquisitions over 5.00037s

(both left and right gave 0 acquisitions / 5s, which means something is just broken)

jameinel commented 6 years ago

Interesting note: once I killed the original process, the second process started to be able to acquire the lock.

So I ran it again, and if I just run 1 script, then it manages to get the lock and says 50/5s reliably. However, as soon as I start a second one they both go to 0/5s. Stopping the second one does not allow the first one to recover. Stopping the first one allows the second one to go into 50/5s.

So it seems that under contention, the process that has held the lock tries to let go but does not actually let go until that process dies.

axw commented 6 years ago

@jameinel thanks for picking that up.

The issue was that I had mistakenly dropped a call to ReleaseMutex. The owning thread needs to call ReleaseMutex before closing the handle. I thought closing it would be enough, but apparently not.
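As a reference for the fix, here is a minimal sketch of that acquire/release pairing, assuming the golang.org/x/sys/windows wrappers (the helper names are mine, not the juju/mutex implementation):

//go:build windows

package winlock

import (
        "runtime"

        "golang.org/x/sys/windows"
)

// acquireNamed creates (or opens) the named mutex and waits for ownership.
// Mutex ownership belongs to the OS thread, so pin the goroutine first.
func acquireNamed(name string) (windows.Handle, error) {
        runtime.LockOSThread()
        namep, err := windows.UTF16PtrFromString(name)
        if err != nil {
                runtime.UnlockOSThread()
                return 0, err
        }
        // CreateMutex succeeds even if another process created the mutex
        // first; in that case it returns a handle to the existing object.
        h, err := windows.CreateMutex(nil, false, namep)
        if err != nil && err != windows.ERROR_ALREADY_EXISTS {
                runtime.UnlockOSThread()
                return 0, err
        }
        if _, err := windows.WaitForSingleObject(h, windows.INFINITE); err != nil {
                windows.CloseHandle(h)
                runtime.UnlockOSThread()
                return 0, err
        }
        return h, nil
}

// release must call ReleaseMutex before CloseHandle: closing the handle
// alone does not give up ownership while the owning process is alive.
func release(h windows.Handle) {
        windows.ReleaseMutex(h)
        windows.CloseHandle(h)
        runtime.UnlockOSThread()
}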

howbazaar commented 6 years ago

How does the fairness fare if we drop the wait from release to acquire to 10ms or 1ms?

jameinel commented 6 years ago

'go test' passes on Windows, and 'go run fair\fair.go' shows the two sides fairly sharing the mutex.

axw commented 6 years ago

Updated to remove the "active" mutex tracking on Windows, since we have to lock to an OS thread anyway, which prevents reentrancy.

jameinel commented 6 years ago

Are you talking about the code in Andrew's manual test case? I tried running 2 clients, each with a 10ms sleep, and on Windows they settled down to both getting 250 acquisitions per 5s. I then changed it to 1 client with a 10ms sleep (as before) and one with a 1ms sleep; they stabilized at 400-450 acquisitions per 5s. So even when one holds the lock longer than the other, they still stay fair. Running 3 clients, 2 at 1ms and 1 at 10ms, they all stayed around 400 acquisitions per 5s.

This is all on Windows.

jameinel commented 6 years ago

Running 3 Windows fair.go processes with a 1ms sleep got about 1500 acquisitions per 5s across all three.

Do we have reasonable tests around behavior when we've decided that we don't need the lock anymore and we get it anyway?

axw commented 6 years ago

@howbazaar it fluctuates quite a lot more, but is still unfair with the old code. With the new code, it's fair either way.

jameinel commented 6 years ago

With the old code and a 1ms sleep (instead of 100ms), I see it bounce between the two: one wanders down to 430 acquisitions, then up to 2825, and then back down again. So it doesn't starve, but it isn't stable. The new code is very close to perfectly fair.

axw commented 6 years ago

Do we have reasonable tests around behavior when we've decided that we don't need the lock anymore and we get it anyway?

I think the test I added (TestLockAfterTimeout) covers this reasonably well. It's non-deterministic in that the second (timed out) acquire may wake up before or after the final acquire has started. I don't think we can force a particular path, but I'm open to ideas of course.

Are there any particular cases where you think we're lacking?
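For reference, the scenario can be sketched outside the test suite like this (my illustration, not a test from the suite, and it assumes the Spec.Timeout field): one holder keeps the lock past a second waiter's timeout, and after the holder releases, a fresh acquire should succeed because the timed-out waiter gives the lock straight back.

package main

import (
        "fmt"
        "time"

        "github.com/juju/mutex"
        "github.com/juju/utils/clock"
)

func main() {
        spec := mutex.Spec{
                Name:  "foo",
                Clock: clock.WallClock,
                Delay: 10 * time.Millisecond,
        }

        // First acquirer takes the lock and holds it.
        holder, err := mutex.Acquire(spec)
        if err != nil {
                panic(err)
        }

        // Second acquirer gives up after a short timeout while the lock
        // is still held.
        timedOut := spec
        timedOut.Timeout = 100 * time.Millisecond
        if _, err := mutex.Acquire(timedOut); err == nil {
                panic("expected the acquire to time out while the lock is held")
        }

        // Release the lock. The timed-out waiter must not have kept it,
        // so a fresh blocking acquire should now succeed.
        holder.Release()
        m, err := mutex.Acquire(spec)
        if err != nil {
                panic(err)
        }
        m.Release()
        fmt.Println("timed-out waiter did not keep the lock")
}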