
runtime: improve scaling of lock2 #68578

Open rhysh opened 1 month ago

rhysh commented 1 month ago

Part of my experience with runtime.mutex values is that process-wide demand for a single mutex can slow the program to a crawl. I don't think that alone will come as a surprise, though the magnitude of slowdown is more than I'd expect. The main surprise is that programs can have a hard time recovering from that performance cliff once they've fallen off.

I've described a case of "larger magnitude than I expected" as #57070. I've described (perhaps indirectly) "performance cliff that defies self-recovery" as #61426, #61427, and #61428. The runtime's ChanContended benchmark shows clear degradation with increasing GOMAXPROCS, which I suspect is the same problem.


Here's what I think is happening:

For some users of runtime.mutex values, the critical section is very fast. When exiting the critical section, unlock2 checks whether another thread has gone to sleep while waiting for the mutex. If unlock2 finds one, it wakes a single lock2 sleeper. While waking the other thread, unlock2 removes that sleeper's note from the mutex. The bookkeeping differs between lock_sema.go and lock_futex.go: popping a single entry off the linked list, or resetting the full state to 0.
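For the futex variant, that wake decision is a single atomic exchange followed by at most one wakeup. Here's a lightly trimmed paraphrase of unlock2 from lock_futex.go as of Go 1.23 (lock-rank and contention-profiling bookkeeping omitted):

```go
// Paraphrase of unlock2 in lock_futex.go; mutex_unlocked=0,
// mutex_locked=1, mutex_sleeping=2.
func unlock2(l *mutex) {
	// One atomic exchange both releases the lock and erases the
	// "someone is sleeping" note: the full state reset to 0.
	v := atomic.Xchg(key32(&l.key), mutex_unlocked)
	if v == mutex_unlocked {
		throw("unlock of unlocked lock")
	}
	if v == mutex_sleeping {
		// Some thread had gone to sleep in lock2: wake exactly one.
		futexwakeup(key32(&l.key), 1)
	}
}
```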

Now, the previous user of the mutex (recent caller of unlock2) is awake, and the lock2 caller that it woke is awake. If the previous user needs the same mutex again, it enters lock2 and we now have two non-sleeping threads attempting to acquire the mutex. Whichever one wins, it will soon call unlock2. If the caller of unlock2 finds that another thread is sleeping in lock2 waiting for this mutex, it will wake that thread.

Given enough time, a caller of lock2 will give up on the procyield/osyield section and go to sleep. But, how long does it take a thread to give up? And how long does it take a thread to complete another pass through the critical section, call unlock2, and wake an additional thread?

It looks like the time a thread in lock2 takes to go to sleep is over 100 times longer than the runtime's fastest critical sections. Preliminary measurements on an M1 MacBook Air with darwin/arm64 show ~6000 ns of spinning between sleeps and ~14 ns for an uncontended channel operation. That means, I think, that nearly all of the threads that need a runtime.mutex for a quick critical section will be awake the whole time. Being awake means repeatedly loading (about 5 times per wake cycle) and attempting CAS on the mutex value.

As more threads demand access to the mutex's cache line, updating it becomes more expensive. Acquiring the mutex involves updating that cache line. Going to sleep and waking sleepers also require updates to that cache line.


BenchmarkChanContended does 100 sends and 100 receives on a buffered channel with no channel-based waiting. That's 200 lock2/unlock2 pairs per "op". Each additional thread allowed by GOMAXPROCS reduces the whole-process throughput (with minor exceptions near the limit of physical cores, of around 1%).
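For reference, the benchmark amounts to roughly the following (my paraphrase of runtime/chan_test.go; the exact source may differ slightly). The buffer is sized so that no send ever waits on a full buffer and no receive on an empty one, leaving the chan's internal runtime.mutex as the only point of contention:

```go
func BenchmarkChanContended(b *testing.B) {
	const C = 100
	// Each worker keeps at most C elements in flight, so with a buffer
	// of C*GOMAXPROCS the channel is never full when a worker sends and
	// never empty when that worker receives.
	c := make(chan int, C*runtime.GOMAXPROCS(0))
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			for i := 0; i < C; i++ {
				c <- 0
			}
			for i := 0; i < C; i++ {
				<-c
			}
		}
	})
}
```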

Below is the behavior I see on an Intel i7-13700H (linux/amd64, 6 P cores with 2 threads each plus 8 E cores). When the process is allowed to use 4 threads, the whole-process throughput is 1/2 of what it was with 1 thread. With 8 threads, throughput is halved again. With 12, it's halved yet again. At GOMAXPROCS=20 the 200 channel operations take an average of 44 µs, so on average there's an unlock2 call every 220 ns, each with a fresh opportunity to wake a sleeping thread.

$ go version
go version go1.23rc2 linux/amd64
$ go test runtime -test.run='^$' -test.bench=ChanContended -test.cpu="$(seq 1 20 | tr '\n' ',')" -test.count=10
```
goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
                 │     -      │
                 │   sec/op   │
ChanContended      3.167µ ± 1%
ChanContended-2    4.723µ ± 6%
ChanContended-3    5.563µ ± 3%
ChanContended-4    6.527µ ± 2%
ChanContended-5    7.848µ ± 4%
ChanContended-6    8.683µ ± 1%
ChanContended-7    11.19µ ± 0%
ChanContended-8    13.96µ ± 1%
ChanContended-9    17.15µ ± 5%
ChanContended-10   20.17µ ± 3%
ChanContended-11   24.14µ ± 0%
ChanContended-12   29.16µ ± 1%
ChanContended-13   32.72µ ± 4%
ChanContended-14   36.23µ ± 0%
ChanContended-15   36.10µ ± 1%
ChanContended-16   37.73µ ± 5%
ChanContended-17   37.59µ ± 4%
ChanContended-18   41.37µ ± 6%
ChanContended-19   42.87µ ± 1%
ChanContended-20   44.36µ ± 2%
geomean            17.43µ
```

And here's the behavior I see on an M1 MacBook Air (darwin/arm64, 4 P cores plus 4 E cores). With 5 threads, throughput is less than half of what the program can do with one thread.

$ go version
go version go1.23rc2 darwin/arm64
$ go test runtime -test.run='^$' -test.bench=ChanContended -test.cpu="$(seq 1 8 | tr '\n' ',')" -test.count=10
```
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
                │     -      │
                │   sec/op   │
ChanContended     2.639µ ± 1%
ChanContended-2   3.303µ ± 6%
ChanContended-3   4.548µ ± 1%
ChanContended-4   5.041µ ± 0%
ChanContended-5   5.788µ ± 1%
ChanContended-6   6.171µ ± 0%
ChanContended-7   6.470µ ± 0%
ChanContended-8   6.405µ ± 0%
geomean           4.829µ
```

Another perspective is to consider how much on-CPU time the process has. The data below show that during 1.78 seconds of wall-clock time, the process's 20 threads were on-CPU for a total of 27.74 seconds within lock2 calls. Those threads were not sleeping!

```
$ go test runtime -test.run='^$' -test.bench=ChanContended -test.cpu=20 -test.count=1 -test.cpuprofile=/tmp/p
goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
BenchmarkChanContended-20          26667             44404 ns/op
PASS
ok      runtime 1.785s

$ go tool pprof -peek runtime.lock2 /tmp/p
File: runtime.test
Type: cpu
Time: Jul 24, 2024 at 8:45pm (UTC)
Duration: 1.78s, Total samples = 31.32s (1759.32%)
Showing nodes accounting for 31.32s, 100% of 31.32s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            27.74s   100% |   runtime.lockWithRank
     4.57s 14.59% 14.59%     27.74s 88.57%                | runtime.lock2
                                            19.50s 70.30% |   runtime.procyield
                                             2.74s  9.88% |   runtime.futexsleep
                                             0.84s  3.03% |   runtime.osyield
                                             0.07s  0.25% |   runtime.(*lockTimer).begin
                                             0.02s 0.072% |   runtime.(*lockTimer).end
----------------------------------------------------------+-------------
```

In short, we have lock2 implementations that appear on the page to let threads sleep but which in practice make them all spin, and spinning threads at least correlate with (and I think also cause) slower lock handoffs.

I think we can do better. I have some ideas on how, but at this point I'd love feedback on whether others agree with my analysis and on the state of the art here today.

CC @golang/runtime

mknyszek commented 1 month ago

Nice analysis! It looks sound to me. In theory there's a latency benefit to spinning, but if you have tons of threads spinning then that can create contention that's much worse than if a few of the threads just went to sleep.

I think it's been a looong time since anyone had a crack at the runtime.mutex code. I'd love to hear your ideas for improving the lock path.

I also wonder if there's a sweet spot on spinning, wherein for very short critical sections it's beneficial to have maybe one or two threads spinning (so some small limit), and the unlock path just wakes another spinning thread, expecting that one of the current spinners will win. It certainly wouldn't be a fair lock, but maybe it would be better?

rhysh commented 1 month ago

Thanks for taking a look, @mknyszek .

Yes, I think that limiting the number of spinning threads to one or two is part of the answer. If we can afford the bookkeeping (and we may need to!), I think the unlock path should be able to detect the presence of a spinning thread and should only wake a sleeping waiter if it knows that no threads are spinning.

An analogy on my mind for how to solve this is a baseball lineup: Most of the threads are "in the dugout", sleeping. One thread is "at bat", maybe doing many lock/unlock pairs in a row. And one thread is "on deck", spinning and ready to jump in when the "at bat" thread releases the mutex for a bit too long to recapture. When a thread calls unlock (it must have been "at bat"), it checks to see if there's a thread "on deck". If so, no further action is needed. If not, it does a wakeup. A new thread that becomes interested in the mutex could start off by spinning (as they do today), and then sleep after a few rounds of disappointment (also as they do today, but with "sleep" now being the improved "in the dugout" version).

I don't know if that's a solution, or if it just pushes the problem back one level. But maybe that's close enough to be a solution in practice? Maybe the "dugout" is an MCS or CLH queue. (MCS seems good, except for requiring a spin during unlock to wait for the second half of a new sleeper's two-part registration?) And maybe threads make a point of going to the end of the queue if they haven't done that for any of their last N acquisitions, in the name of fairness.
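For reference, here's the textbook MCS queue lock in portable Go (my sketch; runtime.Gosched stands in for a real sleep/wake). The unlock-side spin I mentioned is the nil-next wait in unlock:

```go
import (
	"runtime"
	"sync/atomic"
)

// mcsNode is the per-waiter record; in the runtime it might live on the M.
type mcsNode struct {
	next   atomic.Pointer[mcsNode]
	locked atomic.Bool
}

// mcsLock is a single word: the tail of the waiter queue.
type mcsLock struct {
	tail atomic.Pointer[mcsNode]
}

func (l *mcsLock) lock(n *mcsNode) {
	n.next.Store(nil)
	n.locked.Store(true)
	// First half of the two-part registration: swap ourselves in as tail.
	if pred := l.tail.Swap(n); pred != nil {
		// Second half: link ourselves behind our predecessor.
		pred.next.Store(n)
		for n.locked.Load() {
			runtime.Gosched() // a real implementation would sleep here
		}
	}
}

func (l *mcsLock) unlock(n *mcsNode) {
	next := n.next.Load()
	if next == nil {
		// No visible successor; if the tail is still us, we're done.
		if l.tail.CompareAndSwap(n, nil) {
			return
		}
		// A successor swapped in as tail but hasn't linked itself yet:
		// this is the unlock-side spin.
		for next == nil {
			next = n.next.Load()
		}
	}
	next.locked.Store(false) // hand the lock to the successor
}
```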

I'm working from The Art of Multiprocessor Programming. I've got the "revised first edition", which appears to be a 2012 update to a 2008 original. Are MCS and CLH still the best available algorithms for workloads like runtime.mutex has? @aclements , if you know (and you often do)? Or @chabbimilind , I see you've published in this area since 2012. Thanks!

About fairness: In lock_sema.go (GOOS=darwin, windows, and several others), new waiters push themselves onto a stack so the thread at the bottom of the stack will stay there until all of the other threads have had their turn. So maybe unfairness isn't a problem? Except that when the critical section is short, threads wake up very often and so the stack probably gets cleared out on a regular basis.

A thread being able to capture the mutex for short periods of time (rather than always queueing behind other waiting threads) looks important for performance; we wouldn't be able to average ~15 ns per channel operation if the channel's mutex needs to be handed to a different thread each time. But I suspect we also don't want the full amount of unfairness we'd get if the stack in lock_sema.go worked as the code appears to intend.

chabbimilind commented 1 month ago

The basic MCS lock does not allow you to exit from the wait queue. You may be looking for one of these:

- https://www.mcs.anl.gov/~aamer/papers/ppopp17_hmcst.pdf (one-level variant)
- https://www.cs.rochester.edu/u/scott/papers/2002_PODC_nb_timeout.pdf
- https://www.cs.rochester.edu/u/scott/papers/2001_PPoPP_Timeout.pdf

I noticed something in the current https://github.com/golang/go/blob/master/src/runtime/lock_futex.go code, in lock2: why is the l.key == mutex_unlocked condition check in a loop in the code below? If the objective is to make it behave as a test-and-test-and-set lock, I would expect the inner loop to be an if condition, not a loop. Maybe there's a reason why it is the way it is, but I could not easily understand it.

```go
	for i := 0; i < spin; i++ {
		for l.key == mutex_unlocked { // loop?
			if atomic.Cas(key32(&l.key), mutex_unlocked, wait) {
				timer.end()
				return
			}
		}
		procyield(active_spin_cnt)
	}
```
gopherbot commented 1 month ago

Change https://go.dev/cl/601396 mentions this issue: runtime: measure speed of procyield and osyield

rhysh commented 1 month ago

Thanks for the references! It'll take me a little while to digest those. I don't think that lock2 needs to allow timeouts.

It looks like the more advanced mutex implementations also require more than one word of memory (though we can usually borrow some from the M while it's in a lock2 call). There's a mutex in each chan value, so I don't think we can balloon those too much.

Sometimes the limiting factor in an app is a channel's mutex (#57070), but often it's an explicit process-wide singleton like sched.lock or mheap_.lock. I wonder if, for those few special cases, we should consider using an especially scalable / collapse-resistant mutex implementation with a higher memory footprint. Maybe once we have a new single-word lock2 implementation and have used it for a few releases. :)

> why is the l.key == mutex_unlocked condition check in a loop in the code below?

It's been like that since at least the 1.0 release: https://github.com/golang/go/blob/go1/src/pkg/runtime/lock_futex.c#L66

It's also unusual that the access of l.key there isn't an explicit atomic load. I think that's due to amd64 being the main target for production workloads at the time. (See also the netpoller problems from a few years ago, caused by the same habits.) That's another thing I plan to fix / clean up (or to better understand and then document).
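If we did make the load explicit, the inner test would become something like this (hypothetical, not the current code):

```go
	for atomic.Load(key32(&l.key)) == mutex_unlocked {
		if atomic.Cas(key32(&l.key), mutex_unlocked, wait) {
			timer.end()
			return
		}
	}
```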

rhysh commented 1 month ago

I was curious about the relative speed of procyield and osyield, and the results I got on my machines looked odd. I've shared the test code in https://go.dev/cl/601396 . In both lock_futex.go and lock_sema.go, lock2 does 4 calls of procyield(30), 1 call of osyield(), and then attempts to sleep (waiting for another thread to wake it when the lock might be available).
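For context, the whole wait loop in lock_futex.go has this shape (my trimmed paraphrase of the Go 1.23 code; timer and uniprocessor handling omitted):

```go
	// active_spin=4, active_spin_cnt=30, passive_spin=1.
	for {
		// Phase 1: four rounds of busy-waiting, procyield(30) apiece.
		for i := 0; i < active_spin; i++ {
			for l.key == mutex_unlocked {
				if atomic.Cas(key32(&l.key), mutex_unlocked, wait) {
					return
				}
			}
			procyield(active_spin_cnt)
		}
		// Phase 2: one round that instead yields the OS thread.
		for i := 0; i < passive_spin; i++ {
			for l.key == mutex_unlocked {
				if atomic.Cas(key32(&l.key), mutex_unlocked, wait) {
					return
				}
			}
			osyield()
		}
		// Phase 3: mark the mutex as having a sleeper, then futex-sleep
		// until some unlock2 call wakes us.
		v := atomic.Xchg(key32(&l.key), mutex_sleeping)
		if v == mutex_unlocked {
			return
		}
		wait = mutex_sleeping
		futexsleep(key32(&l.key), mutex_sleeping, -1)
	}
```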

On amd64, procyield is the PAUSE instruction, https://www.felixcloutier.com/x86/pause.html . On my "13th Gen Intel(R) Core(TM) i7-13700H", I measure procyield(30) as taking 750 ns on the single-threaded efficiency cores, 2300 ns on the two-way SMT performance cores, and 2200 ns across the machine as a whole.

On arm64, procyield is the YIELD instruction, https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-General-Instructions/YIELD . On my "M1 MacBook Air", I measure procyield(30) as taking 12 ns. (Since macOS can also run amd64 programs I measured with GOARCH=amd64 too, finding procyield(30) to take 13 ns.) On my "Raspberry Pi 5", I measure procyield(30) as taking 16 ns.

Maybe the speed of those instructions changes based on how much bus activity there is? But I'm surprised to see more than a 100x difference between Intel and ARM processor performance there, with lock2 not accounting for that at all.

On my i7 Linux machine, osyield takes between 150 ns and 200 ns (performance vs efficiency cores). On my rpi5 Linux machine, it takes 1100 ns. On my M1 macOS machine, osyield takes 2800 ns.

I'm not surprised by a 5–7x difference in Linux performance on different hardware. I'm not surprised by a 19x difference between how Linux and macOS implement a yield to the OS scheduler. What surprises me is that on my i7 Linux machine, procyield(30) is slower than osyield, that it's slower by 15x, and that, because there are 4 calls to procyield before the one call to osyield, the machine spends 60x more time in procyield than in osyield.

In BenchmarkChanContended, a single channel operation takes around 15 ns. So the lock can be acquired, used, and released over 100 times before an i7 performance core finishes a single procyield(30) call (2300 ns).

At the very least, it's not tuned well for the hardware I happen to have.

Details from benchstat below:

```
goos: darwin
goarch: amd64
pkg: runtime
cpu: VirtualApple @ 2.50GHz
                   │ yield-m1-amd64 │
                   │     sec/op     │
ProcYield/1-8           3.066n ± 0%
ProcYield/10-8          6.961n ± 1%
ProcYield/30-8          13.23n ± 1%
ProcYield/100-8         35.08n ± 0%
ProcYield/1000-8        328.2n ± 1%
OSYield-8               2.789µ ± 1%
geomean                 45.66n

goarch: arm64
cpu: Apple M1
                   │  yield-m1  │
                   │   sec/op   │
ProcYield/1-8        2.234n ± 4%
ProcYield/10-8       5.631n ± 0%
ProcYield/30-8       11.90n ± 1%
ProcYield/100-8      33.80n ± 0%
ProcYield/1000-8     324.6n ± 1%
OSYield-8            2.769µ ± 1%
geomean              40.71n

goos: linux
goarch: amd64
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
                    │   yield-i7   │      yield-i7-e      │      yield-i7-p      │
                    │    sec/op    │   sec/op    vs base  │   sec/op    vs base  │
ProcYield/1-20        70.63n ± 17%
ProcYield/10-20       735.7n ±  9%
ProcYield/30-20       2.209µ ± 17%
ProcYield/100-20      6.143µ ± 23%
ProcYield/1000-20     74.11µ ± 17%
OSYield-20            150.7n ± 10%
ProcYield/1-8                        26.95n ± 0%
ProcYield/10-8                       251.5n ± 0%
ProcYield/30-8                       750.5n ± 0%
ProcYield/100-8                      2.506µ ± 0%
ProcYield/1000-8                     24.96µ ± 0%
OSYield-8                            197.4n ± 1%
ProcYield/1-12                                              82.73n ±  8%
ProcYield/10-12                                             638.2n ± 19%
ProcYield/30-12                                             2.262µ ± 19%
ProcYield/100-12                                            6.135µ ± 23%
ProcYield/1000-12                                           75.06µ ± 19%
OSYield-12                                                  166.0n ± 18%
geomean               1.410µ         630.5n      ? ¹ ²      1.446µ      ? ¹ ²
¹ benchmark set differs from baseline; geomeans may not be comparable
² ratios must be >0 to compute geomean

goarch: arm64
cpu:
                   │ yield-pi5  │
                   │   sec/op   │
ProcYield/1-4        3.023n ± 1%
ProcYield/10-4       7.931n ± 0%
ProcYield/30-4       16.28n ± 0%
ProcYield/100-4      45.50n ± 0%
ProcYield/1000-4     434.4n ± 0%
OSYield-4            1.073µ ± 0%
geomean              44.97n
```
gopherbot commented 1 month ago

Change https://go.dev/cl/601597 mentions this issue: runtime: allow deeper sleep in futex-based mutexes

rhysh commented 1 month ago

I wrote https://go.dev/cl/601597 to implement the idea of "a spinning thread announces itself, which allows unlock2 to skip the wake, which allows lock2 to truly sleep". That sketch (at PS 1) is a bit slower than the current three-state mutex when there's no contention, and it degrades quickly at first.

But then it stops getting slower; "it scales". Its throughput holds steady at 1/4 of the zero-contention speed, even as the number of contending threads increases to 20 (the size of my test machine). In contrast, the performance of the baseline lock_futex.go degrades with each additional contending thread. At 8 threads, the two implementations are equal, and at 20 the performance of the baseline has slowed an additional 3x.

Another benefit that's not shown here: As the number of contending threads increases, nearly all of them are asleep. They're using less electricity and the OS can put them to use running other programs.
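To make the announce-a-spinner idea concrete, here's a self-contained userspace sketch with names of my own invention. The real CL works on runtime.mutex with futexes, which also handle the lost-wakeup races that this sketch papers over with a recheck plus a buffered channel:

```go
package spinlock

import (
	"runtime"
	"sync/atomic"
)

const (
	locked   uint32 = 1 << 0 // the mutex is held
	spinning uint32 = 1 << 1 // one waiter is awake, "on deck"
)

// SpinAnnounceMutex lets at most one waiter spin; the rest sleep.
type SpinAnnounceMutex struct {
	state    atomic.Uint32
	sleepers atomic.Int32
	wake     chan struct{} // stand-in for futexwakeup; capacity 1
}

func NewSpinAnnounceMutex() *SpinAnnounceMutex {
	return &SpinAnnounceMutex{wake: make(chan struct{}, 1)}
}

func (m *SpinAnnounceMutex) Lock() {
	for {
		v := m.state.Load()
		switch {
		case v&locked == 0:
			// Free: take it, preserving any other thread's spinner bit.
			if m.state.CompareAndSwap(v, v|locked) {
				return
			}
		case v&spinning == 0:
			// Held, and nobody is on deck: announce ourselves and spin.
			if m.state.CompareAndSwap(v, v|spinning) {
				for i := 0; i < 100; i++ {
					w := m.state.Load()
					if w&locked == 0 && m.state.CompareAndSwap(w, (w|locked)&^spinning) {
						return // acquired; the spinner slot is free again
					}
					runtime.Gosched()
				}
				// Tired of spinning: release the slot, then loop
				// (likely falling into the sleeping case below).
				for {
					w := m.state.Load()
					if m.state.CompareAndSwap(w, w&^spinning) {
						break
					}
				}
			}
		default:
			// Held with a spinner on deck: sleep in "the dugout".
			m.sleepers.Add(1)
			if m.state.Load()&locked != 0 { // recheck, as futexsleep would
				<-m.wake
			}
			m.sleepers.Add(-1)
		}
	}
}

func (m *SpinAnnounceMutex) Unlock() {
	for {
		v := m.state.Load()
		if !m.state.CompareAndSwap(v, v&^locked) {
			continue
		}
		// The key decision: wake a sleeper only when no spinner is
		// announced; an announced spinner will take the lock itself.
		if v&spinning == 0 && m.sleepers.Load() > 0 {
			select {
			case m.wake <- struct{}{}: // wake one sleeper
			default: // a wakeup is already pending
			}
		}
		return
	}
}
```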

The microbenchmark of course has its limits: it would be easy to increase the argument to procyield, slowing down the spinning thread and making it harder for that thread to acquire the mutex, which better allows the previous holder to capture it, improving throughput at a (hidden) cost of latency. I've held off on any tuning like that. As for macro-benchmarks, the perf_vs_parent builders show movement in several of their results, in both directions. I think it has potential.

Below are the benchstat results of CL 601597 PS 1 versus its parent on my test machine, in text and visual form. In the chart, I've converted the number of microseconds per benchmark iteration into the number of channel operations per second (each ChanContended op is 100 sends and 100 receives). The experiment holds steady around 13.5 million channel operations per second, while the baseline continues to slow down with each additional contending thread (to 4.5 million ops per second at 20 threads).

goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
                │     old     │                  new                  │
                │   sec/op    │    sec/op     vs base                 │
ChanContended      3.147µ ± 0%    3.703µ ± 0%   +17.65% (p=0.000 n=10)
ChanContended-2    4.511µ ± 2%    5.280µ ± 7%   +17.06% (p=0.000 n=10)
ChanContended-3    5.726µ ± 2%   12.125µ ± 2%  +111.75% (p=0.000 n=10)
ChanContended-4    6.574µ ± 1%   13.356µ ± 4%  +103.16% (p=0.000 n=10)
ChanContended-5    7.706µ ± 1%   13.717µ ± 3%   +78.00% (p=0.000 n=10)
ChanContended-6    8.830µ ± 1%   13.674µ ± 2%   +54.85% (p=0.000 n=10)
ChanContended-7    11.07µ ± 0%    13.59µ ± 2%   +22.77% (p=0.000 n=10)
ChanContended-8    13.99µ ± 1%    14.06µ ± 1%         ~ (p=0.190 n=10)
ChanContended-9    16.93µ ± 2%    14.04µ ± 3%   -17.04% (p=0.000 n=10)
ChanContended-10   20.12µ ± 4%    14.12µ ± 1%   -29.80% (p=0.000 n=10)
ChanContended-11   23.96µ ± 2%    14.44µ ± 3%   -39.74% (p=0.000 n=10)
ChanContended-12   29.65µ ± 6%    14.61µ ± 3%   -50.74% (p=0.000 n=10)
ChanContended-13   33.98µ ± 7%    14.69µ ± 3%   -56.76% (p=0.000 n=10)
ChanContended-14   37.90µ ± 1%    14.69µ ± 3%   -61.23% (p=0.000 n=10)
ChanContended-15   37.94µ ± 4%    14.89µ ± 5%   -60.75% (p=0.000 n=10)
ChanContended-16   39.56µ ± 0%    13.89µ ± 1%   -64.89% (p=0.000 n=10)
ChanContended-17   39.56µ ± 0%    14.45µ ± 4%   -63.47% (p=0.000 n=10)
ChanContended-18   41.24µ ± 2%    13.95µ ± 3%   -66.17% (p=0.000 n=10)
ChanContended-19   42.77µ ± 5%    13.80µ ± 2%   -67.74% (p=0.000 n=10)
ChanContended-20   44.26µ ± 2%    13.74µ ± 1%   -68.96% (p=0.000 n=10)
geomean            17.60µ         12.46µ        -29.22%
[chart: channel operations per second versus GOMAXPROCS; the experiment holds near 13.5M ops/sec while the baseline falls to about 4.5M ops/sec at 20 threads]

I read Rust's implementation, https://github.com/rust-lang/rust/blob/1.80.0/library/std/src/sys/sync/mutex/futex.rs . It looks similar in structure to our current lock_futex.go implementation.

It does not do osyield. @prattmic pointed me to https://go.dev/cl/473656, which links to https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752, which describes why it's best to avoid osyield (which on Linux is a call to sched_yield(2)). Maybe we should remove that (as a separate change).

On arm64, Rust's Mutex implements the busy loop pause as ISB SY, https://github.com/rust-lang/rust/blob/1.80.0/library/core/src/hint.rs#L243 . They used to do YIELD as we do today, but stopped at https://github.com/rust-lang/rust/commit/c064b6560b7ce0adeb9bbf5d7dcf12b1acb0c807 . Maybe that's right for Go too (to be measured, and done as a separate change).

gopherbot commented 1 month ago

Change https://go.dev/cl/602176 mentions this issue: runtime: benchmark mutex handoffs

gopherbot commented 1 month ago

Change https://go.dev/cl/604375 mentions this issue: runtime: add direct benchmark of mutex contention

gopherbot commented 1 month ago

Change https://go.dev/cl/606901 mentions this issue: cmd/compile/internal/ssa: add atomic.Xchg8 intrinsic

gopherbot commented 1 month ago

Change https://go.dev/cl/606900 mentions this issue: internal/runtime/atomic: add Xchg8 for amd64