Open rhysh opened 6 months ago
At a glance, this plan seems reasonable to me.
One note on assumed constraints:
We cannot change the size of a runtime.mutex, it needs to remain uintptr.
While it is nice to avoid increasing the size of a mutex, I don't think this needs to be a hard requirement. Taking a look at where we use mutex, it looks like timers and channels probably have the most objects and would thus have the biggest impact. But neither of those structures is tiny, so I don't think it would be the end of the world if they get bigger.
Change https://go.dev/cl/585635 mentions this issue: runtime: prepare for extensions to waiting M list
Change https://go.dev/cl/585639 mentions this issue: runtime: remove GODEBUG=runtimecontentionstacks
Change https://go.dev/cl/585638 mentions this issue: runtime: profile mutex contention during unlock
Change https://go.dev/cl/585636 mentions this issue: runtime: use semaphore structure for futex locks
Change https://go.dev/cl/585637 mentions this issue: runtime: double-link list of waiting Ms
Change https://go.dev/cl/586237 mentions this issue: runtime: improve runtime-internal mutex profile tests
Change https://go.dev/cl/587175 mentions this issue: runtime/pprof: shorten TestGoroutineSwitch
Change https://go.dev/cl/586796 mentions this issue: runtime: split mutex profile clocks
Change https://go.dev/cl/589096 mentions this issue: Revert "runtime: improve runtime-internal mutex profile tests"
Change https://go.dev/cl/589116 mentions this issue: Revert "runtime: double-link list of waiting Ms"
Change https://go.dev/cl/589098 mentions this issue: Revert "runtime: use semaphore structure for futex locks"
Change https://go.dev/cl/589097 mentions this issue: Revert "runtime: remove GODEBUG=runtimecontentionstacks"
Change https://go.dev/cl/589115 mentions this issue: Revert "runtime: profile mutex contention during unlock"
Change https://go.dev/cl/589117 mentions this issue: Revert "runtime: prepare for extensions to waiting M list"
Change https://go.dev/cl/589095 mentions this issue: Revert "runtime: split mutex profile clocks"
Reverted for the Go 1.23 cycle due to #67585
This direction seems right. We should blame the lock wait time on the lock holder (the concept is called blame shifting). It should be easier in Go since there is already a list of lock-waiters in the runtime. It has been done with sampling profilers: https://www.cs.rice.edu/~johnmc/papers/hpctoolkit-ppopp-2010.pdf
It is also a thing in OpenMP runtime tools API: https://www.openmp.org/wp-content/uploads/OMPT-SC13-BOF.pdf
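As a concrete illustration of what blame shifting already looks like for user-level locks in Go (a minimal sketch; the contention, durations, and output file name are all made up for the example): the mutex profile charges the delay to the Unlock call stack of the slow critical section, not to the goroutines that sat in Lock.

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	runtime.SetMutexProfileFraction(1) // record every contention event

	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			time.Sleep(10 * time.Millisecond) // slow critical section
			mu.Unlock()                       // the waiters' delay is charged to this stack
		}()
	}
	wg.Wait()

	// The "mutex" profile's samples point at the Unlock above, i.e. the holder.
	f, _ := os.Create("mutex.pb.gz")
	defer f.Close()
	pprof.Lookup("mutex").WriteTo(f, 0)
}
```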
Hmm. The blame shifting seems less useful than knowing what was blocked where and for how long. Ideally it would report both - what was blocked for how long and what the lock holder was doing - but this feels like nested profiling, which is pretty hard to follow.
Yes, the semantics of the "mutex" profile are to shift blame to the slow critical section. Part of why this didn't make it for the Go 1.23 cycle is that we only have the explicit list of lock waiters for the lock_sema.go platforms, and the extra bookkeeping required to add it for GOOS=linux was an unwelcome slowdown.
It is also useful to know what was blocked and for how long—for blocking that takes place outside of the runtime, that's reported on the "block" profile. Do you have an example of where that would have been useful for runtime-internal blocking, @robaho (maybe an open issue that is/was hard to debug)? If so, I think the next step would be to open a new issue to discuss adding it. I think it's possible, but we'd need to get consensus on whether it's worth the costs.
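For comparison, a similar sketch of the "block" profile mentioned above, which answers "what was blocked, where, and for how long" for blocking outside the runtime (again, the durations and output file are illustrative):

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	runtime.SetBlockProfileRate(1) // record every blocking event lasting at least 1ns

	var mu sync.Mutex
	mu.Lock()
	go func() {
		time.Sleep(20 * time.Millisecond)
		mu.Unlock()
	}()
	mu.Lock() // the ~20ms spent blocked here is charged to this call site in the block profile
	mu.Unlock()

	f, _ := os.Create("block.pb.gz")
	defer f.Close()
	pprof.Lookup("block").WriteTo(f, 0)
}
```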
Fixing this didn't happen for Go 1.23, I'm planning to try again for Go 1.24.
Change https://go.dev/cl/622997 mentions this issue: runtime: unify lock2, allow deeper sleep, control tail
As of Go 1.22, the mutex profile measures contention on internal mutex values, but attributes the contention to the function `runtime._LostContendedRuntimeLock`. This should be improved.

When running with `GODEBUG=runtimecontentionstacks=1`, the mutex profile attributes to the caller of `runtime.unlock` all the delay that the M encountered during its preceding `runtime.lock` call to obtain that lock. This is incorrect: the delay attributed to a `runtime.unlock` call should instead correspond to the total time that the other Ms were unable to proceed because this M was holding the lock.

I have the start of a plan to resolve this, so I'm seeking feedback from @mknyszek and @prattmic (or @golang/runtime or others) on whether I'm on the right track: both in the start of the plan, and on whether I'm working to the right set of constraints. Thank you!
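To make the intended accounting concrete, here is a small sketch of that rule; the type and function names are mine, not the runtime's. The delay charged to an unlock is the total time the other Ms spent unable to proceed while this M held the lock.

```go
package main

import (
	"fmt"
	"time"
)

// waiter records when some M began waiting for the lock.
type waiter struct {
	since time.Time
}

// delayToCharge is the amount an unlocking M should be blamed for: each
// waiter was stalled from max(since, acquired) until now, i.e. only for the
// portion of its wait that overlaps this M's hold of the lock.
func delayToCharge(acquired, now time.Time, waiters []waiter) time.Duration {
	var total time.Duration
	for _, w := range waiters {
		start := w.since
		if start.Before(acquired) {
			start = acquired
		}
		total += now.Sub(start)
	}
	return total
}

func main() {
	now := time.Now()
	acquired := now.Add(-10 * time.Millisecond) // this M held the lock for 10ms
	waiting := []waiter{
		{since: now.Add(-8 * time.Millisecond)}, // waited 8ms under this holder
		{since: now.Add(-4 * time.Millisecond)}, // waited 4ms under this holder
	}
	fmt.Println(delayToCharge(acquired, now, waiting)) // 12ms total
}
```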
The mutex profile for `sync.Mutex` uses fields on the `sudog` to track how many Gs have been waiting and for about how long. The `sudog`s form a linked list of waiters, and are dequeued under the control of the runtime. When waking a G, the caller of `sync.Mutex.Unlock` is able to claim responsibility for the delay that it specifically caused, and to update the remaining waiters' start time so the next `Unlock`er can claim the right amount too. We'll need similar information for the Ms that hold runtime-internal locks.
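A simplified model of that bookkeeping (my own names and types; the real fields live on the runtime's sudog): remember roughly when the current group of waiters started waiting and how many there are, let the waker claim the delay it caused all of them, then restart the clock for the next unlocker.

```go
package main

import (
	"fmt"
	"time"
)

// waiters approximates the sudog bookkeeping for one contended lock: roughly
// when the current group of waiters started waiting, and how many there are.
type waiters struct {
	since time.Time
	count int
}

// wakeOne is called by the unlocker: it claims the delay this holder caused to
// every current waiter, then resets the clock so the next unlocker claims only
// the delay it causes.
func (w *waiters) wakeOne(now time.Time) time.Duration {
	if w.count == 0 {
		return 0
	}
	claimed := time.Duration(w.count) * now.Sub(w.since)
	w.count--
	w.since = now
	return claimed
}

func main() {
	start := time.Now()
	w := &waiters{since: start, count: 3}
	// First unlock happens 5ms in: it delayed 3 waiters for 5ms each.
	fmt.Println(w.wakeOne(start.Add(5 * time.Millisecond))) // 15ms
	// Second unlock 2ms later: it delayed the remaining 2 waiters for 2ms each.
	fmt.Println(w.wakeOne(start.Add(7 * time.Millisecond))) // 4ms
}
```

Each unlocker therefore claims only the delay that accrued while it held the lock, and the per-waiter totals add up across successive unlocks.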
I'd like to say a few assumed constraints out loud:

- `futex(&l.key, FUTEX_WAKE, 1)` can wake any M that was waiting on `l.key`.
- We cannot change the size of a `runtime.mutex`, it needs to remain `uintptr`.
- … (`sync.Mutex`).
- The low bits of a `muintptr` can be used as flags (as seen in lock_sema.go).
- We could add a `sudom` struct, to represent waiting Ms as the `sudog` represents waiting Gs.

For OSes where we use lock_sema.go, an M that needs to block on a `runtime.mutex` first pushes itself onto the head of a linked list (`runtime.mutex.key`, linked via `runtime.m.nextwaitm`), and then sleeps on its own private semaphore. When the current holder of the lock is done, it pops an M off of the list and wakes that specific M. We could add fields to the M for the number of Ms after it on the list, for the time it went to sleep, and for the time that the tail M went to sleep (or a pointer to the tail M). I think that would be sufficient information to match the approximation we do for `sync.Mutex` via `sudog`, and would need only a simple set of invariants to update when pushing and popping Ms to/from this list.
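A rough sketch of those per-M fields and the estimate they would enable (the names are mine, not a real patch; the existing list link is `runtime.m.nextwaitm`):

```go
package main

import "fmt"

// waitingM sketches the extra bookkeeping proposed above for lock_sema.go
// platforms: each waiting M records when it went to sleep, how many Ms are
// queued behind it, and when the tail M went to sleep.
type waitingM struct {
	nextwaitm *waitingM // existing: next M in the wait list
	sleepNano int64     // when this M went to sleep
	tailNano  int64     // when the tail M went to sleep
	nAfter    int64     // number of Ms behind this one
}

// claim estimates the delay the unlocking M caused, in the same spirit as the
// sync.Mutex/sudog approximation: the head waited the longest, the tail the
// shortest, and the Ms in between are assumed to average the two.
func claim(head *waitingM, nowNano int64) int64 {
	headWait := nowNano - head.sleepNano
	tailWait := nowNano - head.tailNano
	return headWait + head.nAfter*(headWait+tailWait)/2
}

func main() {
	head := &waitingM{sleepNano: 0, tailNano: 6, nAfter: 2}
	fmt.Println(claim(head, 10)) // head waited 10, tail waited 4: 10 + 2*(10+4)/2 = 24
}
```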
For OSes such as Linux where we use lock_futex.go, the kernel-controlled wakeup order means we'll need something more.

- `runtime.mutex.key` will point to the waiting M at the head of the linked list (as in lock_sema.go).
- Each waiting M will add itself to that list (in `runtime.lock`).
- Only one M will be popping Ms off of the list (in `runtime.unlock`) OR removing itself from the list (in the bottom half of `runtime.lock`) at a time.
- The lock can still be acquired with the `atomic.Xchg` speculative grab.

This uses space on the waiting Ms to store the necessary information; they're not using it for anything else while they're blocked.
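A much-simplified sketch of that discipline (my own types, not runtime code): it keeps the locked bit and the list head in separate words, where the real `runtime.mutex` would pack both into its single `uintptr`, and it elides the futex sleep/wake, the spin loops, and a waiter's removal of itself from the list.

```go
package sketch

import (
	"sync/atomic"
	"unsafe"
)

// waiterM stands in for an M blocked on a runtime-internal lock; the profiling
// bookkeeping lives on the waiting M itself.
type waiterM struct {
	next      unsafe.Pointer // *waiterM: next waiter in the list
	sleepNano int64          // when this M went to sleep
}

// futexishMutex keeps the locked bit and the waiter-list head in separate
// words for clarity.
type futexishMutex struct {
	locked  int32
	waiters unsafe.Pointer // *waiterM: head of the list, pushed by waiters
}

func (l *futexishMutex) lock(self *waiterM, nowNano func() int64) {
	for {
		// Speculative grab, as in the existing lock_futex.go fast path.
		if atomic.SwapInt32(&l.locked, 1) == 0 {
			return
		}
		// Contended: publish ourselves at the head of the waiter list.
		self.sleepNano = nowNano()
		for {
			head := atomic.LoadPointer(&l.waiters)
			self.next = head
			if atomic.CompareAndSwapPointer(&l.waiters, head, unsafe.Pointer(self)) {
				break
			}
		}
		// ... futex sleep here; after being popped and woken, retry the grab.
	}
}

func (l *futexishMutex) unlock(nowNano func() int64) (claimed int64) {
	// Only the lock holder pops waiters, so it races only with new pushes.
	var wake *waiterM
	for {
		head := (*waiterM)(atomic.LoadPointer(&l.waiters))
		if head == nil {
			break
		}
		if atomic.CompareAndSwapPointer(&l.waiters, unsafe.Pointer(head), head.next) {
			wake = head
			claimed = nowNano() - head.sleepNano // the delay this holder caused the woken M
			break
		}
	}
	atomic.StoreInt32(&l.locked, 0)
	_ = wake // ... futex wake of `wake` here, so that specific M retries the grab.
	return claimed
}
```

The property the bullets rely on is visible here: pushes may race with each other, but only the current holder ever pops, so the bookkeeping stored on each waiting M stays consistent without any extra locking.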
I'll prepare a CL to implement this plan, but if I've gotten some assumptions wrong or it's otherwise clear that this won't work, it may be easier to spot and to discuss it here rather than in the details of code review. Thanks!