runtime: correct attribution of internal mutex contention

rhysh commented 6 months ago

As of Go 1.22, the mutex profile measures contention on internal mutex values, but attributes the contention to the function runtime._LostContendedRuntimeLock. This should be improved.

When running with GODEBUG=runtimecontentionstacks=1, the mutex profile attributes to the caller of runtime.unlock all the delay that the M encountered during its preceding runtime.lock call to obtain that lock. This is incorrect: the delay attributed to a runtime.unlock call should instead correspond to the total time that the other Ms were unable to proceed because this M was holding the lock.

I have the start of a plan to resolve this, so I'm seeking feedback from @mknyszek and @prattmic (or @golang/runtime or others) on whether I'm on the right track — both in the start of the plan, and on whether I'm working to the right set of constraints. Thank you!

The mutex profile for sync.Mutex uses fields on the sudog to track how many Gs have been waiting and for about how long. The sudogs form a linked list of waiters, and are dequeued under the control of the runtime. When waking a G, the caller of sync.Mutex.Unlock is able to claim responsibility for the delay that it specifically caused, and update the remaining waiters' start time so the next Unlocker can claim the right amount too. We'll need similar information for the Ms that hold runtime-internal locks.

I'd like to say a few assumed constraints out loud:

On Linux, futex(&l.key, FUTEX_WAKE, 1) can wake any M that was waiting on l.key.
We cannot change the dynamics of locks on Linux (such as by putting the runtime in charge of picking an M to wake, as it is with the semaphore-based implementation).
We cannot make interactions with an especially contended lock become O(N) with the number of waiters (any more than it is today, such as from the CPU's cache coherency work).
We cannot change the size of a runtime.mutex, it needs to remain uintptr.
An M can hold multiple contended locks at the same time.
It's OK to grow the M struct by several words.
It's OK to approximate the total wait time (as we do with sync.Mutex).
The M struct has some alignment (8 bytes?), so we can use a few of the least significant bits of an muintptr as flags (as seen in lock_sema.go).
An M can be waiting for only a single lock at a time.
There's no appetite for a new sudom struct, to represent waiting Ms as the sudog represents waiting Gs.

For OSes where we use lock_sema.go, an M that needs to block on a runtime.mutex first pushes itself onto the head of a linked list (runtime.mutex.key, linked via runtime.m.nextwaitm), and then sleeps on its own private semaphore. When the current holder of the lock is done, it pops an M off of the list and wakes that specific M. We could add fields to the M for the number of Ms after it on the list, for the time it went to sleep, and for the time that the tail M went to sleep (or a pointer to the tail M). I think that would be sufficient information to match the approximation we do for sync.Mutex via sudog, and would need only a simple set of invariants to update when pushing and popping Ms to/from this list.

For OSes such as Linux where we use lock_futex.go, the kernel-controlled wakeup order means we'll need something more.

The contents of runtime.mutex.key will point to the waiting M at the head of the linked list (as in lock_sema.go).
The list must be doubly-linked so that an M in the middle (if it's the one the kernel chooses to wake) can remove itself.
Any number of Ms may be adding themselves to the list concurrently (in the top half of runtime.lock).
But only one M will be reading/editing wait start timestamps and element counts (in the top half of runtime.unlock) OR removing itself from the list (in the bottom half of runtime.lock) at a time.
Also, if there are waiting Ms the mutex must always reflect that fact, which means removing lock_futex.go's initial atomic.Xchg speculative grab.

This uses space on the waiting Ms to store the necessary information; they're not using it for anything else while they're blocked.

I'll prepare a CL to implement this plan, but if I've gotten some assumptions wrong or it's otherwise clear that this won't work, it may be easier to spot and to discuss it here rather than in the details of code review. Thanks!

prattmic commented 6 months ago

At a glance, this plan seems reasonable to me.

One note on assumed constraints:

We cannot change the size of a runtime.mutex, it needs to remain uintptr.

While it is nice to avoid increasing the size of a mutex, I don't think this needs to be a hard requirement. Taking a look at where we use mutex, it looks like timers and channels probably have the most objects and would thus have the biggest impact. But neither of those structures is tiny, so I don't think it would be the end of the world if they get bigger.