golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.98k stars 17.67k forks source link

sync: shrink types in sync package #37142

Open balasanjay opened 4 years ago

balasanjay commented 4 years ago

Proposal: shrink types in sync package

Current API

The types in the sync package make use of semaphore operations provided by the runtime package. Specifically, they use the following two APIs

semacquire(s *uint32): waits until *s > 0 and atomically decrements it.

semrelease(s *uint32): atomically increments *s and notifies any goroutine blocked in a semacquire (if any).

Each semaphore logically manages a queue of goroutines. The sync types store their atomically-modified state separate from their semaphores; for this reason, semaphores have to support out-of-order operations (e.g. when unlocking a sync.Mutex, you could have a semrelease show up to the semaphore 'before' a corresponding semacquire). The separation into atomic state and semaphores also results in the sync types having a larger memory footprint.

Proposed new API

Let's consider a slightly different API for queueing and dequeueing goroutines.

semqueue(s *uint32, mask uint32, cmp uint32, bit uint8) bool: checks (*s & mask) == cmp; if false, will immediately return false (indicating 'barging'). Otherwise, puts the calling goroutine to sleep; if this goroutine is the first to sleep for the given s and bit, will also perform *s = *s | (1 << bit).

semdequeue(s *uint32, bit uint8, dequeueAll bool) int: wakes up 1 goroutine sleeping on a corresponding semqueue call with a matching s and bit (or all, if dequeueAll is true). If there are no more goroutines sleeping, then will unset the corresponding bit in *s: *s = *s & ^(1 << bit). Returns the number of woken goroutines.

Notes about this API: The names listed above are placeholders (other suggestions welcome).

All of the operations for the given function are atomic (by making use of a lock keyed on s).

Also, the real signatures will likely need to be more complicated to support mutex profiling, and FIFO vs LIFO queueing.

Benefits

Given this kind of API, I believe we could shrink the types in the sync package. This is because this new API only ever modifies a single bit in an atomic word, so all the other bits are available to store atomic state. We can manage multiple logical queues using the same 4-byte atomic word; this will allow it to handle types like sync.RWMutex that have a queue for readers and a queue for writers. Here are the possible savings for reimplementing various types in the sync package:

Type Current Size (bytes) New Size (bytes)
sync.Once 12 4
sync.Mutex 8 4
sync.RWMutex 24 4

These are the ones I'm fairly confident about. I also think we can shrink sync.Waitgroup, and sync.Cond (the former from 12 bytes to 4 bytes, and I think we can shave off 28 bytes from the latter), but I'm less sure of these two as I'm unfamiliar with their implementations.

Also, while this isn't the goal, I think this might also improve the performance of (*sync.RWMutex).Unlock. It currently repeatedly calls semrelease in a loop to wake readers, acquiring and releasing locks for each iteration. The API above offers batch dequeue functionality, which will allow us to avoid a number of atomic operations.

Backwards Compatibility

This is only modifying internal details of the Go runtime, so from that perspective it should be backwards compatible. There are two other considerations here:

If any users are carefully sizing their types to avoid false sharing or to align with cache-line boundaries, changing the size of types in sync could be problematic for them.

There are users who are using go:linkname to directly invoke semacquire and semrelease. See gVisor for an example of this. So, even if we convert all uses in the standard library, we will likely want to have these functions stick around for a little while.

Edits

balasanjay commented 4 years ago

Tagging owners of sync package: @rsc @ianlancetaylor @dvyukov @aclements

networkimprov commented 4 years ago

It would help if any new implementation would also accommodate a future RWMutex.TryRLock(), which is not trivial to implement (unlike TryLock).

Discussion at https://groups.google.com/d/topic/golang-nuts/h_Dm-XtrlOA/discussion

balasanjay commented 4 years ago

@networkimprov That seems entirely orthogonal to this proposal (it could be implemented with the existing internal semaphore APIs or with the proposed new ones). This proposal also doesn't involve any public API changes. It would therefore seem best to file a separate proposal for that.

networkimprov commented 4 years ago

I wasn't suggesting you add a new API in this proposal, but flagging a missing feature of sync.RWMutex which any reimplementation should allow for.

balasanjay commented 4 years ago

Got it, thanks. I believe the implementation I have in mind could accommodate a TryRLock method.

rsc commented 4 years ago

@networkimprov, I don't believe we've decided that TryRLock is a good idea to add at all, so it seems not necessary to condition this proposal on support for it.

This is not a proposal in the sense that it does not create new API. It definitely needs a discussion and decision from the owners of the package (me, @dvyukov, @aclements is a good set).

Removing from the proposal process and changing to NeedsDecision.

dvyukov commented 4 years ago

Are there any significant downsides? It seems that this mostly improve things and don't change public interfaces, if so this like a no brainer to me wrt to decision. Somebody just needs to do it :) I understand it may slightly increase code complexity, or maybe not (comparable complexity to current code). It looks reasonable to me, moreover the complexity is contained within semaphore/mutex. Re false sharing, I don't think we should significantly take this into account. Nobody never promised exact sizes, and there are ways to resolve this and hopefully the careful users have build asserts for their assumptions re implementation details.

balasanjay commented 4 years ago

I'd be happy to implement this, assuming that owners are ok with the plan.

(Since I'm not a fan of chainsaw juggling, I'd want to first write the new implementations in Promela or Pluscal first to make sure they're sound and the APIs proposed above are usable)

I don't know of any significant downsides. I don't have concrete implementations yet, so I don't have an answer for you regarding code complexity. Based on the implementation in my head, I think it'd be comparable.


One question: what is the maximum number of concurrent RLocks we need to support in RWMutex?

I realized that the original implementation I had in mind for RWMutex used up 7 bits for control state, leaving only 25 bits for a count of readers; this would cap the number of concurrent RLocks to 33 million, which feels like it might be too low. So for now, I've increased the RWMutex type up to 8 bytes.

I think its possible to compress the control state down into 5 bits (two queue bits, 1 "writerStarving" bit, and 2 bits that record whether the mutex is unlocked/read-locked/write-locked), which increases the max count to 134 million concurrent RLocks. If that's enough, then we should be able to use 4 bytes again.

dvyukov commented 4 years ago

I don't think we have exact number. I hope nobody blocks more than 32M goroutines on a single mutex :) However, this number seems to be close to max number of goroutine users may use today. And assuming all worker goroutines rlock some mutex and there is 1 episodic writer, this scenario does not look completely impossible. Perhaps it's possible to provide some kind of safe but slow fallback? E.g. make all excessive readers spin and sleep until number of readers drops.

bcmills commented 4 years ago

what is the maximum number of concurrent RLocks we need to support in RWMutex?

Perhaps it's possible to provide some kind of safe but slow fallback?

sync.RWMutex already scales very poorly anyway (#17973), so it is unlikely that many folks are using a large number of concurrent readers, and a safe-but-slow fallback seems fine for now.

(I notice that @balasanjay is the author of CL 215361, so this probably not new information but perhaps useful to other readers of this thread.)

balasanjay commented 4 years ago

I am now pretty sure that 5 bits of control state are enough. I think 134 million concurrent RLocks should probably be enough for everybody.

If anyone disagrees with this, then I think one possible solution is to have readers queue even when the mutex is in read-locked mode. And RUnlocks that see readers queued with no writers queued would wake a single reader. That overflow behavior seems difficult to model without a state-space explosion, though, so I'd rather avoid it. If we do need to avoid it, then representing an RWMutex in two words is probably the simplest solution.

@dvyukov One other downside: the current RLock implementation's fast-path is an AddUint32. I think the new implementation would need to increment the reader count with a CAS-loop (i.e. only increment the reader count if the mutex stays in read-locked mode).

CAFxX commented 4 years ago

Unless this solves actual scalability problems (the benefits section does not list much in this regard, where the only ones listed are apparently using less memory and making the Unlock slow path faster) I would recommend to ensure we don't regress performance of the fast paths.

E.g. making the Unlock slow path a bit faster at the expense of the fast path does not seem like a worthwhile direction.

balasanjay commented 4 years ago

Fair enough; I'll do some work to quantify the difference between an AddUint32-based fast-path and a CAS-loop-based fast path (by just making the current implementation use a CAS-loop).

The fast path is only actually fast when there is no contention, and in that case, I'd expect a Load+CAS to be roughly as fast as an atomic add. FWIW, I did check that both Folly's SharedMutex and Abseil's Mutex types use a Load+CAS for their fast paths.

To be clear, the benefit I'm after is simply to reduce the space used by the sync types; when trying to write cache-line optimized data structures, you don't really have a lot of room to spare and usually are forced to resort to terrible bit-packing hacks. Using 24 bytes for an RWMutex type makes this much harder than necessary.

balasanjay commented 4 years ago

Hmm, I think it might be impossible to make an inlineable fast-path using a Load+CAS. This function is marked as not-inlineable:

func (rw *RWMutex) rlock() {
    rc := atomic.LoadInt32(&rw.readerCount)
    if atomic.CompareAndSwapInt32(&rw.readerCount, rc, rc+16) {
        return
    }

    rw.rlockSlow()
}                                                             

This has a cost of 82, and I can't seem to get it below 80. Also, any actual implementation would need slightly more in here (e.g. a test of "rc & constantMask == constantValue" before the CAS). That's rather unfortunate. I suppose I could just add a hack to the compiler to increase the inlining limit for some functions in the sync package, but that feels somewhat gross.

OneOfOne commented 4 years ago

@balasanjay #17566 and #22310 are slightly related to that last comment.

josharian commented 4 years ago

@balasanjay I think you should hack the compiler temporarily to inline this, in order to make progress. If it turns out the Load+CAS isn't performant enough, no harm done. And if it is a viable path forward, we can focus our attention on getting it past the execrable inliner.

E.g. I just noticed that we're charging a budget of 2 for intrinsic calls, one for the intrinsic and one for the CALLFUNC. But given that they're assumed to be comparable in cost to an arithmetic operation, we should just assign them a cost of 1. I suspect this is an oversight/accident. (Insofar as any of this can be called principled.) Fixing this gets your function inlined, barely. :)

diff --git a/src/cmd/compile/internal/gc/inl.go b/src/cmd/compile/internal/gc/inl.go
index 48c7de327d..4379be18ba 100644
--- a/src/cmd/compile/internal/gc/inl.go
+++ b/src/cmd/compile/internal/gc/inl.go
@@ -321,7 +321,7 @@ func (v *hairyVisitor) visit(n *Node) bool {
                }

                if isIntrinsicCall(n) {
-                       v.budget--
+                       // Treat like any other node.
                        break
                }

I'll plan to mail this as a CL during Go 1.15 and see whether it flies.

In the meantime, please forge ahead using whatever means necessary. :)

balasanjay commented 4 years ago

Alrighty, I benchmarked AddInt32 vs a Load+CAS (both a branch-before variant, and a branch-after variant), and it appears I was wrong to be optimistic:

name              time/op
Add               5.91ns ± 4%
AddCASBranchy     10.4ns ± 8%
AddCASBranchLess  11.0ns ± 3%
Code ``` func BenchmarkAdd(b *testing.B) { foo := int32(0) for i := 0; i < b.N; i++ { nv := atomic.AddInt32(&foo, 2) if nv&1 == 0 { continue } panic("slow path") } _ = foo } func BenchmarkAddCASBranchy(b *testing.B) { foo := int32(0) for i := 0; i < b.N; i++ { ov := atomic.LoadInt32(&foo) if ov&1 == 0 && atomic.CompareAndSwapInt32(&foo, ov, ov+2) { continue } panic("slow path") } _ = foo } func BenchmarkAddCASBranchLess(b *testing.B) { foo := int32(0) for i := 0; i < b.N; i++ { ov := atomic.LoadInt32(&foo) & (^1) if atomic.CompareAndSwapInt32(&foo, ov, ov+2) { continue } panic("slow path") } _ = foo } ```

That's unfortunate. I think I know how to implement RLock with an AddUint32-based fast-path and a fuzzy count of readers, but its not quite as straightforward as what I had in mind.

I'll try to spend some time this weekend writing a model of my idea to validate it.

balasanjay commented 4 years ago

I wrote a model of sync.Mutex using the proposed new API, and simulated all interleavings for 4 concurrent goroutines. I successfully verified safety (specifically that no two goroutines were ever in the critical section at the same time).

I then tried to verify starvation-freedom for the model, but even with 2 goroutines, the model checker quickly found a trivial counter-example. I think this counter-example is an issue even with the current Mutex implementation. Essentially, the sequence of events is:

We're in a loop, so the strong starvation freedom property fails to verify for goroutine W2 (the property I was trying to verify was "if a goroutine starts trying to lock the mutex, then that goroutine will eventually acquire the mutex").

That said, this execution trace seems fairly unrealistic. I don't think we've seen starvation of this sort with the current implementation in practice (which AFAICT, has an exactly analogous starvation event possible).

I tried a weaker starvation freedom property ("if a goroutine starts trying to lock the mutex, then some goroutine will eventually acquire the mutex") and verified that for 3 goroutines. That gives me confidence that there isn't some sequence of events which leaves the mutex in a corrupt state that can never be locked again. I might leave 4 goroutines running overnight (liveness properties are much slower to check).

I'll work on adding read-locks to the model next.

Here's the model so far: rwmutex.tla rendered to pdf

CAFxX commented 4 years ago

W2 tries to lock mutex, finds it locked and decides to enable "starvation mode" for the mutex W1 unlocks mutex (using fast-path unlock CAS)

Isn't this impossible in the current implementation? If starving we will fall back to the slow path, since new will not be 0.

https://github.com/golang/go/blob/6917529cc604bad3b7d67579ca8d569442e3d880/src/sync/mutex.go#L186-L191

balasanjay commented 4 years ago

If the mutex were successfully put into starvation mode, then yep, it would prevent the fast path unlock.

However, what I'm saying is that W2 decides to enable starvation mode but has not actually done so yet. Then, W1 unlocks the mutex (using the fast-path). Then W2 actually tries to CAS in the starvation mode bit; this CAS to enable starvation mode will fail (because W1 unlocked the mutex).

In this trace, the mutex never successfully gets put into starvation mode.

balasanjay commented 4 years ago

Alrighty, I have a model of an RWMutex implementation.

I think it ticks all the boxes (has the same fast paths, handles writer starvation, alternates between writers and batches of readers (also, as a nice-to-have, it has 3 states stored in 2 bits, leaving 1 leftover state for use by a future solution to #17973), and fits in a single uint32 (assuming we're ok with max 133 million readers)). I believe we could implement Mutex in terms of this RWMutex; I think if no one ever calls any of the reader APIs, it'll basically identically to today's Mutex.

One issue I ran into is that it appears that the semdequeue API I proposed above is a bit too simple. In particular, when a writer is handing off a lock to a waiting batch of readers, it wants to atomically switch the state of mutex from "write-locked" to "read-locked", and set the reader-count to the number of dequeued goroutines, and clear the flag indicating that there are readers queued. The original API I proposed above will only do the last one of these.

I think its possible to work around this by letting the reader count go "negative" and try and work with async events again.

But I think its simpler to just split the dequeue API into two to accommodate this use-case cleanly. One is the semdequeueSingle (pretty much the same as what I said before). Dequeues a single goroutine and if its the last one in the queue unsets the waiting bit (we don't need anything fancier for this).

The other is semdequeueAll, which will wake all goroutines, and crucially before unsetting the queue bit, it invokes a passed-in function pointer to compute a new value for the atomic word given the old value and the number of dequeuing goroutines. It takes the output of that function, clears the queue-bit from that, and then CAS-es that final value into place (all of this in a loop, until the CAS succeeds). This additional hook will let us run the necessary logic to atomically wake the goroutines and handoff the mutex to them (it'd be atomic because we're holding the runtime-internal mutex while we do all this, meaning no new readers can queue up while we're doing this). I admit its a tiny bit sketchy to call arbitrary callbacks while holding a runtime lock; if folks prefer, we can just hard-code the behavior we want directly in the runtime (we'll likely need to pass a flag to enable the "RWMutex" dequeue behavior, because I suspect when we get around to migrating WaitGroup or Cond, they'll need slightly different semantics because they probably don't need to store a count in the same way).


I've verified safety properties for the model with 3 readers and 1 writer, with 1 reader and 3 writers, and with 2 readers and 2 writers (which found a number of errors in the model, I should note). Anything bigger looked like it would take too long on my laptop.

I want to run a simulation with 3 readers and 2 writers (after 3 hours of simulation, it had only gotten ~60 steps deep, and I think it'll finally end up being something like 100 steps). I'll try that on my workstation sometime this week, or failing that I guess I'll need to figure out how to make the distributed simulation version work.

I should probably try to optimize the model a bit. I wrote it like pseudocode, so each function is composed of many atomic steps. I bet if I wrote it more like math, I could reduce the number of atomic steps in each function, which might dramatically reduce the state space that needs to be explored.

Updated model files: rwmutex.tla rwmutex.pdf

dvyukov commented 4 years ago

I would say it's OK to have semdequeue to know about rwmutex details and just do the necessary update. It's nice to have it generic, but we also don't have to, it's completely internal api. We could make it accept an "enum mode" parameter to support rwmutex/cond/waitgroup/etc.

balasanjay commented 4 years ago

Cool, 3 readers and 2 writers verified successfully. Also the number of states dramatically increased:

1972646627 states generated, 435337469 distinct states found, 0 states left on queue.

Any other objections to me working on this when the tree's open?

balasanjay commented 4 years ago

@rsc @aclements Gentle ping.

(If this sounds good, I'm hoping to work on it next weekend and have CLs out for Monday)

cristaloleg commented 3 years ago

@balasanjay kindly ping. What is the status? (just a curiosity, no pressure)

balasanjay commented 3 years ago

Thanks for the ping, I completely lost track of this work.

As I recall, I wrote about 20% of the code (some of the runtime-internal logic), before COVID became serious and I dropped this to focus on other more pressing things.

I need to ramp back up on this and dig out that repo. I'll see if I can do that in the next month, so I can catch the beginning of the next release cycle.