CannibalVox opened 1 year ago
cc @golang/runtime
Moving this out of proposal. (In the past we have phrased these kinds of internal changes as proposals, but I think we've stopped doing that as the proposal process became more of an actual process. And given that all the changes here would be internal, I don't see a reason why this needs to go through the proposal review process. This is more about the merits of the implementation anyway.)
In https://github.com/golang/go/issues/54622, the case is laid out that unnecessarily raising threads for a brief boost in workload can have undesirable performance implications. Effectively, this task identifies that the largest extant performance issues in the scheduler today are related to unnecessary thread creation and destruction.
For clarification, the issue is not unnecessary thread creation and destruction, but unnecessary thread wake and sleep. Most programs reach a steady state of thread count fairly quickly (we ~never destroy threads). It is the wakeup of an idle thread and subsequent sleep when that thread has nothing else to do that is expensive.
IIUC, this proposal introduces a wakeup of the syscall thread for every syscall (unless the syscall thread is already running). I suspect that this would result in a significant performance degradation for most programs, even if it improves the tail case for long syscalls.
In #54622, thread sleep is particularly expensive because the Go runtime does so much work trying to find something to do prior to sleep. This proposal wouldn't have that problem; the conditions for the syscall thread to sleep would be much simpler. But I still think the OS-level churn of requiring a thread wakeup (a several microsecond ordeal) just to make any syscall will be a non-starter.
Compatibility
Users often get/set various bits of thread-specific state via syscall.Syscall and having them fetch from a different thread would break those use cases.
That said, the scheduler can migrate goroutines between threads at any time, so I think we could argue this only matters for goroutines that called runtime.LockOSThread. Those would need to make syscalls directly on the calling thread.
IIUC, this proposal introduces a wakeup of the syscall thread for every syscall (unless the syscall thread is already running). I suspect that this would result in a significant performance degradation for most programs, even if it improves the tail case for long syscalls.
The intent was that the threads would be live and waiting on some sort of sync primitive rather than needing to be resumed
Users often get/set various bits of thread-specific state via syscall.Syscall and having them fetch from a different thread would break those use cases.
I would expect syscall.Syscall to be executed on the syscall thread, so for an OS-locked thread, the thread context would all be present in the same place: the syscall thread.
But I still think the OS-level churn of requiring a thread wakeup (a several microsecond ordeal) just to make any syscall will be a non-starter.
Additionally, not to put too fine a point on it, but this is already the plan for syscalls that take longer than a microsecond.
The intent was that the threads would be live and waiting on some sort of sync primitive rather than needing to be resumed
Isn't that the same as putting the thread to sleep, assuming you mean something like a Linux FUTEX_WAIT/FUTEX_WAKE? That's already how the runtime suspends threads (notesleep just blocks on a FUTEX_WAIT): https://cs.opensource.google/go/go/+/master:src/runtime/proc.go;drc=261fe25c83a94fc3defe064baed3944cd3d16959;l=1528?q=proc.go&ss=go%2Fgo
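To make that concrete, here is a minimal Linux-only sketch of the same wait/wake pattern in user code (illustrative only; park/unpark and the demo are made up, and the runtime's real notesleep/notewakeup live in runtime internals):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"syscall"
	"time"
	"unsafe"
)

const (
	futexWait = 0 // FUTEX_WAIT
	futexWake = 1 // FUTEX_WAKE
)

func futex(addr *uint32, op, val uintptr) {
	syscall.Syscall6(syscall.SYS_FUTEX,
		uintptr(unsafe.Pointer(addr)), op, val, 0, 0, 0)
}

// park blocks the calling thread in the kernel until note becomes nonzero.
func park(note *uint32) {
	for atomic.LoadUint32(note) == 0 {
		futex(note, futexWait, 0) // sleep only while *note == 0
	}
}

// unpark publishes the wakeup and wakes one sleeping thread.
func unpark(note *uint32) {
	atomic.StoreUint32(note, 1)
	futex(note, futexWake, 1)
}

func main() {
	var note uint32
	go func() {
		time.Sleep(10 * time.Millisecond)
		unpark(&note)
	}()
	park(&note) // the waker pays the several-microsecond wakeup cost
	fmt.Println("woken")
}
```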
EDIT: Sorry, see Michael's comment, which is more complete.
The intent was that the threads would be live and waiting on some sort of sync primitive rather than needing to be resumed
Could you be more specific about what you mean here? The main options I can think of here are:
1. Busy loop
2. Busy loop with PAUSE instruction
3. Loop calling sched_yield (or equivalent syscall)
4. Block in futex (or other wake-able syscall)
1 and 2 burn CPU continuously (2 slightly more efficiently), 3 burns CPU unless the system is fully loaded, and 4 requires a wake-up (and is what I was referring to).
I understand now. I guess the faster wakeups in Go primitives are due to the fact that the P stays in motion continuously.
It's safe to say that the design as written won't work, then, but that mainly pushes me toward the alternatives. As you identified, waking and sleeping a thread with every syscall is fairly untenable. Go is in a state right now where network communications on Windows have massive performance issues because it does exactly that. Having 1 thread burn CPU per P is unacceptable, but having 1 thread total do it, plus others for short periods at the tail end of a burst, is not. The current situation is fairly dire.
I certainly agree that the bad cases of syscall churn could use improvement. I haven't had a chance to look closely at #58336, but it seems like that provides a good example case.
Sorry to chime in, I'll try to be the voice of others:
Problem statement
My use case is a Cgo call to glfw.SwapBuffers(), which is a wrapper for the C function glfwSwapBuffers(), a common graphics API. When VSync is enabled, it internally blocks to synchronize with the user's window compositor / monitor refresh rate. That brief pause will frequently go beyond 20ns, and then the overhead of context switching / launching a separate thread to maintain GOMAXPROCS (so as not to starve the goroutines) causes a brutal stutter that can skip 1 to 5 frames and is very noticeable in a soft real-time application.
Trying to understand
If I'm understanding the problem correctly, then I'm even more confused, because this is happening on the main thread in an OS-locked goroutine with runtime.LockOSThread(), so nothing else can already run on that thread. There are no other goroutines on that thread that can be starved of work, so what are we yielding to here?
Even then, I think a special case of tolerating one momentarily blocked core is fairly reasonable when you have other cores available. They'll just do a bit more work. I'd understand if that happened to all cores, but one blocked core should be acceptable, and it's so common that I'm surprised it's not handled differently. The workers are work-stealing, aren't they?
Maybe another option is to be able to mark the function as "blocking" and that'd be fine. That's actually desirable for some people. We just need an escape hatch somewhere. There's zero control currently.
Closing words
Anyway, I know soft real-time isn't a priority for the Go team. You'd think garbage collection would be the primary blocker for soft real-time, but it isn't. The GC is great. This single issue with cgo and scheduling is actually what has plagued so many before me: Docker, SQLite, CockroachDB, Dqlite [1], etc.
That brief pause will frequently go beyond 20ns, and then the overhead of context switching / launching a separate thread to maintain GOMAXPROCS (so as not to starve the goroutines) causes a brutal stutter that can skip 1 to 5 frames and is very noticeable in a soft real-time application.
Oof. That sounds frustrating. (I assume by 20ns you meant 20µs?) I encourage you to file a new issue so that your specific case can be discussed in more detail. Having a separate issue filed for this will be useful when looking at scheduler issues holistically.
I will say that I don't think this is going to be a very easy issue to resolve (happy to be wrong, though). There's a fundamental mismatch between the model expected by graphics libraries and the model of execution Go presents. In Go, all goroutines (locked to an OS thread or not) are treated equal and are anonymous. This interacts poorly with graphics libraries that care a lot about which thread does what. LockOSThread makes calling into graphics libraries possible, but it doesn't resolve the mismatch.
FWIW, releasing the P isn't just about maintaining GOMAXPROCS (in fact, it kind of doesn't, if the thread ends up doing a whole bunch of CPU-bound stuff for a long time). It's about being able to schedule goroutines cooperatively. If the P were never released by a goroutine that called into C, then the Go runtime couldn't do a whole bunch of important things (for example, stop all goroutines), because it can't preempt or cooperatively interact with C code. It must be the case that the C code, upon returning to Go, blocks until the Go code is allowed to run again.
If I'm understanding the problem correctly, then I'm even more confused, because this is happening on the main thread in an OS-locked goroutine with runtime.LockOSThread(), so nothing else can already run on that thread. There are no other goroutines on that thread that can be starved of work, so what are we yielding to here?
Even when a goroutine is locked to an OS thread, it can still yield back into the scheduler. What happens when it does that is that it puts itself on its P's run queue. It then starts up another thread and hands its P to that thread to run some other goroutine, then puts its own thread to sleep. This is necessary because LockOSThread introduces a 1:1 relationship between a goroutine and an OS thread. Thus if a goroutine locked to a thread blocks, the whole thread must block.
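A tiny runnable illustration of that 1:1 relationship (time.Sleep stands in for a blocking call like glfwSwapBuffers; everything here is illustrative):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.LockOSThread() // this goroutine now owns the main thread, 1:1

	// A background goroutine keeps making progress: whenever the locked
	// goroutine blocks below, the runtime hands its P to another thread.
	go func() {
		for range time.Tick(4 * time.Millisecond) {
			fmt.Println("worker goroutine still running")
		}
	}()

	for frame := 0; frame < 3; frame++ {
		// Stand-in for a blocking call such as glfwSwapBuffers under
		// VSync: the goroutine (and therefore the whole thread) parks.
		time.Sleep(16 * time.Millisecond)
		fmt.Println("rendered frame", frame)
	}
}
```

While the locked goroutine sleeps, the worker output keeps flowing, which shows the P was handed off to another thread rather than sitting idle.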
I assume by 20ns you meant 20µs?
My mistake, 20µs yes.
There's a fundamental mismatch between the model expected by graphics libraries and the model of execution Go presents. This interacts poorly with graphics libraries that care a lot about which thread does what.
Well, the largest one that I see is that we usually have a single thread with reliable timing for rendering, and then we off-load the heavier computational tasks (fluid simulation, sound, networking, file I/O, etc.) asynchronously onto the remaining cores. Which one does what often doesn't matter much. Go's scheduler would actually improve on a lot of homebrewed schedulers that you see in engines, fully utilising the remaining cores and keeping their workload evenly distributed.
I think that's the fundamental mismatch: Go insists on messing with the main thread. Specifically, a locked thread.
Beyond the 1:1 mapping for the integrity of thread-local storage, it comes with the guarantee that there's nothing else running on that thread. It should be able to leverage this. It's dedicated to that one and only task. If it wants to block, that's fine; let it block. The asynchronous workload is elsewhere, and there are spare Ps for it.
If the P were never released by a goroutine that called into C, then the Go runtime couldn't do a whole bunch of important things (for example, stop all goroutines)
Would it help Go's scheduler if we could hint that a given Cgo call will not mutate Go's memory, nor call back from C to Go?
Because in this case, if it knew that the C call was "safe", the already-blocked goroutine could stay blocked, the remaining goroutines could be stopped, and the GC could happily STW without worrying about mutators. No need for the strange G/M/P dance.
I'm assuming some check is needed in case C returns prior to the GC finishing, but that seems somewhat doable. The conservative approach that C and the GC can't execute concurrently seems overly restrictive here.
I'll add that games are also written with great care to not generate garbage in the hot path. They pace themselves pretty nicely, and I haven't personally seen (with GODEBUG=gctrace=1) any forceful GC due to outpacing the GC and running out of memory. Maybe it could delay running the GC just a bit, until we've returned from C land.
I will say that I don't think this is going to be a very easy issue to resolve
I support the idea of a "simple on the surface" and "complex underneath" language, but having something like Cgo and then no mechanism for C and Go to express what's safe and what isn't makes it hard for them to co-exist. I want Go and C to play nicely together.
"All Go Ms will now consist of a primary thread and a syscall thread."
This used to be my biggest gripe: the FFI cost. I'm hoping Vox's proposal will make it a lot cheaper to call into C without all the Go baggage (thanks to the dedicated thread/syscall threads), and I'm hoping somewhere in that process, someone finds a way that allows C (inside locked OS threads) to block without compromising Go.
Then the performance problem goes away entirely.
Just a note that we should soon have #cgo nocallback support. See #56378. I don't know how much it will help this case.
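For reference, the annotation from #56378 is written in the cgo preamble. A hedged sketch of what that might look like for the glfwSwapBuffers case (assuming the syntax from the proposal, and a pkg-config setup for glfw3):

```go
package glfw

/*
#cgo pkg-config: glfw3
#cgo nocallback glfwSwapBuffers
#include <GLFW/glfw3.h>
*/
import "C"

// swapBuffers wraps glfwSwapBuffers. The nocallback annotation promises
// the runtime that the C function never calls back into Go, which may let
// it skip some cgo re-entry bookkeeping.
func swapBuffers(w *C.GLFWwindow) {
	C.glfwSwapBuffers(w)
}
```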
My use case is a Cgo call to glfw.SwapBuffers(), which is a wrapper for the C function glfwSwapBuffers(), a common graphics API. When VSync is enabled, it internally blocks to synchronize with the user's window compositor / monitor refresh rate. That brief pause will frequently go beyond 20ns, and then the overhead of context switching / launching a separate thread to maintain GOMAXPROCS (so as not to starve the goroutines) causes a brutal stutter that can skip 1 to 5 frames and is very noticeable in a soft real-time application.
This seems unexpected to me: context switches are slow, but they're microseconds slow, not milliseconds slow. If you're on Windows, be aware that Windows launches programs with the default timer granularity of 16.7ms, which applies to native code as well; that could be the issue you're encountering, if SwapBuffers is timing out. You can work around this by making the traditional DLL calls to reduce it to 1ms.
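A minimal Windows-only sketch of that workaround (winmm.dll's timeBeginPeriod/timeEndPeriod are the traditional calls in question; treat this as an assumption-laden example, not a drop-in fix):

```go
//go:build windows

package main

import "syscall"

// winmm.dll's timeBeginPeriod/timeEndPeriod raise and restore the
// system-wide timer resolution.
var (
	winmm           = syscall.NewLazyDLL("winmm.dll")
	timeBeginPeriod = winmm.NewProc("timeBeginPeriod")
	timeEndPeriod   = winmm.NewProc("timeEndPeriod")
)

func main() {
	// Request 1 ms timer granularity for the lifetime of the process,
	// instead of the coarse default; this affects native waits too.
	timeBeginPeriod.Call(1)
	defer timeEndPeriod.Call(1)

	// ... run the render loop here ...
}
```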
@CannibalVox is right and I think I was overly pessimistic in my previous message. The fact that you get multiple frame drops is really significant and you might be running into some performance bug or corner case. I wouldn't expect that from the syscall/cgo slow path to reenter Go, unless the scheduler is really overloaded on CPU-bound goroutines.
@nitrix Please do file a new issue so we can track it. Please also include supporting information, such as an execution trace.
Also, I think I may have introduced some misunderstandings as to how the runtime currently works. I made the mistake of assuming the root cause of your issue, and didn't properly consider the magnitude of the issue, which doesn't match up with my expectations of how the runtime should behave. I tried to clarify below.
Would it help Go's scheduler if we could hint that a given Cgo call will not mutate Go's memory, nor call back from C to Go?
Because in this case, if it knew that the C call was "safe", the already-blocked goroutine could stay blocked, the remaining goroutines could be stopped, and the GC could happily STW without worrying about mutators. No need for the strange G/M/P dance.
It does help some things for sure, but keep in mind that even if the GC stops the world, it does still have to force the C call returning to Go to give up its P so that it blocks before reentering Go code.
I'm assuming some check is needed in case C returns prior to the GC finishing, but that seems somewhat doable. The conservative approach that C and the GC can't execute concurrently seems overly restrictive here.
This is already how it works today: the C code keeps executing until it needs to return back to Go. At that point, the thread checks if it's allowed to run. C code is definitely allowed to execute concurrently with a STW.
I support the idea of a "simple on the surface" and "complex underneath" language, but having something like Cgo and then no mechanism for C and Go to express what's safe and what isn't makes it hard for them to co-exist. I want Go and C to play nicely together.
Agreed. Like I said at the start of this reply, I think the excerpt of mine that you quoted was a bit too pessimistic. In principle, I don't see a reason why the latencies you're seeing should be so high.
This used to be my biggest gripe: the FFI cost. I'm hoping Vox's proposal will make it a lot cheaper to call into C without all the Go baggage (thanks to the dedicated thread/syscall threads), and I'm hoping somewhere in that process, someone finds a way that allows C (inside locked OS threads) to block without compromising Go.
I think earlier in this issue @prattmic and @CannibalVox came to the conclusion that a separate syscall/C thread isn't quite the right approach to improving C/Go interop.
To quote @CannibalVox (emphasis mine):
It's safe to say that the design as written won't work, then, but that mainly pushes me toward the alternatives. As you identified, waking and sleeping a thread with every syscall is fairly untenable.
Most of the cost of cgo comes from the fact that Go code wants to be able to stop C code from returning to Go so it can maintain its own invariants. This requires synchronization on both syscall/cgo enter and exit. Having a second thread to switch to forces an OS-level context switch on each syscall/call to C, with one thread going to sleep so the other one can run. Currently, goroutines have their own stack, and the runtime directly switches from running on the goroutine stack to the thread stack to perform the C call or syscall (it has to anyway, because Go stacks can be really small, since they're growable). Go context switches are orders of magnitude cheaper than OS context switches. Also, having a second thread doesn't change the fact that upon switching back to the "Go" thread, it may need to block until it's OK for Go code to run again.
(Hope is not lost; there are probably ways to make the synchronization cheaper, but it'll take time and effort to explore and implement.)
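To get a feel for the enter/exit cost described above, here is an illustrative micro-benchmark (the C nop function is made up; absolute numbers vary widely by platform):

```go
package main

/*
static int nop(void) { return 0; }
*/
import "C"

import (
	"fmt"
	"time"
)

//go:noinline
func goNop() int { return 0 }

func main() {
	const n = 1_000_000

	start := time.Now()
	for i := 0; i < n; i++ {
		goNop() // plain Go call: no runtime synchronization
	}
	goCost := time.Since(start)

	start = time.Now()
	for i := 0; i < n; i++ {
		C.nop() // each call pays the cgo enter/exit synchronization
	}
	cgoCost := time.Since(start)

	fmt.Printf("plain Go call: %v/op, cgo call: %v/op\n", goCost/n, cgoCost/n)
}
```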
But again, I think what you're experiencing may not be quite so fundamental to the design of Go, but actually just a bug or some case that isn't handled well.
Lastly, I also want to address this part of your comment:
... it comes with the guarantee that there's nothing else running on that thread. It should be able to leverage this. It's dedicated to that one and only task. If it wants to block, that's fine, let it block, the asynchronous workload is elsewhere and there are spare Ps for them.
To be totally clear, this is true of even non-locked goroutines calling into C or into a syscall. Before a goroutine makes a syscall or enters C code, it binds itself to the thread it's currently running on for the duration of the call. Then the aforementioned switch to the thread stack occurs. Nothing is kicking the goroutine off the thread; in fact, the runtime may have to spin up a new thread to run more Go code.
Great clarification. It's on Windows, and it's a proof-of-concept 3D game engine for a fancy non-Euclidean game that has both a C and a Go implementation, to compare ease of implementation and test the performance. Go is doing incredibly well except for this small stutter every other minute.
I'll follow the advice and branch off in its own tracked issue + collect a trace.
Btw, using the dangerous it-shall-not-be-named fastcgo [1] makes whatever stutter is happening with the scheduling/cgo issue completely go away. I think we all know the downsides of using it, but it is a reasonable (albeit unportable and not easily maintainable) solution.
There have been a couple of improvements to Windows committed for Go 1.23 that may apply: improved timer granularity on Windows (might not apply to native code), and some context switch reduction for cgo on unlocked threads (which probably won't apply at all in this case). For the sake of thoroughness, consider running the code on tip to see if there are improvements.
Abstract
Prevent longer-than-microsecond syscalls from causing excessive context-switch churn by eliminating the syscall state altogether. Goroutines will no longer enter a special syscall state when making syscalls or cgo calls. Instead, the syscall will be executed by a separate syscall thread while the original goroutine is in an ordinary parked state. All Go Ms will now consist of a primary thread and a syscall thread.
Background
There are several ongoing issues with scheduler performance related to decisions to scale up or down the number of OS threads (Ms) used for executing goroutines. In https://github.com/golang/go/issues/54622, the case is laid out that unnecessarily raising threads for a brief boost in workload can have undesirable performance implications. Effectively, this task identifies that the largest extant performance issues in the scheduler today are related to unnecessary thread creation and destruction. However, spinning up threads as a result of syscalls can have even more serious performance implications than those identified in the task above:
This usage pattern was recently revealed to be an issue in https://github.com/golang/go/issues/58336, in which it appears that Windows network calls via WSARecv/WSASend are blocking rather than nonblocking. A simple Go network proxy running on Windows will perform thousands of context switches per second due to long calls repeatedly changing what M the program's 2 Gs are run on. It does not do this on other operating systems, where those network calls are nonblocking, which allows the G to return to the P it came from without a new M being provisioned.
Generally speaking, the behavior of spinning up a new thread for the syscall state is always a problem; the Go team has previously chosen to address it by making short stints in the syscall state not engage in this behavior. By doing so, they have separated syscall behavior into three classes:
Proposal
I propose that every M be created with two threads instead of one: a thread for executing Go code and a thread for executing syscalls. When a goroutine attempts to execute a syscall, it will be carried out on the syscall thread while the original goroutine will stay in a completely ordinary parked state. Other goroutines that attempt to carry out syscalls during this time will park while waiting on the syscall thread to become available. Additionally, if there are other Ps with syscall threads that have less traffic, they could choose to steal G’s that have syscall work.
This will ensure that while longer syscalls will occupy shared syscall resources, which may become saturated, they will not cause M flapping or context switching. In an advanced case, syscall thread contention could be used as a metric for P scaling, and that would be much easier to measure and respond to than the situation right now, where long syscalls spin up additional Ms that don’t easily fit into the existing scheduler architecture and must be dealt with after the fact.
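To make the mechanism concrete, here is a rough userspace sketch of the idea (illustrative only; the real change would live inside the runtime, this version is Unix-flavored, and every name below is made up):

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

type syscallReq struct {
	trap, a1, a2, a3 uintptr
	done             chan syscallResp
}

type syscallResp struct {
	r1, r2 uintptr
	errno  syscall.Errno
}

var syscallQ = make(chan syscallReq)

// syscallWorker plays the role of the proposal's per-M syscall thread:
// it is locked to its own OS thread and executes syscalls on behalf of
// callers.
func syscallWorker() {
	runtime.LockOSThread()
	for req := range syscallQ {
		r1, r2, errno := syscall.Syscall(req.trap, req.a1, req.a2, req.a3)
		req.done <- syscallResp{r1, r2, errno}
	}
}

// doSyscall parks the calling goroutine in an ordinary channel wait (no
// special syscall state) while the worker thread performs the call.
func doSyscall(trap, a1, a2, a3 uintptr) (uintptr, uintptr, syscall.Errno) {
	done := make(chan syscallResp, 1)
	syscallQ <- syscallReq{trap, a1, a2, a3, done}
	resp := <-done
	return resp.r1, resp.r2, resp.errno
}

func main() {
	go syscallWorker()
	pid, _, _ := doSyscall(syscall.SYS_GETPID, 0, 0, 0)
	fmt.Println("pid:", pid)
}
```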
Rationale
The biggest problem with syscall and cgo performance today is that the threads created by long syscalls do not have any place within the Go scheduler's understanding of itself. It has a very tightly tuned understanding of how many Ms should be running, and there is no way for it to respond appropriately to a new M suddenly being dumped in the middle of the scheduler, which is what long syscalls do.
Additionally, while moving the P to a new M after a syscall passes the threshold allows the 90% case to perform very well, it also guarantees a context switch in the 10% case, which is often unacceptable. In order to have a guaranteed route for a context-switch-free syscall, we need a route for syscalls to be handled without pulling the existing M away from the P. That means that there must be some sort of dedicated thread for syscalls, somewhere.
Alternatives
Also considered was the idea of a thread pool that lives outside of the M/P/G scheduler architecture and is used to process syscalls. The thread pool would consist of a stack of threads, which would scale between 1 and GOMAXPROCS threads, and a queue of syscall requests. New threads would be added when wait times on the queue passed a certain threshold, and threads would be removed on the garbage collector cadence in the same way items in an ObjectPool are, using a victim list to remove unused threads and eventually spin them down.
While idle threads would make up a much lower percentage of total program resources, and this design is more flexible under syscall contention, it would require much more complicated orchestration. It also has a problem with OS-locked threads, since there is no way to guarantee that the same thread services syscalls for a particular P. This problem could be solved by having syscalls on OS-locked threads be executed inline instead of via the pool (OS-locked threads technically never needed the syscall state, since there are no other waiting Gs when a goroutine is running a syscall), but this would require a much larger scope of changes within the scheduler.
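A rough sketch of that pooled alternative, under the same caveats (illustrative userspace code; the 50µs wait threshold and all names are invented, and the victim-list spin-down is omitted):

```go
package main

import (
	"runtime"
	"sync/atomic"
	"time"
)

var (
	reqs    = make(chan func())
	workers atomic.Int32
)

// worker is one pool thread: locked to its own OS thread, draining the
// shared request queue.
func worker() {
	runtime.LockOSThread()
	for f := range reqs {
		f()
	}
}

// submit runs f on a pool thread, growing the pool when a request waits
// longer than the threshold, up to GOMAXPROCS workers.
func submit(f func()) {
	done := make(chan struct{})
	job := func() { f(); close(done) }
	select {
	case reqs <- job:
	case <-time.After(50 * time.Microsecond): // queue wait threshold
		if int(workers.Add(1)) <= runtime.GOMAXPROCS(0) {
			go worker()
		} else {
			workers.Add(-1) // already at max; just keep waiting
		}
		reqs <- job
	}
	<-done
}

func main() {
	workers.Store(1)
	go worker()
	submit(func() { time.Sleep(time.Millisecond) }) // stand-in syscall
}
```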
Another alternative would be to tune the scheduler to prefer to place goroutines that have recently made long-running syscalls into their own P and avoid spinning it down until some time has passed since the last long syscall. We would then choose not to create a new M during long syscalls in cases the origin P has no additional G’s to serve, even if the syscall extended past the threshold. This has the following downsides:
Compatibility
Because this is a change to an internal system, it would not cause language compatibility issues. Additionally, while performance characteristics for large programs without long-running syscalls would change slightly (and this is most Go programs), adding even a few dozen idle threads would not make a measurable difference in Go performance. On the other hand, an entire class of Go applications would suddenly perform much better, including network-heavy applications on Windows.
Late edit: It just occurred to me that another class of Go programs would perform much worse unless https://github.com/golang/go/issues/21827 is addressed: parking goroutines on OS-locked threads tends to create context switches itself. Alternatively, the very inflammatory title of this issue could be changed, and the syscall state could be used to indicate "I am currently waiting on the syscall thread to work".