golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime,iter: the interaction of `Pull` with `runtime.LockOSThread` is currently buggy, and generally not well-defined #65889

Closed mknyszek closed 4 months ago

mknyszek commented 7 months ago

Problem

The current implementation of iter.Pull in the rangefunc experiment creates a new goroutine and uses an optimized goroutine switch path during iteration.

The problem with this implementation, as it stands, is that it interacts poorly with runtime.LockOSThread. If the goroutine spawned for the iterator (let's call it the iterator goroutine) calls runtime.LockOSThread, first and foremost, that call will not be respected. For example, if the next function is passed to a different goroutine, the runtime will not schedule the iterator goroutine on the thread it was locked to. Bad things might happen. Simultaneously, even in the easy case, if the iterator goroutine always switches with the same calling goroutine on the same thread, then once that iterator goroutine exits, it will take down the current thread with it. This is expected, but it happens before the calling goroutine gets placed back into the scheduler's ownership. The calling goroutine instead stays blocked forever with no hope of ever running again.

One possible solution

The most straightforward fix to this is to fall back to a slower kind of switching of goroutines if either the iterator goroutine or the calling goroutine locks itself to a thread. This slower kind of switching would probably be done with channels, and would have semantics that are equivalent to the existing implementation of iter.Pull. Because at least one of the goroutines is locked to a thread, the overhead of OS context switching also comes into play, so this implementation would likely be dramatically slower.
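A channel-based fallback of this kind could look roughly like the following. This is a hypothetical sketch (pullChan is not a real runtime function, and it simplifies iter.Pull's stop semantics): every next/yield pair costs a channel round trip, and if either side is locked to an OS thread, each switch also pays for an OS context switch.

```go
package main

import (
	"fmt"
	"sync"
)

// Seq mirrors iter.Seq: a push-style iterator function.
type Seq[V any] func(yield func(V) bool)

// pullChan is a hypothetical channel-based fallback for iter.Pull.
// The iterator runs on an ordinary goroutine and every value crosses
// an unbuffered channel, so the scheduler is involved on each switch.
func pullChan[V any](seq Seq[V]) (next func() (V, bool), stop func()) {
	vals := make(chan V)
	done := make(chan struct{})
	var once sync.Once
	go func() {
		defer close(vals)
		seq(func(v V) bool {
			select {
			case vals <- v:
				return true
			case <-done:
				return false // stop was called; end iteration
			}
		})
	}()
	next = func() (V, bool) {
		v, ok := <-vals
		return v, ok
	}
	stop = func() { once.Do(func() { close(done) }) }
	return next, stop
}

func main() {
	next, stop := pullChan(func(yield func(int) bool) {
		for i := 1; i <= 3; i++ {
			if !yield(i) {
				return
			}
		}
	})
	defer stop()
	for v, ok := next(); ok; v, ok = next() {
		fmt.Println(v)
	}
}
```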

This fix, then, has a really unfortunate performance cliff. One could imagine that a UI library (for example) locks a goroutine to a thread because it needs to interact with C libraries that require staying on the main thread. Using iter.Pull on such a goroutine would be a very, very bad idea if this workaround was adopted. The resulting performance regression would be very surprising.

The proposed solution

Discussing with @aclements, @dr2chase, and @cherrymui, we have an alternative fix that avoids the performance cliff. Specifically, we would restrict iterators from being able to call runtime.LockOSThread by panicking if it is called from an iterator goroutine (the runtime handles the iterator goroutine's creation, and thus can enforce this). Meanwhile, the calling goroutine (the caller of next, that is) is fully free to be locked to a thread.

If the iterator is prevented from locking itself to a thread from the iterator goroutine, then it's totally fine for that iterator goroutine to just execute on the locked calling goroutine's thread. That iterator goroutine effectively gets locked to the same thread while it's executing, and we know that it will eventually switch back to the calling goroutine. It's fine for the calling goroutine to send next to another goroutine as well, even if that other goroutine is locked, because it can only execute once the iterator goroutine has ceded control of the thread (by the semantics and implementation of iter.Pull).

One hazard of this approach is that the iterator, running on the iterator goroutine, may care about thread-local state, and threads locked to goroutines tend to be locked precisely because there is some important thread-local state that needs to be preserved. However, if the iterator goroutine really cared about thread-local state, it should call runtime.LockOSThread, which in this case, will panic. That does mean that the iterator is slightly restricted in what it can do, but we feel that it's better for the code author to find that out rather than deal with some subtle issue related to runtime.LockOSThread and local thread state not being what they expect.

Furthermore, we think it's very unlikely that an iterator will want to lock itself to a thread anyway, so the extra restriction is unlikely to matter to the vast majority of Go users. Simultaneously, making this choice lets us keep iter.Pull fast for many more (more realistic) use cases, lets us avoid the pain of maintaining two implementations of iter.Pull, and prevents surprising performance cliffs for users of iter.Pull.

eliasnaur commented 5 months ago

Thank you very much for your detailed comment.

Your section about function rewriting and Go interpreters made me realize that function rewriting is irrelevant for iter.Pull in a basic sense: there's no control flow to rewrite.

Given iter.Pull is implemented with goroutines, the remaining question relevant to this issue is the LockOSThread behaviour.

That is, cases where (1) iter.Pull is required, and (2) the iterator must run on the same locked thread as the caller.

> Maybe try rephrasing how you think of this requirement as: 'On the statement go g(), the function g must run on the same locked thread as the go g() statement'. I'll leave it to the folks that work on the runtime to address whether/how this is doable.

I tried and failed to understand this rephrasing. Can you elaborate? What I'm proposing is not that two goroutines must run on the same locked thread. I'm proposing that next and yield both act as cooperative scheduling points. That is, the thread entering next or yield is the same thread that resumes execution from the return of its counterpart. Calling next from (locked) thread T will see yield return (locked) on thread T and vice versa.

I don't have enough knowledge of the Go runtime to know whether such behaviour is easy to implement, but I presume the runtime scheduler is already involved every time a goroutine sends or receives from a channel, which is what next and yield do.

timothy-king commented 5 months ago

> I'm proposing that next and yield both act as cooperative scheduling points. That is, the thread entering next or yield is the same thread that resumes execution from the return of its counterpart. Calling next from (locked) thread T will see yield return (locked) on thread T and vice versa.

Seems like I misunderstood what you were proposing. I will let others comment on the plausibility of implementing this. The internal coroswitch seems like a very special channel, but I don't know enough details.

I will point out that seq might call yield from another goroutine. So even if next and stop are guaranteed to switch to yield on T when LockOSThread is true, seq may still execute on a different goroutine after yield returns. seq is just some user-provided function, so it could do this. So a caller of iter.Pull(seq) that cares about LockOSThread would need to be aware of some details of the implementation of seq. Maybe good enough?

eliasnaur commented 5 months ago

> > I'm proposing that next and yield both act as cooperative scheduling points. That is, the thread entering next or yield is the same thread that resumes execution from the return of its counterpart. Calling next from (locked) thread T will see yield return (locked) on thread T and vice versa.
>
> Seems like I misunderstood what you were proposing. I will let others comment on the plausibility of implementing this. The internal coroswitch seems like a very special channel, but I don't know enough details.
>
> I will point out that seq might call yield from another goroutine. So even if next and stop are guaranteed to switch to yield on T when LockOSThread is true, seq may still execute on a different goroutine after yield returns. seq is just some user-provided function, so it could do this. So a caller of iter.Pull(seq) that cares about LockOSThread would need to be aware of some details of the implementation of seq. Maybe good enough?

Indeed. On the other hand, without cooperative scheduling, a caller of iter.Pull that cares about LockOSThread would need to be aware of the different behaviour from rangefunc: from the perspective of a LockOSThread user, rangefunc's function rewriting effectively schedules cooperatively between the caller and the iterator.

So it seems to me it's a trade-off between two kinds of sharp edges.

dr2chase commented 4 months ago

With the freeze approaching, we need to lock in an answer to this issue. Building on everyone's input, this post explains what we plan to implement for this release. We're sure this won't please everyone, but we believe this handles the major use cases efficiently, we know how to implement it in the next few weeks, and it's flexible enough that we believe it can be relaxed in the future if necessary.

One thing to keep in mind here is that LockOSThread is a spanner in the works; this is not first-class semantics, this is something that Go does to support thread-linked state in operating systems, which in turn is mostly an artifact of C lacking closures, the resulting proliferation of global state, and the collision of that global state with threads and multiprocessors. Explicit use of LockOSThread gets second-class support here. Iterators over state that requires LockOSThread will either not work with Pull, or will require an extra layer of pushing results back and forth to an additional (locked) goroutine.
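The "extra layer" mentioned above can be sketched like this. lockedValues is a hypothetical helper, not part of any Go API: the body that needs thread-linked OS state runs on its own goroutine, locked to a thread for its whole lifetime, and results are relayed to the consumer over a channel.

```go
package main

import (
	"fmt"
	"runtime"
)

// lockedValues runs produce on a dedicated goroutine that is locked
// to its OS thread, and relays each emitted value to the returned
// channel. The consumer never touches the locked thread directly.
func lockedValues[V any](produce func(emit func(V))) <-chan V {
	out := make(chan V)
	go func() {
		runtime.LockOSThread()
		// Deliberately no UnlockOSThread: when this goroutine exits,
		// the runtime discards the thread along with any per-thread
		// state the producer changed on it.
		defer close(out)
		produce(func(v V) { out <- v })
	}()
	return out
}

func main() {
	for v := range lockedValues(func(emit func(string)) {
		// Real code would manipulate per-thread OS state here
		// (e.g. enter a Linux namespace) before emitting values.
		emit("a")
		emit("b")
	}) {
		fmt.Println(v)
	}
}
```

Exiting the locked goroutine without unlocking is the documented way to dispose of a thread whose state may have been altered.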

The second thing to keep in mind is that Pull iterators, even with coroutines, are slower than plain iteration. Thus, we expect people to use Pull iterators only when necessary. Their primary use case is operations over 2 (or more) iterators, and because of their higher cost, it's likely that these binary (or more-ary) operations will not even be symmetric; they'll iterate normally over one operand, and Pull the others.

TLDR, what we propose to implement is:

At creation of the coro backing the iter.Pull, snapshot the thread lock count and (if non-zero) the thread. Every later coro switch can donate the locked thread across the switch, assuming the thread lock count and (if non-zero) thread match the snapshot; otherwise it throws.

The effect of this is that the caller of iter.Pull may be on a locked thread, as long as it doesn't lock or unlock the thread and then call next (though it may change the lock count transiently). And, likewise, the iterator function may lock or unlock the thread, as long as it doesn't do it around the call to yield.

The advantage of this proposal is that it maintains good performance composability. That is, Pull iterators remain fast in the presence of the two oblivious uses of LockOSThread.

The disadvantage of this proposal is that if someone wishes to write an iterator over some OS-thread state, and that iterator should work properly when converted to a Pull iterator, then either it will not work with Pull at all, or it will need the extra layer described above that pushes results back and forth to an additional (locked) goroutine.

And throw is not great, but this is not a recoverable error; if the thread-locking state isn't right, the runtime is not in a correct state.

thediveo commented 4 months ago

> One thing to keep in mind here is that LockOSThread is a spanner in the works; this is not first-class semantics, this is something that Go does to support thread-linked state in operating systems, which in turn is mostly an artifact of C lacking closures, the resulting proliferation of global state, and the collision of that global state with threads and multiprocessors. Explicit use of LockOSThread gets second-class support here. Iterators over state that requires LockOSThread will either not work with Pull, or will require an extra layer of pushing results back and forth to an additional (locked) goroutine.

*cough* I would like to make you aware that LockOSThread isn't necessarily tied to C; rather, it is essential for dealing with Linux thread (task) state, not least namespace attachments and other state such as the (effective) UID and GID. This is independent of any C closure design or lack thereof; it is OS-level specific, independent of any particular language ecosystem. In the case of UID and GID state, I would assume this applies to other Unix-like OSes with native thread/task support as well.

IIRC, Docker/runc was one of the prominent use cases for introducing LockOSThread when it switched to the Linux namespaces technology (which, IIRC, wasn't the initial implementation).

If anything, you'll need to put up a big, very bright, flashing warning sign in Go's documentation to keep contributors to, say, any of the existing container engine/executor code bases from infecting them with unexpected and difficult-to-diagnose bugs -- calling this "second-class support" is a totally misleading euphemism for the problems this can and probably will cause.

aclements commented 4 months ago

@thediveo , yes, you're right of course that LockOSThread is necessary for many OS interactions, and we apologize for misrepresenting its use cases. I believe the semantics we're moving forward with shouldn't actually cause problems with systems-level code that needs to mix LockOSThread and iter.Pull, though. This approach has much better composability than the original proposal, without causing unpredictable performance cliffs. There are also ways we can relax it if we hear that it's too constrained in practice.

gopherbot commented 4 months ago

Change https://go.dev/cl/583675 mentions this issue: runtime: fix coro interactions with thread-locked goroutines