Closed by aclements 4 years ago
Another solution is to send a signal to the thread to force it to be preempted.
That is, once a thread is stopped, it masks a certain signal, and if some threads running user code refuse to stop, we send that signal to the current process. In runtime.sigtramp, we can force a switch of the goroutine and then mask the signal.
This solution introduces no overhead to the code, at the expense of complicating the runtime design.
The problem with interrupts is that the goroutine can be at an arbitrary address. The interrupted address is probably at a place where we have no data about where the pointers are. We'd either have to record a lot more frame layout/type/liveness information (including registers), or we'd have to simulate the goroutine forward to a safepoint. Both are challenging.
The problem of inserting artificial preemption points is reduced numerical performance.
Yet another solution is to just ignore the problem. User is supposed to insert runtime.Gosched() calls at appropriate points as needed.
btw, isn't the long term GC plan to never stop the world?
It seems to me that we can preempt at a write barrier. So the only loops we are talking about are those that make no writes to the heap and make no function calls. If we think in terms of adding
var b uint8
loop:
	b++
	if b == 0 {
		preemptCheck()
	}
then the normal path through the loop will have two extra instructions (add/beq) where the add may be to either a register or a memory location, depending on overall register pressure. This will be measurable in tight loops but for most cases shouldn't be too bad.
No pointer writes to the heap.
But maybe this is enough? If we can preempt at the first write barrier, then even if we let the mutator run it can't modify anything that the GC cares about. So maybe we just set the write barrier enabled bit and let goroutines run. Once a goroutine sees the write barrier bit (how do we force that?) it can't modify any memory the GC cares about. So it is safe to start the GC without waiting for every goroutine to stop.
The problem of inserting artificial preemption points is reduced numerical performance.
True. That's why I suggested unrolling loops, which AFAIK is a standard solution to this problem.
However, I think the check can be done in just two instructions, even without adding Ian's loop counter. Just CMP the preempt flag and branch if it's set. That branch will almost never be hit, so it will be highly predictable, and the preempt flag should be in the L1, so the check may in fact be extremely cheap.
No pointer writes to the heap. But maybe this is enough?
This certainly reduces the set of programs that have this problem, but I don't think it actually helps with either of the examples I gave, since numerical kernels probably won't have heap pointer writes, and the runtime test that can deadlock the GC certainly has no pointer writes.
This is the reference for doing a GC safepoint at every instruction. It is not something we want to do for x86 much less the various other architectures we need to support. http://dl.acm.org/citation.cfm?id=301652
Most systems I know of do what Austin suggests, unroll the loop and insert a check. No numbers but 8 unrolls seems to be what I recall. Not only is the branch highly predictable but the check does not introduce any dependencies making it close to free on an out of order machine. There have been other approaches such as code plugging and predicated branches but HW has moved on. I had not seen Ian's suggestion.
@aclements,
Currently goroutines are only preemptible at function call points.
To be precise, the preemption check seems to happen only in newstack()?
For example,
package main

import (
	"fmt"
	"runtime"
	"time"
)

func test() {
	a := 100
	for i := 1; i < 1000; i++ {
		a = i*100/i + a
	}
}

func main() {
	runtime.GOMAXPROCS(1)
	go func() {
		for {
			test()
		}
	}()
	time.Sleep(100 * time.Millisecond)
	fmt.Println("hello world")
}
test() is not inlined, and the infinite loop calls test(), but it would not preempt at those calls: because morestack() --> newstack() is never involved, "hello world" would never be printed.
test() is not inlined, but since it is a leaf it is promoted to nosplit, so it doesn't have the preemption check any more. Fixing that sounds much easier than the rest of this bug. Maybe we could forbid nosplit promotion for functions called from loops?
For the reference, the same problem appeared in the HotSpot JVM. I remember two approaches:
1) Around 1997/1998, Rene Schmidt (http://jaoo.dk/aarhus2007/speaker/Rene+W.+Schmidt) implemented the following mechanism: Threads running a tight loop w/o function calls would receive a signal to temporarily suspend them. The runtime then dynamically generated a partial "copy" of the loop instructions the thread was running, from the current PC to the (unconditional) backward branch, except that the backward branch was replaced with a call into the runtime (leading to proper suspension at a safe point). The thread was then restarted with the pc modified such that it would run the newly generated piece of code. That code would run to the end of the loop body and then suspend itself at which point the code copy was discarded and the pc (return address) adjusted to continue with the original code (after the safe point).
This mechanism was ingenious but also insanely complicated. It was abandoned eventually for:
2) A simple test and branch (2 additional instructions) at the end of a loop body which didn't contain any calls. As far as I recall, we didn't do any form of loop unrolling and the performance implications were not significant in the overall picture of a larger application.
My vote would be for 2) to start in loops where @ianlancetaylor's suggestion doesn't work. I suspect that for all but the smallest long-running inner loops (where unrolling might make sense independently), the performance penalty is acceptable.
If this is not good enough, and we don't want to pay the code size cost of unrolling a loop multiple times, here's another idea based on 1): Instead of the backward branch check as in 2) plus unrolling, keep the original loop, and generate (at compile time) a 2nd version of the loop body that ends in a runtime call to suspend itself instead of the backward branch. The code size cost is about the cost of having the loop unrolled twice. When the goroutine needs to run to a safe point, use a signal to temporarily suspend the goroutine, switch the pc to the corresponding pc in the copy of the loop body, continue with execution there and have the code suspend itself at a safe point. There's no dynamic code generation involved, generating the extra loop body is trivial at compile-time, the extra amount of code is less than with loop unrolling, and there's only a little bit of runtime work to modify the pc. The regular code would run at full speed if no garbage collection is needed.
Another idea, similar to @ianlancetaylor's, is to estimate the cost (in ns) of the loop body and only check for required suspension every N passes through that loop body.
Once the compiler can unroll/unpeel, that check-every-N logic can be either inlined after the unrolled body, if unrolling is beneficial, or kept when unrolling makes no sense for that loop body.
Such logic also reads better when using debuggers.
@nightlyone The check-every-N seems more complex than just having two extra instructions (compare and branch). And if unrolling is done, that compare-and-branch is already needed only every N loop iterations (where N is the number of times the loop was unrolled).
@griesemer not sure how the test will avoid atomic loads, that's why I suggested the check-every-N.
So the pseudo-go-code before unrolling would look like this:
loop:
	// loop-body
	if counter > N {
		counter = 0
		if need_stop := atomic.LoadBool(&runtime.getg().need_stop); need_stop {
			runtime.Gosched()
		}
	}
	counter++
	goto loop
after unrolling it would look like

loop:
	// loop-body0
	// loop-body1
	// ...
	// loop-bodyM-1
	if counter > N/M {
		counter = 0
		if need_stop := atomic.LoadBool(&runtime.getg().need_stop); need_stop {
			runtime.Gosched()
		}
	}
	counter++
	goto loop
So the code inserted would be a constant overhead, but still run every N loops.
The load doesn't have to be atomic. The current preemption check in the function prologue isn't atomic.
One tricky bit with the compare and branch is that the actual preemption point needs to have no live registers. Presumably, for performance reasons, we want the loop to be able to keep things in registers, so the code we branch to on a preempt has to flush any live registers before the preempt and reload them after the preempt. I don't think this will be particularly hard, but it is something to consider, since it might affect which stage of the compiler is responsible for generating this code.
@aclements Indeed. Which is why perhaps switching the pc to a 2nd version of the loop body that ends in a safe point might not be much more complex and permit the loop to run at full speed in the normal case.
From the runtime's perspective, I think this would be more complicated because stealing a signal is a logistic pain and we'd have to deal with tables mapping from fast loop PCs to slow loop PCs. The compiler would have to generate these tables. This seems like a very clever plan B, but I think first we should try adding a no-op compare and branch and see if it's actually a problem for dense numerical kernels.
The new SSA register allocator handles situations like this well, keeping everything in registers on the common edges and spill/call/restore on the unlikely case.
The load is not dependent on anything in the loop. Assuming the loop is tight, the value is almost certainly to a location already in the L1 cache, the branch will be highly predictable. Just to make sure the branch predictor doesn't even need to be warmed up, we can make it a backward branch. I would be sort of surprised that if on an out of order machine the cost would even be noticed. That said build and measure is the only way to be sure.
CL https://golang.org/cl/19516 mentions this issue.
It would be great if there were a compiler flag to detect greedy goroutines.
We're rewriting our product from scratch in Go, and one of the more memorable failure modes of the prior implementation was an infinite loop in an open source library, causing lots of customer pain. So we'd love to see this addressed. (Sadly, this fix alone isn't enough, as you ideally still need a way to find goroutines which are spinning and stop them.)
So our use case is maintaining high availability in the face of a number of errantly looping go routines.
Moving to Go1.8Early since we've been getting more problem reports from this lately (though I think they're still all from either testing code or buggy code).
@yaxinlx, I'm not sure what you would want the compiler to report. Most code probably has lots of non-preemptible loops, but they're all trivially short and bounded, so I suspect any compiler output would be too noisy to be useful. I think it would be much better to just fix this issue than to put effort into complex tools for dropping it in users' laps.
I wanted to note an interesting case that hasn't occurred to me until recently: currently we omit the stack growth check from functions with very small frames (the exact value differs between arches), which means we also don't check for preemption in these functions. As a result, even a loop that contains a function call may be non-preemptible if the function it calls has a very small frame. That function call could even be a method call or a function pointer, in which case the compiler has no way to soundly detect that it's calling a function without a preemption check when it's compiling the loop.
Any fix for this issue has to deal with this case. I can think of a few possible solutions, and there may be others: 1) put preemption points in all function prologues or 2) insert preemption points on back-edges conservatively if the compiler can't prove that at least one call in the loop is preemptible. As an extension of 2, we could also 3) insert preemption points in all method prologues, which would increase the number of loops the compiler can prove already contain a preemption point.
CL https://golang.org/cl/31766 mentions this issue.
@rhysh found an interesting example of this problem in the standard library in #17831 : the base64 decoding loop in encoding/base64.Encoding.decode has no preemption points, so if you're base64 decoding something large it can significantly delay GC.
From what I understand, @rhysh's case is particularly interesting since the application is more throughput-sensitive than latency-sensitive, yet the STW delays significantly reduce parallelism and in turn significantly reduce throughput. So, in this case, the overhead of the backwards-branch check would certainly pay for itself in overall application throughput.
Any reason not to handle this as a special case: split it into two loops, where the outer explicitly checks for preemption and the inner runs for few enough iterations that we don't care?
@dr2chase Isn't that effectively the same thing as adding a loop counter?
@dr2chase, by "special case" do you mean specifically encoding/base64? This certainly isn't the only loop like this in std (this same pattern shows up all over encoding/* at least: ascii85.Encode, base32.Encode/decode, base64.Encode, hex.Encode/Decode, etc).
I was imagining modifying the library by-hand, but if there's a dozen instances already, maybe we should reconsider that. I imagine something along the lines of replacing
for i := 0; i < N; i++ { ... }
with
j := 0
for ; j <= N-1000; j += 1000 {
	for i := j; i < j+1000; i++ { ... }
	preemption_check()
}
for i := j; i < N; i++ { ... }
It's slightly more efficient than running a separate counter, not sure we care. There might be some advantage if we have ambitions of SIMD-fying inner loops, since we'd probably do a transformation like that anyhow.
@dr2chase, I feel like we may be getting ahead of ourselves here. If we have the cycles, I would like to try an experiment with the simplest possible approach to this: just have the compiler insert stack bound checks in control flow cycles with no preemptible calls and see how much overhead that really has. @cherrymui thinks this should be an easy thing to try. If we start with that experiment, we can get a better sense of whether more sophisticated techniques are even necessary.
https://go-review.googlesource.com/c/33093/ adds GOEXPERIMENT=preemptibleloops, which enables the compiler to add preemption points to all loops. As of CL 4, this adds preemption points to essentially all loops, even if they already have preemption points. The overhead is quite high on some benchmarks, but it's clear that there's also a lot of headroom available to improve this. The current plan is to commit this to Go 1.8 under the GOEXPERIMENT for diagnostic purposes, and to try to reduce the overhead and enable this by default for Go 1.9.
Some things we can do to improve the overhead (many of which have been discussed above):
We can record whether a function is nosplit or not, including auto-nosplit, in the export data, so I don't see any reason here to disable auto-nosplit. Certainly we should only instrument non-preemptible loops, which is to say, any case where there is a path from L to a branch to L that does not call any non-nosplit functions.
I think signals are problematic both because they are slow and because relying on them will make the Go runtime less portable to different environments.
I continue to believe that the simplest approach with minimal overhead is to introduce a new temporary variable into every non-preemptible loop. Rewrite each backward branch to L to branch instead to L':
L':
	tmp--
	if tmp == 0 {
		tmp = 16384
		if runtime.preempt {
			preempt()
		}
	}
	goto L

(Note the counter must be reset whenever it reaches zero, not only when preemption is requested; otherwise it underflows past zero and the check never fires again.)
That should be a straightforward AST transformation early in the compiler. Let the rest of the compiler decide whether to put tmp into a register or keep it on the stack. Either way it will be faster than a memory load from shared memory.
CL https://golang.org/cl/33093 mentions this issue.
Here's a quick comparison of all-backedges vs call-sensitive (including nosplit) timings, where call-sensitive here is the best-case improvement (w/o unrolling or other clever games) since it treats nosplit functions as doing scheduling checks.
all-backedges (current CL): AMD64, geomean+5%; worst: RevComp+45%, MandelBrot+20%, Gzip+11%. PPC64, geomean+4%; worst: MandelBrot+67%, Fannkuch+16%, Revcomp+9%.
call-sensitive: AMD64, geomean+4.7%; worst: RevComp+32%, MandelBrot+21%, Gzip+11%. PPC64, geomean+2%; worst: MandelBrot+74%, FmtFprintfPrefixedInt+8.2%, FannKuch+6.8%
Not clear if something funky's going on with the PPC box; there was another person logged on, and it has a huge number of processors, but three of the RegexpMatchEasy family sped up, by 1%, 9.8%, and 11.8%. I checked this again with an overnight 100-run test and got the same results.
I think we have some unrolling in our future, but this suggests that the existing GOEXPERIMENT CL is a reasonable place to stop, especially for AMD64.
We can record whether a function is nosplit or not, including auto-nosplit, in the export data, so I don't see any reason here to disable auto-nosplit.
That's fine when calling functions in other packages, but if it's a call to the same package, auto-nosplit isn't determined until long after the pass that inserts preemption points.
@dr2chase, is it convenient to take a look at how tight the generated code is? Not necessarily try to improve it right now, just to get a sense of how much codegen improvement there may be. And would it be easy to try @ianlancetaylor's suggestion?
I think trying @ianlancetaylor's suggestion is not terribly hard, especially if we aren't picky about resetting tmp before each loop, aren't worried about irreducible loops, and do it for all loops. Inserting phi functions for tmp coming from different loops is a minor problem, but it's been solved before.
I think the code quality issue is more a matter of what behavior we induce in the register allocator.
Perhaps we shouldn't reset tmp between loops. If tmp is function-wide and only set at entry to the function, we would hit reasonable preemption points regardless of the loop structure of the function. Otherwise, if there are several loops, we may not hit the threshold in any individual loop, but the function could still run for a long time.
CL https://golang.org/cl/33910 mentions this issue.
Update: Dec14 version of the CL above is a sketch of what we want; it is correct and adequate for GOEXPERIMENT, but I think needs the following changes/enhancements:
the register allocator needs to be improved to "sink" spills down into rarely taken branches when this is possible. The current code to sink spills into loop exit edges is a special case of this.
loop unrolling needs to happen sometimes.
we might want to disable the nosplit optimization in the assembler so that it is easier to exclude cases where there is already a call in the loop. Cherry measured the gains from that optimization and they were not large.
to be more predictable in pause times, we need to estimate the amount of time per loop iteration and decrement the counter in proportion. The compiler can count shortest/long paths in terms of SSA instructions; this may require smarter placement of decrements (which will in turn require more phi functions) but let's not worry about that yet.
the constant used to reinitialize the counter probably needs to be precomputed by the runtime based on the processor where a program is run.
As an example of picking counter size, suppose our goal is to check for interrupts every 50 microseconds. Suppose a 1 GHz processor; 50 usec = 50,000 cycles. A 10-"instruction" loop executed at one IPC should check every 5,000 iterations, and let's say that no loop is shorter than 10 instructions thanks to unrolling; if the counter is initialized to 5,000, then decrements by 1 will obtain the desired check latency. If the processor speed is 2 GHz, initialize the counter to 10,000 (10,000 iterations, each taking 5 ns); if it is 500 MHz, initialize the counter to 2,500. Loops of up to 15 instructions decrement by 1, 15-25 by 2, 25-35 by 3, etc.; that is a compiled-in constant, but the runtime picks the starting value to ensure that checks happen often enough.
This is only to get checks to within an order of magnitude, but not adjusting for loop size would fail by several orders of magnitude. Depending on measured overheads, we might also decide to check more aggressively.
This issue has an interesting interaction with the proposed fix for https://github.com/golang/go/issues/18717: if we don't preempt tight loops, async signals may never be delivered when running Go programs with the C thread sanitizer (because we'll never poll the libc hook).
Could we dedicate an OS thread to poll the async signals when tsan is active?
Wouldn't help. TSAN stores the deferred signal in a thread-local, so it only gets delivered to that same thread.
I feel that we should modify TSAN to do a cgo callback when signals are received instead. Relying on calling certain libc functions to get signal delivered doesn't look very elegant.
I feel that we should modify TSAN to do a cgo callback when signals are received instead. Relying on calling certain libc functions to get signal delivered doesn't look very elegant.
Elegant or not, making the libc call is a more robust fix: it works with the TSAN versions that people have today, without requiring them to rebuild their libc and/or C toolchain from head.
At any rate, this part of the discussion probably belongs on #18717. The reason for the comment here was just to add another data point for the importance of ensuring that loops are preemptable.
FYI, addressed (without explicit mention) in GOEXPERIMENT=preemptibleloops from https://go-review.googlesource.com/c/36206/ . This uses the stack bound instead of a counter, and also fixes a rare bug that could be tickled in the 1.8 GOEXPERIMENT. The stack-bound check turns out to actually be slightly faster, and it also recognizes the reschedule request more quickly.
Another example of this problem, this time in Influxdb: https://groups.google.com/d/topic/golang-dev/PVwDFD7gDuk/discussion. Specifically, this loop is non-preemptible in some situations and has O(n^2) behavior. They observed this delaying the GC's transition from mark 1 to mark 2 for minutes at a time.
Would it make sense to allow the programmer to explicitly mark those goroutines that they suspect might require preemption? The idea would be to (initially at least) focus the optimization efforts only on the goroutines so marked. It might allow for a more aggressive timeline in adding this optimization, as it may take longer for it to become well-polished enough to be applied across all goroutines.
What precisely constitutes a pre-emption point seems to be somewhat subtle and subject to change; but an author of a goroutine that lacks the well-established pre-emption points could conservatively mark it as requiring pre-emption.
CL https://golang.org/cl/46410 mentions this issue.
( Coming from this thread, thanks Chris )
I don't think it's a good idea at all to insert code into tight loops just to allow preemption. It will make Go slower, and the model will be harder to understand and harder to control. This is perhaps an area Go should never "meddle in" if it wants its performance to be comparable with C/C++.
There may be some better solutions:
If Go is not good at this, tell users its shortcomings directly; write it up specifically, perhaps in the Go tour, and do not let users discover it by themselves when they find their Go application thrashing and churning (that is terrible, honestly).
If the runtime finds that a goroutine has not yielded the processor for some long time (configurable by the user), output some error information (the goroutine, or even the hotspot function) so users know immediately and can decide what to do next (fix it at the source-code level, or whatever) instead of sitting there waiting for a long time or even forever (which is fatal and unacceptable for some real-time applications).
Last, I agree with @minux's idea: if possible, change the GC plan to never stop the world, for good. A stop-the-world GC makes some CPU-intensive applications' overall latency degrade badly (detailed proof).
PS:
Recently I implemented a sub-option in go tool trace named "diagreedy" which diagnoses and finds the most CPU-greedy goroutines. This tool has already helped us track down several deeply hidden problems in our Go applications and achieve more stable and shorter GC pause latency by fixing them afterwards. If you are interested in this little tool, there is the repository and the Go proposal issue.
Currently goroutines are only preemptible at function call points. Hence, it's possible to write a tight loop (e.g., a numerical kernel or a spin on an atomic) with no calls or allocation that arbitrarily delays preemption. This can result in arbitrarily long pause times as the GC waits for all goroutines to stop.
In unusual situations, this can even lead to deadlock when trying to stop the world. For example, the runtime's TestGoroutineParallelism tries to prevent GC during the test to avoid deadlock. It runs several goroutines in tight loops that communicate through a shared atomic variable. If the coordinator that starts these is paused part way through, it will deadlock.
One possible fix is to insert preemption points on control flow loops that otherwise have no preemption points. Depending on the cost of this check, it may also require unrolling such loops to amortize the overhead of the check.
This has been a longstanding issue, so I don't think it's necessary to fix it for 1.5. It can cause the 1.5 GC to miss its 10ms STW goal, but code like numerical kernels is probably not latency-sensitive. And as far as I can tell, causing a deadlock like the runtime test can do requires looking for trouble.
@RLH