hnes opened this issue 7 years ago:
If you are interested in this little tool, the repository is here.
Sorry, I might have made a mistake:
Tmax_stw = sum{top max N on-processor time slices}
It should probably be:
Tmax_stw = max{top max N on-processor time slices}
Although I haven't dived into the Go scheduler's code yet, I guess the second is most likely the right answer, because the first algorithm is just too inefficient and is unlikely to be used by the Go scheduler.
But no matter which one is right, the first or the second, it does not fundamentally change the conclusion we made before.
/cc @aclements @RLH
/cc @dvyukov
Thanks a lot, rsc :)
Nice tool.
Moving forward the Go compiler needs to be taught how to preempt tight loops so, while valuable, the usefulness of this tool is short term. The unfortunate downside is that once the tool is in the source base it would need to be supported even though its usefulness will be minimal once the compiler adds loop preemption.
Thanks Richard for your reply :)
Honestly, I don't think this tool would become useless even after the compiler gains a mature, default-on or even always-on loop preemption option. There are several reasons.
The loop preemption approach cannot eliminate all CPU-greedy problems; there are just too many difficult and sticky corner cases that loop preemption cannot handle efficiently right now (or it would pay a very dear price to achieve that: list them all and fix them all efficiently? That is just too expensive, even assuming you could list every corner case).
Even if loop preemption could confidently eliminate all CPU-greedy problems and all the difficult and sticky corner cases, and the option were always enabled in the Go compiler, wouldn't we still need a tool that lets users verify, out of the box, whether their Go binary is working as the compiler promised (that is, "all CPU-greedy problems are solved by loop preemption") and as they intended? And then, if they still find some CPU-greedy hot path, they can optimize it themselves if they want to, or report it as an issue. Wouldn't that be even better?
In my personal view, although loop preemption is a pretty good feature, it should not always be 'on' or even be 'on' by default; it should be 'off' by default even when it is 'mature' (and to what extent can you believe it is mature?). And because loop preemption is a little anti-common-sense and radical, it should not even be a global option; it should be an option with per-source-file granularity (I do not know whether it already is).
Go already has tons of open-source projects and libraries that were developed under the assumption of 'preempt only at function calls'. Something may go wrong when the compiler performs the loop preemption transformation on all of them. That is a little dangerous and risky, as many Go applications play a very important role in their systems.
Under the assumption of 'preempt only at function calls', users can run some high-priority batch jobs without being blocked by the scheduler, in a very fast and low-latency manner. But with the 'loop preemption' option on, such jobs become less efficient and their latency gets worse.
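For instance, a minimal, hypothetical sketch (processBatch and its shape are my own illustration, not from any real project) of the kind of latency-critical step that relies on this behavior:

// Hypothetical sketch: under "preempt only at function calls", this call-free
// inner loop runs to completion without being descheduled; a compiler-inserted
// loop-preemption check would add a small per-iteration cost to it.
func processBatch(items []int64) int64 {
	var sum int64
	for _, v := range items { // no function calls inside: no cooperative preemption point
		sum += v*v + 1
	}
	return sum
}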
So the trade-off of the overhead of 'loop preemption' should be decided by the users themselves: have it or not. (As far as I know, the current implementation of insert_resched_checks is "the simpler the loop, the worse the overhead, that is why this phase is not normally enabled" -- from David Chase, and it may not be an easy job at all to bring an efficient and complete loop preemption feature to users in a short time, one year for instance.)
So, above all, I think it would be better to provide a diagnosis tool that lets users see the status of their goroutines' CPU time slices and optimize their code themselves if they really want to, no matter whether 'loop preemption' is 'off' by default now or 'on' by default in the future. It is usually better to offer users more choices than to limit them to a single one.
The loop preemption approach cannot eliminate all CPU-greedy problems; there are just too many difficult and sticky corner cases that loop preemption cannot handle efficiently right now (or it would pay a very dear price to achieve that: list them all and fix them all efficiently? That is just too expensive, even assuming you could list every corner case).
I'm not sure I understand your claim, or perhaps I don't understand your definition of "cpu greedy problems". Can you give an example of such a problem that loop preemption doesn't solve?
I also wouldn't call the cost "very dear". Our current prototype of loop preemption shows a ~2.5% geomean slow-down across the go1 benchmarks, which are CPU-intensive and have several tight loops. Each release generally speeds up code by more than that just through code generation improvements.
Hi Austin, I think you are underestimating the distance and the difficulty of the gap between a 'local optimization' and a 'perfect and optimal solution' to the "cpu greedy problems" in Go.
The definition of "cpu greedy problems" is:
A CPU-greedy goroutine is a goroutine that only very rarely does anything that would let it be scheduled away from the CPU it already occupies. 'Rarely' here means rarely enough that other goroutines starve a little, or even starve to death.
(this thread has discussed such effects)
The 'greediest goroutine' example may be something like this:
go func() {
	for {
	}
}()
But the 'loop' is only one big, ordinary kind of such problems; how about this one:
_adder := func(c *uint32, depth uint32) uint32 {
	atomic.CompareAndSwapUint32(c, 0, 333)
	atomic.CompareAndSwapUint32(c, 0, 333)
	atomic.CompareAndSwapUint32(c, 0, 333)
	// ....
	/* omitted, because there are */
	/* 100,000 of them :( */
	atomic.CompareAndSwapUint32(c, 0, 333)
	atomic.CompareAndSwapUint32(c, 0, 333)
	atomic.CompareAndSwapUint32(c, 0, 333)
	return atomic.LoadUint32(c)
}
godeadloop := func() {
	var ct uint32
	for {
		ct++
		ct = _adder(&ct, 1)
		ct = _adder(&ct, 1)
		ct = _adder(&ct, 1)
		// ....
		/* omitted, because there are */
		/* 100,000 of them :( */
		ct = _adder(&ct, 1)
		ct = _adder(&ct, 1)
		ct = _adder(&ct, 1)
	}
}
go godeadloop()
go godeadloop()
The intact version of the source code is here.
In this example the problem is not a simple tight loop: _adder itself contains no loop at all, just a huge amount of straight-line code. But it is still pretty CPU-greedy, and this kind of code can make other goroutines starve a little or even starve to death (the whole Go process would probably get stuck for several minutes if an STW happened at the same time).
Of course you could solve this particular problem with a new 'insert_more_resched_checks' pass over the Go AST after our discussion, but that is not the point. The point is that the loop preemption solution we currently use for the CPU-greedy problems has already overlooked this one corner case by mistake; how can we know there are not more corner cases hidden there that we just haven't found yet?
If you really think the loop preemption solution can be the perfect and optimal solution, could you write a formal, strict mathematical proof of it? If you can't, we can only treat it as a suboptimal solution, and if we use a suboptimal solution as if it were the perfect, optimal one, terrible things may happen.
So this is why I still insist on adding a diagnosis tool like 'diagreedy' to Go. You could use it as a verification tool or whatever; it just gives users more information and lets them know more about what is happening deep inside their Go processes. It is a win-win not only for the users but also for the developers.
I also wouldn't call the cost "very dear". Our current prototype of loop preemption shows a ~2.5% geomean slow-down across the go1 benchmarks, which are CPU-intensive and have several tight loops. Each release generally speeds up code by more than that just through code generation improvements.
If you have a large computer cluster worth 100,000,000 dollars, then a 2.5% performance improvement could save you 2,500,000 dollars, and 2,500,000 dollars can do many valuable things :D
For the people who prefer performance, please give them other choices; do not close the door on them. All they need to do is analyze the trace.out that the Go process has dumped on the fly.
'Loop preemption' is a good feature. I think a good way to use it is to enable it on open-source projects we don't trust yet and disable it on our own reliable code, to get better performance and controllability.
(Sorry, I have the impression that you may want to do everything for the users, but that becomes a kind of over-optimization if you do too much of it :p)
The compiler has already been taught to identify problematic loops. I'd suggest that a good next step would be to print out that information when requested, perhaps with -m or -m -m or with a new flag (regardless of whether the code generation will be altered). That is likely to be helpful long term anyway, and it is very likely a small, unobtrusive change. If it is small and safe enough, maybe (but just maybe) it could even be snuck into 1.9.
cc @dr2chase @bradfitz
A very good idea, and I agree with you 👍
The more tools and options to diagnose the possibly problematic code the better.
(But one problem here is that maybe nearly all 'for loop' uses in the sources will be reported as 'problematic'. I don't know the implementation details of the current loop preemption in the compiler, but optimization is often one of the most difficult parts of compiler engineering, because the compiler cannot read the programmer's mind, and nearly every optimization is a suboptimal solution; there is hardly any perfect optimization. This is also why I still insist on providing some external diagnosis tools to help programmers improve their code if they want to.)
I also thought that some code written in Go assembly, like crc32, can be very CPU-intensive and contain 'loops' that insert_resched_checks may not handle yet in the current implementation. (Not quite sure about this. Still, preemption inside Go assembly code may break the semantics of the Go assembler.)
/cc @rsc @aclements @RLH @josharian
It has been nearly 10 days since our first discussion. Maybe we should come to a conclusion about it?
@hnes proposals are reviewed on a schedule. And 10 days is not very long; many decisions take much longer to be made.
Thanks Josh for letting me know :)
proposals are reviewed on a schedule. And 10 days is not very long; many decisions take much longer to be made.
It is pretty reasonable.
Certainly loop preemption in Go 1.10 should make this less necessary, but as noted not completely unnecessary. In addition to straight line code there is also the question of large memmoves, which are in non-preemptible assembly, and maybe other operations like those.
The question is how to expose this in the tool. "go tool trace" is really about the web view. This new flag turns it into a command-line program with a new output format. Can we instead somehow highlight "greedy" goroutines in the web view? Then no new option is needed. Or maybe it should be a third-party separate tool anyway?
@dvyukov, as owner of go tool trace, any thoughts?
Perhaps if we trace when we requested STW/foreachP and when we actually get STW/foreachP, it will give enough info in the web view to understand the bad cases (what's running at the time of the request and for how long).
Perhaps if we trace when we requested STW/foreachP and when we actually get STW/foreachP, it will give enough info in the web view to understand the bad cases (what's running at the time of the request and for how long).
But the sampling could be imprecise, and users would still "need some good luck" to find the real top N greedy goroutines.
Traversing the whole events[] list and using a min-heap to find the precise top-N greedy goroutines has a worst-case time complexity of only about O(n*lg n) (in practice it is much faster than parsing the trace.out binary file).
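To make the idea concrete, here is a minimal sketch (not the actual tool code) of such a one-pass top-N selection with a size-N min-heap; the slice record and its field names are hypothetical simplifications of the real trace events:

package main

import (
	"container/heap"
	"fmt"
)

// slice is a hypothetical, simplified record of one on-processor time slice.
type slice struct {
	goroutineID uint64
	durationNs  int64
}

// sliceHeap is a min-heap ordered by duration, so the root is always the
// shortest slice currently kept.
type sliceHeap []slice

func (h sliceHeap) Len() int            { return len(h) }
func (h sliceHeap) Less(i, j int) bool  { return h[i].durationNs < h[j].durationNs }
func (h sliceHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *sliceHeap) Push(x interface{}) { *h = append(*h, x.(slice)) }
func (h *sliceHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// topN scans the events once, keeping only the N longest slices seen so far.
func topN(events []slice, n int) []slice {
	h := &sliceHeap{}
	for _, ev := range events {
		if h.Len() < n {
			heap.Push(h, ev)
		} else if ev.durationNs > (*h)[0].durationNs {
			heap.Pop(h)
			heap.Push(h, ev)
		}
	}
	return *h
}

func main() {
	events := []slice{{1, 1200}, {7, 50}, {2, 900}, {10, 4000}}
	fmt.Println(topN(events, 3)) // the three greediest slices, in heap order
}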
How do you propose to detect non-cooperative goroutines? The fact that it runs on CPU for a long time does not mean anything, it still can be perfectly cooperative if/when we try to preempt it.
You understand that this is a wrong way to write Go code, right?
do_cpu_staff0() // time cost 200us
runtime.Gosched()
do_cpu_staff1() // time cost 200us
runtime.Gosched()
do_cpu_staff2() // time cost 200us
runtime.Gosched()
Goroutines are preemptive since Go 1.1 and soon they become even more preemptive.
How do you propose to detect non-cooperative goroutines?
You are right and I quite agree with you.
The STW/foreachP should also record the information when it finds non-cooperative goroutines that take too long (say more than 1ms, or a user-configurable threshold) to yield the CPU.
The fact that it runs on CPU for a long time does not mean anything
The more CPU-greedy goroutines there are, the worse the application's overall latency will be.
it still can be perfectly cooperative if/when we try to preempt it.
Yes, but before you choose to "preempt" it, you probably need to know exactly where the problem is, and this is why we may still need the diagnosis of the top N greedy goroutines and the precise source line numbers (as we already discussed in this thread, the loop preemption approach cannot perfectly solve all problems of this kind; think about assembly, for one instance).
You understand that this is a wrong way to write Go code, right? do_cpu_staff0() -- time cost 200us runtime.Gosched() do_cpu_staff1() -- time cost 200us runtime.Gosched() do_cpu_staff2() -- time cost 200us runtime.Gosched() Goroutines are preemptive since Go 1.1 and soon they become even more preemptive.
Have a look at this. Although it is somewhat hacky, it still means something: with this piece of code, on Go 1.8.3 the STW will be stuck nearly forever.
The preemption is only relative.
So, how about we combine both of them?
1. The STW/foreachP records the information when it finds non-cooperative goroutines that take too long (say more than 1ms, or a user-configurable threshold) to yield the CPU when STW/foreachP wants to stop the world (maybe it could even dump a warning log to stdout or to a log file while the application is running? That would be very useful in production environments).
2. The diagnosis of the top N greedy goroutines that we already discussed previously.
The more CPU-greedy goroutines there are, the worse the application's overall latency will be.
No. What matters is only if a goroutine is not cooperative.
Yes, but before you choose to "preempt" it, you probably need to know exactly where the problem is
No. The runtime preempts all currently running goroutines.
No. What matters is only if a goroutine is not cooperative.
Agree with you. Maybe this is a verbal ambiguity: the phrase "cpu greedy goroutines" I used here does imply that the goroutine is less cooperative or completely non-cooperative (an example is the function godeadloop0 defined in this proposal's code example).
No. The runtime preempts all currently running goroutines.
More precisely: "the Go runtime always tries to preempt all currently running goroutines." But if the goroutines never give the runtime a single chance, the runtime will never successfully preempt even one of them, and that is why we want this useful diagnosis tool to optimize our code :)
the phrase "cpu greedy goroutines" I used here does imply that it's less cooperative or completely no cooperative (a example is the go function godeadloop0 defined in this proposal code example).
So, how do you propose to detect these 'completely non-cooperative' goroutines?
So, how do you propose to detect these 'completely non-cooperative' goroutines?
In fact I have thought about this before, and here is perhaps an appropriate solution:
t1=12 t2=11 t3=10
| | | |
__________________________________________________
| | | |
---------------------------------------------------> time line
P0 [------go 1------][-go 7-][-----------------][-go 6-][-go 4-]
P1 [-go 3-][-go 4-][------------go 2-----------][-go 1-][-go 2-]
P2 [-go 4-][-go 3-][-go 8-][-------------------][-----go 10-----]
|
|---->---->---->---->---->|
| |
STW start release
STW:
	timeout := 1ms
	max_total_time := 2s
	total_time := 0
	set try_to_stw_flag
	for _, p = range [P0, P1, P2] {
		locked, timecost := mutex_lock_timeout(p, timeout)
		total_time += timecost
		for locked is nil {
			scan [P0,P1,P2]'s currently running user goroutines' info
			dump it out // to trace struct || some log fd
			if total_time >= max_total_time {
				panic("stw take too long")
			}
			locked, timecost = mutex_lock_timeout(p, timeout)
			total_time += timecost
		}
	}
A little correction.
STW:
	timeout := 1ms
	max_total_time := 2s
	total_time := 0
	// ++ do not allocate the idle Processors to ready goroutines anymore
	set try_to_stw_flag
	for _, p = range [P0, P1, P2] {
		locked, timecost := mutex_lock_timeout(p, timeout)
		total_time += timecost
		for locked is nil {
			// ++ maybe it is hard to get the details (including the func call stack)
			// ++ of a running goroutine dynamically?
			scan [P0,P1,P2]'s currently running user goroutines' info
			dump it out // to trace struct || some log fd
			if total_time >= max_total_time {
				panic("stw take too long")
			}
			locked, timecost = mutex_lock_timeout(p, timeout)
			total_time += timecost
		}
	}
	// ++
	unset try_to_stw_flag
	// stw succeeded
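For reference, Go's sync.Mutex has no timed acquire, so mutex_lock_timeout above is purely hypothetical; one way it could be approximated in ordinary Go, sketched here only for illustration, is a channel-based lock:

package sketch

import "time"

// timedLock is an illustration-only, channel-based lock with a timed acquire,
// approximating the hypothetical mutex_lock_timeout used in the pseudocode above.
type timedLock struct{ ch chan struct{} }

func newTimedLock() *timedLock {
	l := &timedLock{ch: make(chan struct{}, 1)}
	l.ch <- struct{}{} // a token in the channel means "unlocked"
	return l
}

// lockTimeout reports whether the lock was acquired before the timeout expired,
// together with how long the attempt took.
func (l *timedLock) lockTimeout(d time.Duration) (bool, time.Duration) {
	start := time.Now()
	select {
	case <-l.ch:
		return true, time.Since(start)
	case <-time.After(d):
		return false, time.Since(start)
	}
}

func (l *timedLock) unlock() { l.ch <- struct{}{} }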
It seems that we just need to trace when we requested STW/foreachP and when we actually get STW/foreachP. The rest of info is already in the trace and can be inferred in the trace command (namely, what goroutines were running at the time of the request).
It seems that we just need to trace when we requested STW/foreachP and when we actually get STW/foreachP. The rest of info is already in the trace and can be inferred in the trace command (namely, what goroutines were running at the time of the request).
Bravo! And that would make it much more convenient to implement our new tool :D
OK, so it sounds like we should add that trace info to the trace dump file. But I still don't see what we're going to do to expose it, unless the STW is a different clearly marked bar already. What will it look like? A special command-line mode people need to know about is not discoverable enough.
We currently record mark termination STW, but somehow forgot about sweep termination STW. I've got a CL to add this for computing MMUs, but it would help with this analysis, too.
Change https://golang.org/cl/55411 mentions this issue: runtime,cmd/trace: trace GC STW events
OK, so after CL 55411, @aclements, is there anything else for this issue? Should we at least mark it proposal-accepted?
OK, so after CL 55411, @aclements, is there anything else for this issue?
In terms of infrastructure support, I believe CL 55411 is sufficient.
Should we at least mark it proposal-accepted?
I certainly hope we can make this analysis simply unnecessary, but it could be useful in the interim and in flushing out bugs. Presentation questions aside, it should be relatively straightforward to implement a robust version on top of the events added by CL 55411.
Marking Proposal-Hold until there's a clearer explanation of what the exposed UI is.
(This may also be helpful for working around issue #10958 in a quite different, but probably better and cleaner, way. Coming from this thread; many thanks to the participants of this discussion.)
Hi all :)
I have recently implemented a sub-option in go tool trace named "diagreedy", which means "diagnose and find out the greediest several goroutines". This tool has already helped us track down several deeply hidden problems in our Go applications and achieve more stable and shorter GC pause latency after fixing them.
Terminology
Big Picture
In this diagram, the value of N is 3 (i.e. GOMAXPROCS), and the top 3 longest on-processor time slices are t1, t2, t3, corresponding to goroutines 1, 2 and 10.
The maximum time cost of the pre-GC Stop-the-World is named Tmax_stw and is the sum of t1, t2 and t3.
Tgcwork is the time cost of the actually effective garbage collection work after the Stop-the-World phase. Tgcwork is determined by the number of allocated objects and the complexity of the relationships among them.
Tmax_stw and Tgcwork together determine one typical GC pause cost. As for Tgcwork, we can optimize it by reusing objects (via sync.Pool, for example) and by many other methods, but we do not discuss Tgcwork further here; we concentrate on Tmax_stw.
We can plainly draw this conclusion: Tmax_stw = sum{top max N on-processor time slices}.
For some CPU-intensive applications, the top max N on-processor time slices are often very long (without very careful task splitting), especially when our project includes tons of other, heavily used open-source projects, which makes it more complex and harder to trace.
We already have tools to sample the top n CPU usage of functions, but the accumulated CPU usage of a function and the longest top n time slices are not the same thing at all. For example, suppose we have 1000 different CPU-intensive goroutines, something like the sketch below:
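(The original snippet is not shown here; what follows is only a hypothetical reconstruction, with do_cpu_staff standing in for roughly 1 ms of uninterrupted CPU work.)

for i := 0; i < 1000; i++ {
	go func() {
		for {
			do_cpu_staff() // each call burns ~1ms of CPU without yielding in between
		}
	}()
}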
Our Tmax_stw would become GOMAXPROCS*1 ms. That is terrible, because sometimes we set GOMAXPROCS to a pretty big value (16 or even 36, for instance). With top n CPU usage sampling, we might not find the sources of the problem easily. But with the 'top max N on-processor time slices' approach, we can track down the sources of the long GC pause immediately and fix them like this:
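(Again, only a hypothetical sketch of the kind of fix meant: split the work into ~0.2 ms chunks and yield in between, in the same spirit as the do_cpu_staff0/Gosched fragment quoted earlier in this thread.)

for i := 0; i < 1000; i++ {
	go func() {
		for {
			do_cpu_staff_part() // ~0.2ms of work per call (1/5 of the original chunk)
			runtime.Gosched()   // give the scheduler (and the GC) a chance to stop this goroutine
		}
	}()
}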
Then our Tmax_stw would turn from GOMAXPROCS*1 ms into GOMAXPROCS/5 ms afterwards :) (If your GOMAXPROCS is very big, just split do_cpu_staff even further. You could even implement some logic that lets do_cpu_staff split its tasks automatically, based on the value of GOMAXPROCS and another parameter named time_cost_wanted, to solve this problem for good.)
revision1
Sorry, I might have made a mistake: Tmax_stw = sum{top max N on-processor time slices}.
It should probably be: Tmax_stw = max{top max N on-processor time slices}.
Although I haven't dived into the Go scheduler's code yet, I guess the second is most likely the right answer, because the first algorithm is just too inefficient and is unlikely to be used by the Go scheduler.
But no matter which one is right, the first or the second, it does not fundamentally change the conclusion we made before.
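As a concrete, hypothetical illustration using the t1 = 12, t2 = 11, t3 = 10 values from the timeline sketched earlier in this thread (say, in milliseconds):

Tmax_stw (1st formula) = 12 + 11 + 10 = 33 ms
Tmax_stw (2nd formula) = max{12, 11, 10} = 12 ms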
Example
And here is an example of the usage of go tool trace -diagreedy.
Sample code to diagnose:
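(The actual sample source is not reproduced here; the following is only a rough, hypothetical reconstruction of its shape: godeadloop0 burns CPU in call-free chunks of granul iterations, so a larger granul means longer uninterruptible on-processor slices.)

var granul = 100000000 // hypothetical initial value; large => long uninterruptible slices

func godeadloop0() {
	var ct uint32
	for {
		for i := 0; i < granul; i++ { // call-free inner loop: no preemption point inside
			ct++
		}
		runtime.Gosched() // only one yield per granul iterations
	}
}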
Diagnosis for the 1st time:
And then we can find the source of the long GC pause, godeadloop0, and change the variable granul to the value 10000. Test again:
As you can see, the typical GC pause finally changed from 8ms to 0.4ms. Of course the situation would be much more complex in a real, big project, but the key idea of the tracking is the same and clear.
I also implemented a 'dump' option, like this below:
I would be very glad to push it to the golang project so it can benefit more people, if this proposal is accepted.
Many thanks for your time and good patience.