Open fd opened 8 years ago
/cc @aclements
I'm going to tentatively mark this as a feature request for runtime, instead of a proposal, since it seems pretty uncontroversial to me.
It doesn't seem like a very interesting number; it'll always be less than or equal to GOMAXPROCS.
On Wed, 14 Sep 2016, 00:43 Quentin Smith notifications@github.com wrote:
I'm going to tentatively mark this as a feature request for runtime, instead of a proposal, since it seems pretty uncontroversial to me.
@davecheney The suggestion counts runnable goroutines, so it can be larger than GOMAXPROCS.
My concern is that I don't see that this adds anything very useful over NumGoroutine. If you are worried about shedding load then I don't see why you want to ignore the system goroutines. And there aren't very many system goroutines anyhow, so if you are in a condition where load shedding is relevant they are just going to be a rounding error.
I think more importantly he doesn't want to count goroutines which are blocked (which NumGoroutine does count).
Oh, I see, but then the goroutines in state _Gsyscall are ambiguous, as they could be blocked.
Indeed. Only things blocked in Go (select{}, ...) would not be counted if we used the raw goroutine states.
@ianlancetaylor: The suggestion counts runnable goroutines, so it can be larger than GOMAXPROCS.
That is correct.
@ianlancetaylor: My concern is that I don't see that this adds anything very useful over NumGoroutine.
Using NumGoroutine breaks down when you have long running goroutines that do background work (like periodically refreshing cache entries). This approach also breaks down for proxy servers as they spend most of their time waiting on the network.
@ianlancetaylor: If you are worried about shedding load then I don't see why you want to ignore the system goroutines.
Based on the POC that I made, it seems that at least some system goroutines always appear active. I could be wrong here, as they might be blocking on a syscall (like, say, the netpoller).
The other reason I think system goroutines should be excluded is because NumGoroutine also excludes them.
@ianlancetaylor: Oh, I see, but then the goroutines in state _Gsyscall are ambiguous, as they could be blocked.
That is correct. Maybe g.waitreason should be taken into account too?
Otherwise the _Gsyscall state could be excluded.
@randall77: Only things blocked in Go (select{}, ...) would not be counted if we used the raw goroutine states.
How can this be detected?
Here is the POC code I wrote: https://gist.github.com/fd/7136de67a56e174d8c06cb505f7278aa
Goroutines blocked in the Go runtime have Gwaiting state. You probably don't want to count those, they contribute nothing to CPU load (but do consume some memory).
It is not clear whether you should count goroutines in the Gsyscall state. Whether you want to count them depends on whether they are doing real work in the syscall (reading a large file, say) or waiting (read on an idle network socket). I don't think the runtime has the information needed to make that call, although we might be able to make some approximation. That's what makes this problem hard.
So, how about this:
_Gsyscall goroutines (except for the system goroutines, which should be excluded).
So unless you are heavily using something like gopkg.in/fsnotify.v1, NumActiveGoroutine should be a decent approximation of the actual workload.
Including _Gsyscall should be a good starting point for NumActiveGoroutine.
The runtime could be extended to record the called syscall in G.
Then the syscall package could be extended with a list of syscalls that result in some form of idling.
Given these changes, NumActiveGoroutine can decide whether to consider the goroutine active or not. Syscalls called from cgo are still hidden in this scenario.
Remember, it is not my goal to find an accurate estimation of the CPU utilisation. Instead, it is my goal to find a good-enough estimation of the application utilisation. I included an excerpt from Site Reliability Engineering: How Google Runs Production Systems, which seems to suggest that Google uses a similar metric/approach.
The utilization signals we use are based on the state local to the task (since the goal of the signals is to protect the task) and we have implementations for various signals. The most generally useful signal is based on the “load” in the process, which is determined using a system we call executor load average.
To find the executor load average, we count the number of active threads in the process. In this case, “active” refers to threads that are currently running or ready to run and waiting for a free processor. We smooth this value with exponential decay and begin rejecting requests as the number of active threads grows beyond the number of processors available to the task. That means that an incoming request that has a very large fan-out (i.e., one that schedules a burst of a very large number of short-lived operations) will cause the load to spike very briefly, but the smoothing will mostly swallow that spike. However, if the operations are not short-lived (i.e., the load increases and remains high for a significant amount of time), the task will start rejecting requests.
Using NumGoroutine breaks down when you have long running goroutines that do background work (like periodically refreshing cache entries). This approach also breaks down for proxy servers as they spend most of their time waiting on the network.
As you say, you are looking for an approximation, and you care about load shedding. Unless you start a long running goroutine for each incoming request, the number of long running goroutines should be a tiny fraction of the total number of goroutines, and are therefore ignorable for approximation purposes.
I agree that proxy servers are a problem.
Since you have proof of concept code, do you have a way to see the difference between NumGoroutine and NumActiveGoroutine for a large server?
I would be less concerned about adding NumActiveGoroutines if it weren't for the ambiguity about _Gsyscall. I'm worried about how to document what the result really means for programs that call C code. It's probably unusual to call C code that makes direct network calls, but it's not in the least unusual to call C code that uses the file system, which may be networked, or that uses a library that in turn makes DNS lookups or in some other way uses the network. So while NumActiveGoroutines is easy to understand for pure Go code, I don't see how it's easily generalizable for Go programs that call C code.
One possibility would be to return two numbers: the number of running/runnable goroutines and the number of goroutines waiting for a system call or C code. But that seems to me to be too tied to the current details of how system calls and cgo are implemented.
I assume you are looking for some sort of general framework here, because for any specific program that wants to do load shedding I would say just count the number of active requests.
The problem NumActiveGoroutines is trying to solve is when to shed load. Wouldn't monitoring the latency of an application request be a more direct and ultimately more correct way to do this? If latency increases, shed load. If latency improves, increase load.
Is there a use case where this doesn't work but NumActiveGoroutines does?
Discussing the nuances of what _Gidle, _Grunnable, _Grunning, _Gsyscall, and _Gwaiting, plus what _Gscanrunning, _Gscanrunnable, _Gscansyscall, and _Gscanidle, mean in this context is a very implementation-dependent discussion.
Even NumGoroutines does not capture all the work C is doing; the C code may have spawned threads that are independently doing work as well.
I think it's reasonable to say that goroutines in C are not active from the perspective of Go, regardless of what they're calling.
This is not uncontroversial.
CL https://golang.org/cl/38180 mentions this issue.
Side-stepping whether or not we want this: now that we have a runtime metrics API (#37112), if we do add these kinds of metrics, the metrics API will be the obvious place for them rather than a new function.
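For reference, reading a scheduler metric through runtime/metrics looks roughly like this. The sketch only uses /sched/goroutines:goroutines, the total live goroutine count; whether a separate metric for running or runnable goroutines is available depends on the Go version and is not assumed here. The goroutineCount helper is made up for the example.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// goroutineCount (hypothetical helper) reads the live goroutine count
// through the runtime/metrics API.
func goroutineCount() (uint64, bool) {
	sample := []metrics.Sample{{Name: "/sched/goroutines:goroutines"}}
	metrics.Read(sample)
	if sample[0].Value.Kind() != metrics.KindUint64 {
		return 0, false // metric not supported by this Go version
	}
	return sample[0].Value.Uint64(), true
}

func main() {
	if n, ok := goroutineCount(); ok {
		fmt.Println("goroutines:", n)
	}
}
```

A new metric would be read the same way, by adding its name to the sample slice; metrics.All lists the names a given runtime supports.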
See also #15490.
cc @mknyszek
Summary
I'd like to propose a way to expose the number of active (running + runnable) goroutines.
Background
My primary use case for this metric is to estimate application load (num-active-goroutines / num-cpu) in order to implement load shedding. Other metrics, like the times() syscall, don't expose application overload and don't work well in the presence of noisy neighbours.
Plan
Currently the runtime package includes runtime.NumGoroutine() int, which returns the number of live, non-system goroutines.
The runtime package could be extended to include runtime.NumActiveGoroutine() int.
NumActiveGoroutine() should count all goroutines where isSystemGoroutine() is false and where status is _Grunnable|_Grunning|_Gsyscall.
It seems that such a function would need to acquire sched.lock and allglock. This could have some performance implications.
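Without changing the runtime, a rough user-space approximation of this count can be obtained by parsing the goroutine headers in a runtime.Stack dump. This is a sketch only, not the POC from the gist: it assumes the textual header format `goroutine N [state]:`, which is not a stable API, and runtime.Stack with all=true stops the world, so it is far too expensive to call on every request. The function names are made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"runtime"
)

// header matches lines such as "goroutine 7 [chan receive, 2 minutes]:"
// and captures the state ("chan receive"). The format is an assumption.
var header = regexp.MustCompile(`(?m)^goroutine \d+ [^\[\n]*\[([^\],]+)`)

// countByState returns how many goroutines a stack dump reports per state.
func countByState(dump []byte) map[string]int {
	counts := make(map[string]int)
	for _, m := range header.FindAllSubmatch(dump, -1) {
		counts[string(m[1])]++
	}
	return counts
}

// allStacks returns a dump of all goroutine stacks, growing the buffer
// until the dump fits.
func allStacks() []byte {
	buf := make([]byte, 64<<10)
	for {
		n := runtime.Stack(buf, true)
		if n < len(buf) {
			return buf[:n]
		}
		buf = make([]byte, 2*len(buf))
	}
}

// numActiveApprox approximates running + runnable + syscall goroutines.
func numActiveApprox() int {
	c := countByState(allStacks())
	return c["running"] + c["runnable"] + c["syscall"]
}

func main() {
	fmt.Println("approx. active goroutines:", numActiveApprox())
	fmt.Println("NumGoroutine:", runtime.NumGoroutine())
}
```

Like the proposal, this treats goroutines in a syscall as active, so it inherits the same ambiguity about blocking syscalls and cgo calls.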