Attendees: @mknyszek @aclements @prattmic @felixge @nsrip-dd @rhysh
Notes:
I'll miss the Dec 22nd meetup because I'm traveling for the holidays. That being said, if I find time I might also look into https://github.com/golang/go/issues/57159 . Getting a proof of concept for Perfetto UI integration (ideally using their protocol buffer format) is probably more important than the gentraceback refactoring at this point. I just tried to work with a 300 MB trace (15s of prod activity) yesterday, and it was a real eye-opener to how much the current UI struggles.
I don't know if it's relevant (probably nothing new for the folks on this thread), but I had similar problems with the go tool trace viewer, where it would freeze on me all the time, especially in the per-goroutine view (/trace?goid=N). I figured out you can download perfetto-compatible JSON data from /jsontrace?goid=N (/jsontrace gives the default view). This can then be uploaded to ui.perfetto.dev. This doesn't show all the information in the trace, so it's not as great, but I was glad to have something that worked.
would the pprof labels also show up in goroutine traces?
I'm working on a PoC that improves native stack unwinding on Windows by adding additional information to the PE file. This will help debugging with WinDbg and profiling with Windows Performance Analyzer. Would this work fit into the effort tracked by this issue?
@thediveo I think that might be a good question for #56295, or you could file another issue. Off the top of my head, that doesn't sound like it would be too difficult to do.
@qmuntal Oh neat! That's awesome. I think it's a little tangential to the work we're proposing here, unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback). Then again, if one of the goals is better integration with the Windows Performance Analyzer, that's certainly more in the same spirit. Do you have an issue for tracking that already?
Do you have an issue for tracking that already?
I still have to prepare the proposal, I plan to submit it next week.
unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback).
Not for now, but once I finish this I want to investigate how feasible it is to unwind native code and merge it with the Go unwinding, in case the exception happens in a non-Go module.
Do you have an issue for tracking that already?
I do now #57302 😄
Change https://go.dev/cl/459095 mentions this issue: sweet: add support for execution traces and measuring trace overhead
Attendees: @mknyszek @aclements @prattmic @bboreham @rhysh @dominikh
name old time/op new time/op delta
BiogoIgor 17.7s ± 3% 17.5s ± 4% ~ (p=0.190 n=10+10)
BiogoKrishna 15.1s ± 4% 15.1s ± 4% ~ (p=0.739 n=10+10)
BleveIndexBatch100 5.78s ± 7% 5.76s ±11% ~ (p=0.853 n=10+10)
BleveQuery 2.37s ± 0% 2.37s ± 0% -0.26% (p=0.016 n=8+10)
FoglemanFauxGLRenderRotateBoat 16.9s ± 9% 16.9s ± 7% ~ (p=0.796 n=10+10)
FoglemanPathTraceRenderGopherIter1 36.7s ± 1% 44.4s ± 2% +21.01% (p=0.000 n=10+10)
GoBuildKubelet 47.0s ± 2% 48.8s ± 3% +3.72% (p=0.000 n=10+10)
GoBuildKubeletLink 8.89s ± 2% 8.88s ± 4% ~ (p=0.720 n=10+9)
GoBuildIstioctl 45.9s ± 1% 47.8s ± 2% +4.09% (p=0.000 n=10+10)
GoBuildIstioctlLink 9.07s ± 2% 8.99s ± 2% ~ (p=0.095 n=10+9)
GoBuildFrontend 15.7s ± 4% 16.1s ± 2% +2.45% (p=0.043 n=10+10)
GoBuildFrontendLink 1.38s ± 2% 1.37s ± 3% ~ (p=0.529 n=10+10)
GopherLuaKNucleotide 27.9s ± 0% 27.9s ± 1% ~ (p=0.853 n=10+10)
MarkdownRenderXHTML 256ms ± 2% 256ms ± 2% ~ (p=1.000 n=9+9)
Tile38WithinCircle100kmRequest 618µs ± 7% 657µs ±10% +6.30% (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest 722µs ± 6% 773µs ± 4% +6.96% (p=0.000 n=10+9)
Tile38KNearestLimit100Request 508µs ± 3% 532µs ± 3% +4.73% (p=0.000 n=10+10)
name old average-RSS-bytes new average-RSS-bytes delta
BiogoIgor 68.8MB ± 2% 71.8MB ± 4% +4.40% (p=0.000 n=10+10)
BiogoKrishna 4.42GB ± 0% 4.42GB ± 0% ~ (p=0.739 n=10+10)
BleveIndexBatch100 194MB ± 2% 198MB ± 3% +1.91% (p=0.008 n=9+10)
BleveQuery 536MB ± 0% 537MB ± 1% ~ (p=0.190 n=10+10)
FoglemanFauxGLRenderRotateBoat 444MB ± 1% 446MB ± 0% +0.41% (p=0.035 n=10+9)
FoglemanPathTraceRenderGopherIter1 132MB ± 1% 142MB ± 4% +7.61% (p=0.000 n=10+10)
GoBuildKubelet 1.75GB ± 1% 1.85GB ± 1% +5.51% (p=0.000 n=10+10)
GoBuildIstioctl 1.35GB ± 1% 1.42GB ± 1% +5.49% (p=0.000 n=10+9)
GoBuildFrontend 511MB ± 2% 543MB ± 1% +6.31% (p=0.000 n=10+9)
GopherLuaKNucleotide 37.0MB ± 1% 40.4MB ± 2% +9.24% (p=0.000 n=9+10)
MarkdownRenderXHTML 21.8MB ± 3% 24.0MB ± 3% +10.14% (p=0.000 n=9+8)
Tile38WithinCircle100kmRequest 5.40GB ± 1% 5.38GB ± 1% ~ (p=0.315 n=10+10)
Tile38IntersectsCircle100kmRequest 5.72GB ± 1% 5.71GB ± 1% ~ (p=0.971 n=10+10)
Tile38KNearestLimit100Request 7.26GB ± 0% 7.25GB ± 0% ~ (p=0.739 n=10+10)

name old peak-RSS-bytes new peak-RSS-bytes delta
BiogoIgor 95.9MB ± 4% 98.5MB ± 3% +2.70% (p=0.030 n=10+10)
BiogoKrishna 4.49GB ± 0% 4.49GB ± 0% ~ (p=0.356 n=9+10)
BleveIndexBatch100 282MB ± 3% 284MB ± 4% ~ (p=0.436 n=10+10)
BleveQuery 537MB ± 0% 538MB ± 1% ~ (p=0.579 n=10+10)
FoglemanFauxGLRenderRotateBoat 485MB ± 1% 483MB ± 0% ~ (p=0.388 n=10+9)
FoglemanPathTraceRenderGopherIter1 180MB ± 2% 193MB ± 3% +7.19% (p=0.000 n=10+10)
GopherLuaKNucleotide 39.8MB ± 3% 46.0MB ±20% +15.56% (p=0.000 n=9+10)
MarkdownRenderXHTML 22.1MB ± 3% 25.5MB ± 7% +15.45% (p=0.000 n=9+10)
Tile38WithinCircle100kmRequest 5.70GB ± 1% 5.68GB ± 1% -0.45% (p=0.023 n=10+10)
Tile38IntersectsCircle100kmRequest 5.93GB ± 1% 5.91GB ± 2% ~ (p=0.631 n=10+10)
Tile38KNearestLimit100Request 7.47GB ± 1% 7.46GB ± 0% ~ (p=0.579 n=10+10)

name old peak-VM-bytes new peak-VM-bytes delta
BiogoIgor 802MB ± 0% 803MB ± 0% +0.11% (p=0.000 n=10+10)
BiogoKrishna 5.24GB ± 0% 5.24GB ± 0% +0.01% (p=0.001 n=10+10)
BleveIndexBatch100 1.79GB ± 0% 1.79GB ± 0% +0.05% (p=0.000 n=8+8)
BleveQuery 3.53GB ± 1% 3.53GB ± 1% ~ (p=0.237 n=10+10)
FoglemanFauxGLRenderRotateBoat 1.21GB ± 0% 1.16GB ± 4% ~ (p=0.163 n=8+10)
FoglemanPathTraceRenderGopherIter1 875MB ± 0% 884MB ± 0% +1.02% (p=0.000 n=10+10)
GopherLuaKNucleotide 733MB ± 0% 734MB ± 0% +0.11% (p=0.000 n=9+10)
MarkdownRenderXHTML 733MB ± 0% 734MB ± 0% +0.10% (p=0.000 n=10+9)
Tile38WithinCircle100kmRequest 6.42GB ± 0% 6.39GB ± 1% ~ (p=0.086 n=8+10)
Tile38IntersectsCircle100kmRequest 6.62GB ± 1% 6.61GB ± 2% ~ (p=0.927 n=10+10)
Tile38KNearestLimit100Request 8.16GB ± 1% 8.18GB ± 0% ~ (p=0.649 n=10+8)

name old p50-latency-ns new p50-latency-ns delta
Tile38WithinCircle100kmRequest 144k ± 3% 159k ± 3% +10.56% (p=0.000 n=9+9)
Tile38IntersectsCircle100kmRequest 215k ± 1% 232k ± 2% +7.91% (p=0.000 n=9+10)
Tile38KNearestLimit100Request 347k ± 2% 373k ± 1% +7.21% (p=0.000 n=10+10)

name old p90-latency-ns new p90-latency-ns delta
Tile38WithinCircle100kmRequest 908k ± 6% 956k ± 9% +5.22% (p=0.043 n=10+10)
Tile38IntersectsCircle100kmRequest 1.07M ± 4% 1.11M ± 5% +4.33% (p=0.001 n=10+10)
Tile38KNearestLimit100Request 1.03M ± 3% 1.05M ± 4% +2.64% (p=0.011 n=10+10)

name old p99-latency-ns new p99-latency-ns delta
Tile38WithinCircle100kmRequest 7.55M ± 9% 7.93M ±13% ~ (p=0.089 n=10+10)
Tile38IntersectsCircle100kmRequest 7.81M ± 8% 8.39M ± 2% +7.36% (p=0.000 n=10+8)
Tile38KNearestLimit100Request 2.03M ± 4% 2.08M ± 5% +2.52% (p=0.019 n=10+10)

name old ops/s new ops/s delta
Tile38WithinCircle100kmRequest 9.73k ± 7% 9.16k ±11% -5.83% (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest 8.31k ± 6% 7.77k ± 4% -6.55% (p=0.000 n=10+9)
Tile38KNearestLimit100Request 11.8k ± 3% 11.3k ± 3% -4.51% (p=0.000 n=10+10)
* Introduction: Bryan Boreham, Grafana Labs
* Questions within the team about whether useful information has been derived from Go execution traces.
* Phlare: continuous profiling. Interested in linking together various signals (distributed tracing, profiling)
* Michael K: Interesting data point about usability.
* Michael P: Hard to link application behavior to trace.
* Bryan: Example: channels. Still don't really know where to find that data.
* Dominik: One of the reasons I started on gotraceui was to surface more information and do more automatic inference and analysis of the data.
* Rhys: Execution trace technique: get data out of them to find the interesting traces. Try to extract features that would be interesting up-front.
* Starts with internal trace parser. Have code to find start and end of HTTP requests, DNS lookups, etc.
* Tooling on the way to get open sourced.
* Heap analysis plan (#57447)
* Austin: Additional context is we're confident in the API we're planning to export, as opposed to tracing which we have nothing for yet.
* https://go.dev/issue/57307 proposal: cmd/trace: visualize time taken by syscall
* Austin: Does Perfetto do better with instantaneous events?
* Michael P: Yes, there's a 20px wide arrow but we have so many.
* Rhys: Hold shift, draw a box. If you aim well, you get what you want.
* Rhys: Why is there only one timestamp on some events?
* Austin: We can add another timestamp.
* Michael P: Syscall fast path does a lot less.
* pprof labels in traces
* Michael K: I think I've unblocked Nick. Michael and I are reviewing.
* `runtime.gentraceback` cleanup
* Austin: Back and forth on the issue about making it an iterator, sent out CLs, not tested yet.
* Next meeting: Jan 5th, Michael P and Michael K won't be here, so Austin will run it.
* Action items:
* We're slowing down for the holidays, so no strong expectations
* Michael K:
* Try to land execution trace benchmarking.
* Might look into heap analysis stuff.
* After break, might want to start working on trace format more seriously.
* Happy holidays!
Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham vnedkov @dashpole
Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham @mknyszek @prattmic @dominikh @dashpole
go tool trace (bugs), e.g. timelines not named correctly, events not connected correctly.
Attendees: @aclements @felixge @nsrip-dd @thepudds @bboreham @dashpole @mknyszek @prattmic
Felix: The prototype is missing inline expansion, support for SetCgoTraceback (Go -> C -> Go), and there are dragons in the compiler where the FP isn't on the stack when it should be. The previous implementation hit this, and I suspect I hit this as well.
FYI: #57302 is hitting this as well, as I'm implementing SEH unwinding using the frame pointer. Whatever the fix for that ends up being, it would be good to take SEH into account as well.
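For readers less familiar with the technique under discussion, here is a minimal sketch of what frame-pointer unwinding looks like on amd64. The function name and layout assumptions are mine, not the prototype's: each frame's saved frame pointer links to the caller's frame, with the return address stored one word above it.

```go
package fpunwind

import "unsafe"

// fpUnwind is a rough illustration (not the actual prototype): starting from a
// frame pointer, it walks the chain of saved frame pointers, collecting the
// return address stored one word above each saved pointer. A real unwinder
// also has to handle inlined frames, cgo transitions (SetCgoTraceback), and
// frames where the FP convention isn't honored, which is where the "dragons"
// mentioned above live.
func fpUnwind(fp uintptr, pcs []uintptr) int {
	n := 0
	for fp != 0 && n < len(pcs) {
		// The return address sits just above the saved frame pointer.
		pc := *(*uintptr)(unsafe.Pointer(fp + unsafe.Sizeof(fp)))
		if pc == 0 {
			break
		}
		pcs[n] = pc
		n++
		// Follow the saved frame pointer to the caller's frame.
		fp = *(*uintptr)(unsafe.Pointer(fp))
	}
	return n
}
```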
Attendees: @mknyszek @aclements @felixge @nsrip-dd @prattmic @dominikh @thepudds @pmbauer @dashpole @rhysh
$ benchstat -col '.name@(CPUTicks Nanotime)' /tmp/bench
goos: linux
goarch: amd64
pkg: runtime
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
│ CPUTicks │ Nanotime │
│ sec/op │ sec/op vs base │
*-8 10.75n ± 0% 16.11n ± 0% +49.88% (p=0.000 n=20)
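For context, a comparison along these lines can be reproduced outside the runtime with go:linkname pulls of the internal clock functions. The sketch below is an assumption about the setup, not the benchmark actually used; pull-style linkname needs an empty .s file in the package, and newer Go releases may require relaxing the linkname checks (e.g. -ldflags=-checklinkname=0).

```go
package clockbench

import (
	"testing"
	_ "unsafe" // required for go:linkname
)

// Hypothetical shims pulling in the runtime's internal clocks.
// A (possibly empty) .s file must exist in the package so the compiler
// accepts these bodyless declarations.
//
//go:linkname nanotime runtime.nanotime
func nanotime() int64

//go:linkname cputicks runtime.cputicks
func cputicks() int64

var sink int64

func BenchmarkCPUTicks(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = cputicks()
	}
}

func BenchmarkNanotime(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = nanotime()
	}
}
```

Running `go test -bench . -count 20` and feeding the output to `benchstat -col '.name@(CPUTicks Nanotime)'` produces a comparison in the shape shown above.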
Attendees: @mknyszek @prattmic @felixge @nsrip-dd @aclements @thepudds @rhysh @bboreham
Michael K: I'm 70% of the way to a trace v2 (producer, consumer, trace format), and 40% of the way to writing it up.
[Michael K proceeds to go into way too much detail about this. Highlights below. A public document will follow.]
Let's use the system clock (e.g. clock_gettime) instead of RDTSC (for a number of reasons).
There are a very small number of places where you really need to understand the exact order of events. The current tracer takes advantage of that and I believe we need to retain this. Timestamps aren't enough.
Attach traces to Ms, not Ps. There’s a lot of complexity around GoSysExit racing with trace start. Thinking about ragged start and making the parser robust to that.
Trace binary format ended up being more about consumer efficiency than producer efficiency, but still more efficient on both sides.
Use /proc/<pid>/maps for binary/symbol information.
Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings
Does this include the current state of all (relevant) goroutines? The current parser is essentially a state machine and we need to see all previous events to reconstruct a global timeline. I don't see that going away with the new format.
Michael K: I also have a rougher draft of a trace parser API, with input from Michael Pratt.
I'd encourage you to take a look at https://github.com/dominikh/gotraceui/blob/04107aeaa72e30c50bb6d10e9f2b6ca384fafc3d/trace/parser.go#L18-L77 for the data layout I've chosen in gotraceui. It's nothing groundbreaking, but it highlights the need to avoid the use of pointers.
Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings
Does this include the current state of all (relevant) goroutines? The current parser is essentially a state machine and we need to see all previous events to reconstruct a global timeline. I don't see that going away with the new format.
It does not. It only cares about the initial state of all Ms (including goroutines running on them), and generally only mentions goroutines that actually emit events. For goroutines that aren't running, there are only two cases where we actually care about the initial state of a goroutine: whether it was blocked, or whether it was waiting. In both cases it's straightforward to infer the state of the goroutine from the events that must happen to transition goroutines out of these states: unblocking and starting to run.
The trace still needs to indicate if a goroutine (and M) is in a syscall or if it's running. In the new design, this information is emitted together at the first call into the tracer by that M for that partition. The timestamp needs to be back-dated to the start of the partition. There's some imprecision with this back-dating, but it's only relevant at the very start of a trace. The worst case is that a goroutine may appear to have been running or in a syscall at the start of a trace for longer than it actually was. The amount of imprecision here is bounded by the time delta between the global (serialized) declaration of a new partition and when an M has its buffer flushed and/or is notified (via an atomic) that tracing has started, which I expect in general to be very short and non-blocking. (We can also explicitly bound the time by telling the M what time it was contacted for a new partition.)
Note that the details above imply that when a new partition starts, a running M may have been in a tight loop and so hasn't emitted any events for the last partition, in which case we need to preempt it to have it dump its initial state. Generally, moving partitions forward doesn't have to even involve preemption.
Michael K: I also have a rougher draft of a trace parser API, with input from Michael Pratt.
I'd encourage you to take a look at https://github.com/dominikh/gotraceui/blob/04107aeaa72e30c50bb6d10e9f2b6ca384fafc3d/trace/parser.go#L18-L77 for the data layout I've chosen in gotraceui. It's nothing groundbreaking, but it highlights the need to avoid the use of pointers.
That seems useful for the current trace format, thanks. For the new format, I don't expect to expand the trace events out of their encoded form at all, but rather decode them lazily (either copy them out wholesale or just point into the encoded trace data in the input buffer, both of which are cheap from the perspective of the GC).
In both cases it's straightforward to infer the state of the goroutine from the events that must happen to transition goroutines out of these states: unblocking and starting to run.
That has two implications, however:
I realize that with self-contained partitions it isn't feasible to include the state of all goroutines in all partitions, but maybe it should optionally be possible to dump complete state in the first partition, for users who want a complete view? However that wouldn't really fit into an M-centric format…
That seems useful for the current trace format, thanks. For the new format, I don't expect to expand the trace events out of their encoded form at all, but rather decode them lazily (either copy them out wholesale or just point into the encoded trace data in the input buffer, both of which are cheap from the perspective of the GC).
I feel like the current parser + its types and the new approach you describe are at two different layers of abstraction. The current parser isn't exposing raw events. Instead it is doing a fair bit of processing of arguments, and it populates Link fields, which point to related events. Your approach sounds a lot closer to just casting from []byte to a type describing the raw events. And there'll still need to be a layer of abstraction on top of that that can be consumed by users (unless you expect them to build their own, which would work for me, but be a barrier to entry for people less familiar with the underlying file format.)
That has two implications, however:
- goroutines that don't unblock during the trace will be unaccounted for
- the states of all goroutines can't be determined without looking at the entire trace
I realize that with self-contained partitions it isn't feasible to include the state of all goroutines in all partitions, but maybe it should optionally be possible to dump complete state in the first partition, for users who want a complete view?
Both of those things are good points.
Dumping the state of the world at the start is one option but I'm also reluctant to do anything around this because it adds a lot of overhead. Interrogating every goroutine can take a while, and the world needs to be effectively stopped while it happens (or the synchronization will get really complicated). At the end of the day, my gut feeling is that the execution trace should focus solely on what's necessary for tracing execution, not what could execute.
However, I can definitely see that getting the information you describe has utility and we don't want to lose that. In the last meeting we discussed how goroutine profiles could be used to fill this gap. As a baseline, it should be fairly straightforward to correlate a goroutine profile's STW timestamp with a STW event in the trace. Taking that one step further, we could explicitly mention that the STW was for a goroutine profile in the trace. (In theory we could also dump the goroutine profile into the trace, like we do with CPU samples. I am not opposed to this, but I probably wouldn't do it to start with.)
You should be able to get a close approximation to the current behavior by starting a trace and then immediately grabbing a goroutine profile. Does that sound reasonable? Perhaps there's some use-case I've totally missed. FTR, I fully recognize that we're losing something here in the trace, but I'd argue the net benefit is worth that cost.
Also I just want to disclaim the design details in the last paragraph: subject to change in the first document draft. :) That's just where my head's at right now. It may turn out that the per-M synchronization I have in mind is too complex.
However that wouldn't really fit into an M-centric format…
I think it works fine if, like I mention above, we're willing to give a little bit of leeway. Maybe you don't have a snapshot of the state of all goroutines at the moment the trace starts, but you have one from very soon after the trace starts, which is probably good enough?
I feel like the current parser + its types and the new approach you describe are at two different layers of abstraction. The current parser isn't exposing raw events. Instead it is doing a fair bit of processing of arguments, and it populates Link fields, which point to related events. Your approach sounds a lot closer to just casting from []byte to a type describing the raw events. And there'll still need to be a layer of abstraction on top of that that can be consumed by users (unless you expect them to build their own, which would work for me, but be a barrier to entry for people less familiar with the underlying file format.)
That's another good point. To be clear, I do plan to have an API with some level of abstraction and not quite just []byte-to-type. :) Events will be opaque and fields will be accessed through methods, so we have a lot of wiggle room. However, something like the Link field I think requires keeping the whole trace in memory, because you never know when someone might want to access an event from a long long time ago (though I haven't thought this through). In theory an accessor can be arbitrarily complicated and even re-parse the trace to find the event, I suppose. :P
My general hope and expectation is that the vast majority of users should never have to look at the API at all, and instead rely on tools built with it. And those that do use the API don't need to understand the file format, just the execution model it presents (which I think is somewhat unavoidable).
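To make the "opaque events with accessor methods" idea concrete, here is a hypothetical shape such an API could take. Every name here (Kind, Event, Time, partitionTable) is an assumption for illustration, not the proposed API, and the byte layout is invented.

```go
package traceparse

import "encoding/binary"

// Kind identifies what an event represents.
type Kind uint8

const (
	KindGoCreate Kind = iota
	KindGoStart
	KindGoBlock
	// ...
)

// Event is an opaque handle over encoded trace data. Accessors decode lazily
// from the underlying buffer, so holding an Event doesn't require expanding
// the whole trace into memory, and the data it points into is GC-cheap.
type Event struct {
	table *partitionTable // shared string/stack tables for this partition
	data  []byte          // slice into the encoded trace buffer
}

// Kind reads the event's tag byte.
func (e Event) Kind() Kind { return Kind(e.data[0]) }

// Time decodes the varint timestamp that (in this sketch) follows the tag.
func (e Event) Time() int64 {
	ts, _ := binary.Uvarint(e.data[1:])
	return int64(ts)
}

// partitionTable holds per-partition string and stack tables, so each
// partition stays self-contained.
type partitionTable struct {
	strings []string
	stacks  map[uint64][]uint64
}
```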
Dumping the state of the world at the start is one option but I'm also reluctant to do anything around this because it adds a lot of overhead. Interrogating every goroutine can take a while, and the world needs to be effectively stopped while it happens (or the synchronization will get really complicated).
I think not having to STW and enumerate all goroutines was one of the design goals, as it didn't scale well. I take it the ragged barrier approach didn't pan out?
At the end of the day, my gut feeling is that the execution trace should focus solely on what's necessary for tracing execution, not what could execute.
One use case of looking at execution traces as they are now is debugging synchronization issues. Imagine having an N:M producer/consumer model using goroutines and channels, and we're debugging why producers are blocking. The reason might be that all of the consumers are stuck, which is only evident if we can see them be stuck. If they're already stuck at the beginning of the trace then they would be invisible in the new implementation.
More generally speaking, a lot of users aren't interested in the per-P or per-M views and instead want to see what each goroutine is doing (see also the per-goroutine timelines in gotraceui.) It turns out that per-G views are useful for debugging correctness and performance issues in user code and that traces aren't only useful for debugging the runtime.
You should be able to get a close approximation to the current behavior by starting a trace and then immediately grabbing a goroutine profile. Does that sound reasonable?
In theory that sounds fine, assuming goroutine profiles are proper STW snapshots? Otherwise it would probably be difficult to synchronize the trace and the profile.
At least this would give people the choice if they want to tolerate STW for more detailed traces.
However that wouldn't really fit into an M-centric format…
I think it works fine if, like I mention above, we're willing to give a little bit of leeway. Maybe you don't have a snapshot of the state of all goroutines at the moment the trace starts, but you have one from very soon after the trace starts, which is probably good enough?
Probably, yeah.
I think not having to STW and enumerate all goroutines was one of the design goals, as it didn't scale well. I take it the ragged barrier approach didn't pan out?
It's not quite that it didn't pan out and more that it just doesn't work with a per-M approach given other design constraints.
The ragged barrier I mentioned in an earlier design sketch is the forEachP one, which is ultimately still P-focused. Part of the reason I want to switch to a per-M approach is to remove the GoSysExit complexity that comes from the fact that goroutines can in fact run without Ps sometimes. That complexity is part of the event's semantics, so it tends to leak everywhere.
A per-M approach can side-step a lot of that complexity, but it means we need a way to synchronize all Ms that doesn't involve waiting until the M gets back into the scheduler. What I wrote above is a rough sketch of a proposed lightweight synchronization mechanism that most of the time doesn't require preemption. I think that in general we can't require preemption in a per-M approach if we want to be able to simplify the no-P edge cases and also get events out of e.g. sysmon, which always runs without a P. (EDIT: D'oh. I keep forgetting that the current tracer can indeed emit events without a P. So it's really more that we don't currently have a great way of tracking Ms in general. I would like to add more explicit M-related events. The GoSysExit point still stands because it races with a trace stop-the-world, which is the main source of complexity. If we synchronize via Ms, that goes away.)
(In effect, I am proposing to shift the GoSysExit complexity somewhere else, but I hope that in the end it will be less complexity overall, because the M synchronization details can probably be written in a way such that the details don't leak as much.)
An aside that might steer you closer to a per-M approach: I tried adding per-M timelines to gotraceui using the current format and found it impossible due to the current event sorting logic. I ran into scenarios where a P would start on an M while the M was still blocked in a syscall.
In theory that sounds fine, assuming goroutine profiles are proper STW snapshots? Otherwise it would probably be difficult to synchronize the trace and the profile.
Yes, goroutine profiles are STW snapshots, but the duration of the STW pause does not vary based on the number of goroutines. Go 1.19 includes https://go.dev/cl/387415, which says "... do only a fixed amount of bookkeeping while the world is stopped. Install a barrier so the scheduler confirms that a goroutine appears in the profile, with its stack recorded exactly as it was during the stop-the-world pause, before it allows that goroutine to execute."
We'd want something like a "STW" event in the execution trace that we could tie back to the particular goroutine profile (maybe record the goroutine ID of the initiator and either a call stack or the reason string); although a protobuf-formatted goroutine profile will include a timestamp, it's determined by the runtime/pprof package after all of the stacks have been collected, rather than by the runtime package during the STW that the snapshot represents.
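As a concrete illustration of the workaround being discussed (start a trace, then immediately grab a goroutine profile so the two can be correlated), something like the following works with today's APIs; the file names and debug level are arbitrary choices, not part of any proposal.

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"runtime/trace"
)

func main() {
	tf, err := os.Create("exec.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer tf.Close()
	if err := trace.Start(tf); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// Snapshot all goroutine stacks right after tracing begins.
	// debug=0 writes the protobuf format; debug=2 would write a full
	// text dump of every goroutine's stack.
	gf, err := os.Create("goroutines.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer gf.Close()
	if err := pprof.Lookup("goroutine").WriteTo(gf, 0); err != nil {
		log.Fatal(err)
	}

	// ... run the workload being traced ...
}
```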
goroutines can in fact run without Ps sometimes
This goes for g0 as well: CPUSample events include the G and P (when they're available), but an M can run without either, such as when "spinning" near the bottom of findRunnable. That event should have included a reference to the M. As it is, it's tricky to attribute the on-CPU cost of spinning Ms, and to find the application behaviors that result in higher/lower costs there.
Having a tracing buffer available to the M could also simplify the way we get information from the SIGPROF handler into the trace.
Attendees: @mknyszek @felixge @bboreham @nsrip-dd @dominikh @rhysh @thepudds
Attendees: @prattmic @mknyszek @felixge @nsrip-dd @bboreham @thepudds
Attendees: @nsrip-dd @bboreham @thepudds @prattmic @mknyszek
name time/op
Varint-48 13.9ns ± 0%
Word-48 3.20ns ± 1%
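For reference, the comparison here is essentially LEB128-style varint encoding versus writing a fixed-width word. A minimal benchmark in that spirit (my own construction, not the one that produced the numbers above) might look like:

```go
package encbench

import (
	"encoding/binary"
	"testing"
)

var buf [10]byte // large enough for a max-length uvarint
var sinkN int

// BenchmarkVarint appends a uint64 using LEB128-style varint encoding.
func BenchmarkVarint(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sinkN = binary.PutUvarint(buf[:], uint64(i)|1<<40)
	}
}

// BenchmarkWord writes the same value as a fixed-width little-endian word.
func BenchmarkWord(b *testing.B) {
	for i := 0; i < b.N; i++ {
		binary.LittleEndian.PutUint64(buf[:8], uint64(i)|1<<40)
		sinkN = 8
	}
}
```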
@mknyszek @thepudds interesting overhead discussion ❤️! FWIW I don't think we'll be able to get overhead < 1% for pathological workloads (e.g. channel ping pong). Our internal target is < 1% overhead for 95% of applications. From what I've seen, I think we'll hit that with frame pointer unwinding.
@mknyszek thanks for all the experiments. The varint benchmark sounds promising! But we're still struggling to confirm it as a significant overhead source on our end. Looking forward to discussing it more at the next sync.
Sorry for missing the meeting, I'm on vacation right now 🏖️.
FWIW I don't think we'll be able to get overhead < 1% for pathological workloads (e.g. channel ping pong). Our internal target is < 1% overhead for 95% of applications. From what I've seen, I think we'll hit that with frame pointer unwinding.
I'd like to learn more about what the <1% looks like (is that while the tracer is enabled continuously?). But when you get back. :)
thanks for all the experiments. The varint benchmark sounds promising! But we're still struggling to confirm it as a significant overhead source on our end. Looking forward to discussing it more at the next sync.
That's fair, and I'm considering putting it to rest for now and just proceeding with everything else. These experiments were easy to run in the background since I was on rotation last week, but in hindsight I think I should've checked a couple of other things first. Specifically, I should've checked whether it's worth changing the encoding solely for the encoder-side costs at all (e.g. compared to the cost of nanotime?). I'll run these before the next meeting and I should have the answers I need on whether to pursue this further in the near-term.
Thinking more long-term, LEB128 is well-known for being relatively slow to decode, and I suspect in-process decode performance may become a bottleneck for some use-cases. I think there are still a lot of questions there but I also want to leave open as many doors as possible. I'll add a line to the agenda for next time about this.
(Since we're already overhauling the trace format, I'd like to make sure these kinds of use-cases are considered to avoid making more breaking changes than necessary down the line.)
Sorry for missing the meeting, I'm on vacation right now 🏖️.
No worries at all; enjoy your vacation! :)
I'd like to learn more about what the <1% looks like (is that while the tracer is enabled continuously?). But when you get back. :)
We're still working out the details, but yeah, ideally the overhead is <1% (cpu, memory, latency) for 95% of applications while the tracer is recording.
Thinking more long-term, LEB128 is well-known for being relatively slow to decode, and I suspect in-process decode performance may become a bottleneck for some use-cases.
Yeah, I think there are many good reasons for moving away from LEB128 in the long run. In-process decoding is an excellent reason. If we could find an approach that doesn't increase data volumes or encoding overhead, it'd be an absolute no-brainer. But if that's not possible we just need to be careful that we're hitting a good balance.
Attendees: @mknyszek @thepudds @nsrip-dd @felixge @prattmic @bboreham @cagedmantis @irfansharif
Regarding representative traces, I attach two traces from the Open Source monitoring system Prometheus: prom.trace1.gz prom.trace2.gz
Both cover 5 seconds; the first one is from a small Prometheus and the second from a much bigger instance. Let me know if something longer/bigger/different would help more.
Attendees: @mknyszek @rhysh @bboreham @thepudds @aclements @nsrip-dd @dominikh @prattmic
Attendees: @mknyszek @thepudds @aclements
Before the notes, I'd like to note that this sync barely happened. I showed up 15 minutes late (very sorry about that), though it seems like many of us were unavailable for the sync anyway, so perhaps it was just not meant to be. For anyone that showed up in those first 15 minutes but left confused, I'm very sorry about that; that's a mistake on my part. Austin and I joined late and briefly discussed some topics with thepudds, but the sync did not run very long at all.
Attendees: @mknyszek @prattmic @bboreham @leitzler @nsrip-dd @rhysh @felixge @irfansharif
Attendees: @aclements @mknyszek @prattmic @nsrip-dd @rhysh @felixge
go tool trace. Does that work? go tool pprof does have a -raw flag (and a -traces flag?) for producing a text format.
With the go tool trace UI it would currently be really onerous, even for little things like just the latency histograms. Text format would help.

Attendees: @mknyszek @aclements @prattmic @felixge @nsrip-dd @bboreham @rhysh
comparable).
I'm leaving this here as it's somewhat related and possibly of interest: user_events: Enable user processes to create and write to trace events
@mknyszek regarding our earlier conversation about EvGoSysCall and cgo calls emitting those events, I've noticed that in Go 1.21, calls to runtime.gcStart also emit EvGoSysCall.
Maybe it's best if we clean that up after all and make sure only actual syscalls emit that event. If we later decide that Cgo calls deserve to be included, too, they could get their own event type.
I've noticed that in Go 1.21, calls to runtime.gcStart also emit EvGoSysCall.
Huh... Thanks for pointing that out. That's a bit surprising to me; maybe a bug? Can you share the runtime part of the stack trace for the event? I'm curious as to where it's getting emitted. (Sounds benign, but maybe not!)
+1 to cgo calls just having their own event type, though. Alternatively, the syscall event could become a more general "we're leaving Go" event with a subtype byte (since otherwise we'd have to copy the complexity of the syscall event's sequencing into the cgo events).
runtime.gcStart
/home/dominikh/prj/go/src/runtime/mgc.go:667
runtime.mallocgc
/home/dominikh/prj/go/src/runtime/malloc.go:1242
runtime.newobject
/home/dominikh/prj/go/src/runtime/malloc.go:1324
runtime/trace.StartRegion
/home/dominikh/prj/go/src/runtime/trace/annotation.go:158
main.fn1
/home/dominikh/prj/src/example.com/foo.go:20
Where foo.go:20 is rg := trace.StartRegion(context.Background(), "foo region").
The Go version was go version devel go1.21-e6ec2a34dc Thu Jul 6 20:35:47 2023 +0000 linux/amd64
Thanks! The source of that is the notetsleepg in gcBgMarkStartWorkers, which calls entersyscallblock. This case is kind of interesting, because it is a blocking syscall (the thread blocks in FUTEX_WAIT, but the goroutine gets descheduled and its P is released). It's just one invoked by the runtime itself.
I'd argue maybe this one is working as intended? I see two possible problems here; one is how the stack gets cut off at the runtime's syscall entry: that makes sense for something like the syscall package, but it's not right for this case, since the stack trace makes it appear as if the GC somehow represents a syscall when it's just a side-effect.
Maybe we capture more stack frames? That makes noisiness potentially worse, but in this case I think it would help a lot to identify "OK, it's just the GC spinning up workers and waiting for them to start" from the extra stack frames. TBH, hiding the extra stack frames and/or reducing noise from runtime-specific information seems more like a presentation problem, and I'd be inclined to hide fewer frames in the trace itself (and leave it up to the UI to hide things), but I'm biased since those frames are ~always useful to me personally. :)
(From a performance perspective, I doubt including those frames would hurt much. Stack traces are likely to be just as dedup-able and the extra few frames would only increase trace size slightly.)
Based on your analysis I agree that this is working as intended. I'm also in favor of capturing more stack frames. With a proper stack, this event might even be useful to users, since it's a possible source of latency. And if it's too noisy, the UI should allow filtering it away.
TBH hiding the extra stack frames and/or reducing noise from runtime-specific information seems more like a presentation problem
It wouldn't be the first time that Gotraceui analyses stack traces, either. I already do so to walk backwards out of the runtime for events such as blocking channel sends (the user doesn't care that this happens in runtime.chansend1).
Attendees: @mknyszek @prattmic @rhysh @bboreham @dominikh
Just a small comment,
Rhys: One problem is that they look like they're reentrant until they're not. A useful tool would be to identify if you're assuming the lock is reentrant.
https://github.com/sasha-s/go-deadlock does this (at runtime) by recording and comparing the stack frames seen for each (manually instrumented) mutex (very expensive, we only use it in nightly stress testing). Of course this requires observing a re-entrant lock acquisition. It seems difficult to do this sort of thing statically, but even doing it cheaply enough to have it always on would be helpful.
@tbg We discussed detection briefly. It is unclear to me how to do robust deadlock detection given that lock and unlock are not required to be on the same goroutine. e.g., consider:
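The concrete steps aren't reproduced above, so here is a plausible sketch of the kind of sequence being described (my reconstruction, not the original example): goroutine 1 locks a mutex and then blocks trying to lock it again, while goroutine 2 is the one responsible for unlocking it.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.Mutex

	mu.Lock() // step 1: goroutine 1 acquires the lock

	go func() {
		// step 2: goroutine 2 releases the lock on goroutine 1's behalf
		// (sync.Mutex explicitly allows unlocking from a different goroutine).
		mu.Unlock()
	}()

	// step 3: goroutine 1 blocks here until goroutine 2 runs. A checker that
	// assumes lock and unlock happen on the same goroutine would see this as
	// a re-entrant acquisition and report a deadlock, but it is not one as
	// long as goroutine 2 is guaranteed to make progress.
	mu.Lock()
	fmt.Println("acquired again")
	mu.Unlock()
}
```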
Step 3 naively looks like a deadlock, but as long as goroutine 2 is guaranteed to make progress it is not. How does go-deadlock handle this situation?
It gives a false positive:
It's difficult to handle locks that pass between goroutines. Which is a shame, because it's also pretty rare.
Attendees: @mknyszek @rhysh @nsrip-dd @felixge @bboreham @dominikh
Attendees: @mknyszek @aclements @felixge @nsrip-dd @rhysh @dominikh
Attendees: @mknyszek @rhysh @bboreham
As the Go user base grows, more and more Go developers are seeking to understand the performance of their programs and reduce resource costs. However, they are locked into the relatively limited diagnostic tools we provide today. Some teams build their own tools, but right now that requires a large investment. This issue extends to the Go team as well, where we often put significant effort into ad-hoc performance tooling to analyze the performance of Go itself.
This issue is a tracking issue for improving the state of Go runtime diagnostics and its tooling, focusing primarily on runtime/trace traces and heap analysis tooling.

To do this work, we the Go team are collaborating with @felixge and @nsrip-dd and with input from others in the Go community. We currently have a virtual sync every 2 weeks (starting 2022-12-07), Thursdays at 11 AM NYC time. Please ping me at mknyszek -- at -- golang.org for an invite if you're interested in attending. This issue will be updated regularly with meeting notes from those meetings.

Below is what we currently plan to work on and explore, organized by broader effort and roughly prioritized. Note that this almost certainly will change as work progresses and more may be added.

Runtime tracing
* Tracing usability
* Tracing performance
* gentraceback refactoring (#54466) (CC @felixge, @aclements)

Heap analysis (see #57447)
* Get viewcore's internal core file libraries (gocore and core) to work with Go at tip.
* Ensure gocore and core are well-tested, and tested at tip.
* Make gocore and core externally-visible APIs, allowing Go developers to build on top of it.

CC @aclements @prattmic @felixge @nsrip-dd @rhysh @dominikh