proposal: x/debug/trace: add package for parsing execution traces

golang / go

The Go programming language

https://go.dev

BSD 3-Clause "New" or "Revised" License

123.46k stars 17.59k forks source link

proposal: x/debug/trace: add package for parsing execution traces #62627

Open mknyszek opened 1 year ago

mknyszek commented 1 year ago

As part of #60773 (tracking issue #57175) I've been working on a new parser for the execution tracer, and to save work down the line I've also been trying to come up with a nice API that would work well.

The goals of such an API would be to:

Enable custom analyses of traces.
Allow for streaming traces and performing on-line analyses.
Be both backward- and forward- compatible with new Go versions and trace wire formats.

Here are the requirements I have for such an API:

The API should be stream-based, regardless of whether the underlying trace format supports it.
Parsing a single trace in parallel is unsupported (part of the point of parsing is to serialize the stream anyway).
The API should expose some model of the Go scheduler explicitly. (This comes from an insight from @rhysh about the internal/trace API that doesn't actually expose e.g. the state machine goroutines go through (even though it already has to know about it), leaving users to figure it out again.)
The API should at least allow for an extension for efficient random access from local storage (e.g. loading from an mmap'd trace file).

Any such API is likely to have a fairly large surface. Rather than try to list the full API and debate its merits, we should experiment.

https://go.dev/cl/527875 is my first attempt at such an experiment to support the work done for #60773. The implementation is not quite fully baked (there may be bugs, and there's still some validation missing) but it already seems to work for non-trivial traces produced by https://go.dev/cl/494187 (with GOEXPERIMENT=exectracer2). (I'll update this section when it's ready, including with instructions on how to try it out.)

I plan to take this API and vendor it and/or copy it into the standard library to work as the implementation for supporting new traces in internal/trace. That'll help move #60773 along but it'll also give us some experience with using the API.

Once I get this API landed in x/exp, and the first cut of the new tracer for #60773, I'd like to invite people to experiment to both shake out bugs but also to identify deficiencies in the API.

For the proposal committee: there's nothing actionable here yet, but I figured I'd create an issue to track work and start a discussion.

Related to #62461.

CC @prattmic @aclements @dominikh @felixge @nsrip-dd @bboreham @rhysh

gopherbot commented 10 months ago

Change https://go.dev/cl/544655 mentions this issue: trace: export trace parsing API from std

mknyszek commented 10 months ago

I uploaded CL of a decent first cut to golang.org/x/exp: https://go.dev/cl/544655.

@felixge and I got some experience using this to port cmd/trace over to the new trace parser, and here are some of my notes so far:

Syscalls should probably not be a state of a goroutine, but rather an EventRange set of events. Correct emission of syscall ranges really depended on treating them identically to EventRange ranges.
Detecting blocking syscalls is annoying if you don't know what to look for (proc goes idle while its goroutine is in a syscall). I think if we make syscalls EventRange events, then we can emit an EventStateTransition of GoRunning->GoWaiting when the goroutine blocks in a syscall. Of course, we'd still have the proc going idle, but the explicit signal would simplify a bunch of analysis.
EventRangeActive events are necessary to avoid loss of information, but using them correctly is a little hard. Arguably though, it's not any harder than dealing with GoUndetermined->GoWhatever transitions since the analysis needs to do the same kind of "this was true since the last sync" accounting. Perhaps just examples and documentation will suffice.
The new tracer plays it a little loose with GOMAXPROCS. It treats it like a sampled metric rather than something with any semantic meaning. Maybe that's OK if we guarantee it shows up in each generation (like it does now), but it would probably be better if we had an explicit signal.
The range event names and state transition reasons are strings, so any analysis that is looking for something very particular will quickly end up depending on the strings. At the same time, the strings were chosen so that backwards compatibility is easier to maintain and we don't have to deprecate stale enum values or something in the future. Maybe it's fine, but I think the decision is worth revisiting.

gopherbot commented 10 months ago

Change https://go.dev/cl/546026 mentions this issue: doc: add release notes for the new execution tracer

gopherbot commented 10 months ago

Change https://go.dev/cl/546737 mentions this issue: trace: add script to generate experimental API from a Go checkout

mknyszek commented 10 months ago

The parsing API has now landed and will work with Go at tip. I'll update it when changes are made at tip.

dominikh commented 9 months ago

Overall I really like the API. Here are the things I've noticed are missing or confusing (in addition to the things we've discussed before):

No way to get the version of the trace. Since it's possible for trace versions to change semantics without adding new event types, it's useful for tools to know which version they're dealing with, even if most things are self-describing.
No way to get the start time of the trace. With the monotonic clock, it's useful to have a base timestamp. Right now, the only option is to get the timestamp of the first event, but that can be later than the trace start, and this might matter to users. Getting the timestamps of the EvBatch events is probably sufficient.
When/if you implement seeking or mmap, it'd be useful to be able to get an approximation of the number of events in the trace before reading all events with ReadEvent. I'd need that for progress bars in Gotraceui.
It's hard to tell which Ranges are actually local to their resource. I've seen STW starts being attributed to goroutines, but STW is of course process-wide. We can map from their names, but that's not very self-describing.

dominikh commented 9 months ago

Stack.Frames doesn't follow the latest state of the rangefunc proposal. At a minimum, it should lose the bool return value so that it's actually a valid iterator. Also, the proposal seems to suggest that methods shouldn't themselves be iterators, but return iterators. That is, func (Stack) Frames() func(yield func(StackFrame) bool). (i.e. func (Stack) Frames() iter.Seq[StackFrame], but without depending on the iter package.)

I also don't think that having an iterator is sufficient API. In Gotraceui, we want to index into stacks, preferably without being O(n). Judging from the current data structures involved, it should be possible to add Stack.At and Stack.Len methods.

dominikh commented 9 months ago

I'm somewhat worried about Label, its current implementation, and its semantics.

It seems that in the general sense, labels are able to associate a string with a resource, which is valid from the point of the Label event until the next Label event.

However, its only current use is to replicate the old GoStartLabel event, which had much tighter semantics than GoStart + Label, and was easier to use, for the two following reasons:

With GoStartLabel it was clear that the label described the kind of work being done by the goroutine for the duration of that running state, until the next state transition. With GoStart + Label, it is not obvious whether the label is only valid until the next state transition or not. Furthermore, the GoStart and Label events will have different timestamps. Does the goroutine have no label for that short duration? Or, if Label is valid until the next Label, does the goroutine still have the old label for that short duration?

In Gotraceui, we use labels as part of span labels, where each state transition, and some regions, create spans. Do I need a tiny span between GoStart and GoStartLabel? I don't for the current use of Label, which is to describe the kind of GC work being done by background workers. But what about future uses of labels?
It's useful to be able to refer to a single Event to describe a region, range, or state transition. In Gotraceui, the Span type refers to a single event and with the old parser, this provided all the information that one needed to know. Now I'd also have to carry around the label somehow. It'd be much nicer to have the information that Label provides be directly on the state transition instead, either as the reason (in the sense that a goroutine is transitioning to Running because it has to do GC (dedicated) work), or as the Label method working on state transition events.

dominikh commented 9 months ago

The range event names and state transition reasons are strings, so any analysis that is looking for something very particular will quickly end up depending on the strings. At the same time, the strings were chosen so that backwards compatibility is easier to maintain and we don't have to deprecate stale enum values or something in the future. Maybe it's fine, but I think the decision is worth revisiting.

Arguably having a list of enum values with documentation on which Go versions they are possible for is better for backwards compatibility than arbitrary strings. At the same time, strings provide for better forwards compatibility.

Not specific to the parser and more to tracing in general, but: can we rename traceBlockNet to traceBlockIO and use a reason that's not "network"? It's fairly confusing since the "netpoller" isn't limited to network I/O anymore. The runtime internal wait reason is called waitReasonIOWait

Syscalls should probably not be a state of a goroutine, but rather an EventRange set of events. Correct emission of syscall ranges really depended on treating them identically to EventRange ranges.

I think they make more sense as a state than a range. Syscalls are little different from blocking on other things, even if that might not be true for the implementation.

dominikh commented 9 months ago

I don't know if this is a quirk of the parser or the runtime, but this sequence of events is fairly unintuitive:

M=1125105 P=24 G=80 StateTransition Time=704285954576512 Resource=Goroutine(80) Reason="system goroutine wait" GoID=80 Running->Waiting
TransitionStack=
        runtime.gopark @ 0x43d8cd
                /home/dominikh/prj/go/src/runtime/proc.go:402
        runtime.gcBgMarkWorker @ 0x41e984
                /home/dominikh/prj/go/src/runtime/mgc.go:1310

Stack=
        runtime.gopark @ 0x43d8cd
                /home/dominikh/prj/go/src/runtime/proc.go:402
        runtime.gcBgMarkWorker @ 0x41e984
                /home/dominikh/prj/go/src/runtime/mgc.go:1310

M=1125128 P=8 G=-1 StateTransition Time=704285992235520 Resource=Proc(8) Reason="" ProcID=8 Running->Idle

M=1125124 P=8 G=80 StackSample Time=704285996324983
Stack=
        runtime.(*mspan).heapBitsSmallForAddr @ 0x417f1d
                /home/dominikh/prj/go/src/runtime/mbitmap_allocheaders.go:774
        runtime.(*mspan).typePointersOfUnchecked @ 0x4174d3
                /home/dominikh/prj/go/src/runtime/mbitmap_allocheaders.go:190
        runtime.scanobject @ 0x422e34
                /home/dominikh/prj/go/src/runtime/mgcmark.go:1446
        runtime.gcDrain @ 0x422793
                /home/dominikh/prj/go/src/runtime/mgcmark.go:1242
        runtime.gcDrainMarkWorkerDedicated @ 0x41ee04
                /home/dominikh/prj/go/src/runtime/mgcmark.go:1124
        runtime.gcBgMarkWorker.func2 @ 0x41edea
                /home/dominikh/prj/go/src/runtime/mgc.go:1387
        runtime.systemstack @ 0x472829
                /home/dominikh/prj/go/src/runtime/asm_amd64.s:509
        runtime.gcBgMarkWorker @ 0x41ea91
                /home/dominikh/prj/go/src/runtime/mgc.go:1370
        runtime.goexit @ 0x4747e0
                /home/dominikh/prj/go/src/runtime/asm_amd64.s:1695

Proc 8 and goroutine 80 are supposedly Idle and Waiting respectively when the stack sample is taken. Why does the event refer to them?

dominikh commented 9 months ago

I'm not sure if I should file issues for actual bugs, or keep them here so everything is in one place, but: the new parser crashes when it encounters a region end without a corresponding region begin. Such a trace can be constructed in the following way, which can probably happen in real code due to races, too:

func main() {
    trace.Start(io.Discard)
    r := trace.StartRegion(context.Background(), "my region")
    trace.Stop()
    trace.Start(os.Stdout)
    r.End()
    trace.Stop()
}

panic: runtime error: index out of range [-1]

goroutine 1 [running]:
internal/trace/v2.(*gState).endRegion(0x851e60?, {0x0, {0xc000013200, 0x9}})
    /home/dominikh/prj/go/src/internal/trace/v2/order.go:920 +0x191
internal/trace/v2.(*ordering).advance(0xc000206050, 0xc0000382f0, 0xc0000e5900, 0x1a1a0c, 0x2)
    /home/dominikh/prj/go/src/internal/trace/v2/order.go:659 +0x6bd1
internal/trace/v2.(*Reader).ReadEvent(0xc000206000)
    /home/dominikh/prj/go/src/internal/trace/v2/reader.go:161 +0x74a
cmd/trace/v2.debugProcessedEvents({0xc20aa0?, 0xc0000b2960?})
    /home/dominikh/prj/go/src/cmd/trace/v2/main.go:166 +0xf8
cmd/trace/v2.Main({0x7fff3054ca86?, 0xc0000222b0?}, {0x8c5dcb, 0xb}, {0xc00010fe10?, 0x7fa1ac?}, 0x1)
    /home/dominikh/prj/go/src/cmd/trace/v2/main.go:34 +0x11f8
main.main()
    /home/dominikh/prj/go/src/cmd/trace/main.go:87 +0xa4e

mknyszek commented 9 months ago

It's hard to tell which Ranges are actually local to their resource. I've seen STW starts being attributed to goroutines, but STW is of course process-wide. We can map from their names, but that's not very self-describing.

STW starts being attributed to goroutines is a little unintuitive but it's intentional. Some goroutine is responsible for stopping the world and "owns" that STW. This is useful for attributing e.g. ReadMemStats calls to specific goroutines. It's up to the caller to understand that this means something globally. On the other hand, GC ranges are scoped globally because a different goroutine might start/stop the GC, so it's much less useful to do any attribution.

No way to get the version of the trace. Since it's possible for trace versions to change semantics without adding new event types, it's useful for tools to know which version they're dealing with, even if most things are self-describing.

Huh, I figured this wouldn't be necessary with this new API. I tried to make it sufficiently general that it could accommodate most of not all future semantics changes. Could you describe in more detail what you had in mind? Do you mean the string names of ranges and such?

Stack.Frames doesn't follow the latest state of the rangefunc proposal.

Oops. Thought I dropped that. Thanks, I'll fix it.

I also don't think that having an iterator is sufficient API. In Gotraceui, we want to index into stacks, preferably without being O(n).

The intent of the iterator API is for the caller to copy out the stacks if they want. The purpose of the iterator is that they can be computed lazily. If it returned a slice of stack frames or allowed arbitrary indexing, those implementations might O(n) themselves which I'd like to avoid.

The Stack API is also designed to allow for efficient caching of this copying. You can create something like map[trace.Stack][]trace.StackFrame as the cache (in fact, the new cmd/trace does exactly this for profile creation).

I don't know if this is a quirk of the parser or the runtime, but this sequence of events is fairly unintuitive:

I agree it's unintuitive but it's hard to do anything else without even more complex synchronization involving signal handlers. :( EventStackSample events are documented as "best effort" and may not always line up with the states produced by the parser.

the new parser crashes when it encounters a region end without a corresponding region begin. Such a trace can be constructed in the following way, which can probably happen in real code due to races, too:

Ahhh, that's just a bug that's also present on tip. Can you please file a new issue for that so I can track it for the release? I'm a bit surprised that happens just because we do have a test for such a case, but it's probably just some simple error on my part.

dominikh commented 9 months ago

It's up to the caller to understand that this means something globally

But is that data the package should provide, instead of forcing everyone to implement it? Especially without the enums, one currently has to read the runtime source code to know of all the ranges that exist.

Huh, I figured this wouldn't be necessary with this new API. I tried to make it sufficiently general that it could accommodate most of not all future semantics changes. Could you describe in more detail what you had in mind? Do you mean the string names of ranges and such?

I might've jumped the gun here. We can certainly add it once it's needed. I was mostly thinking of the semantics of state transitions. It's true that the new API handles a lot of the state machine, but the different states still have meaning that tools might interpret, and these semantics can change. Of course I won't be able to provide an example until it happens.

The intent of the iterator API is for the caller to copy out the stacks if they want.

I guess that works and the overall memory usage won't be too terrible, although it is annoying to have to duplicate the data in the cases when it's already available in memory at O(1), like it is right now.

The Stack API is also designed to allow for efficient caching of this copying.

I can see how that would be simple in the "load all events into memory" use case, but once we support seeking and incremental (re)parsing, the lack of weak maps will make it hard to drop stacks and the evTables they refer to.

Can you please file a new issue for that so I can track it for the release?

Will do.

mknyszek commented 9 months ago

It's up to the caller to understand that this means something globally

But is that data the package should provide, instead of forcing everyone to implement it? Especially without the enums, one currently has to read the runtime source code to know of all the ranges that exist.

Yeah, good point. My thought was that by the time the range bubbled up to the user, the "stop-the-world" would be evident to a human. But you're right if tools want to do something with it, it's much harder. I wonder if there's a nicer middle-ground between enums and strings. Perhaps more structured strings, in the same vein as (but not the same format as) runtime/metrics?

I might've jumped the gun here. We can certainly add it once it's needed. I was mostly thinking of the semantics of state transitions. It's true that the new API handles a lot of the state machine, but the different states still have meaning that tools might interpret, and these semantics can change. Of course I won't be able to provide an example until it happens.

FWIW, I fully intended to commit to very specific semantics for Runnable/Running/Waiting. These are concepts that will continue to exist in perpetuity and I'm fine with forcing the parser to do any translation, if say we want to change the semantics of the low-level events for some performance optimization or something.

I see your point with something like Syscall, but that ties into why I want to make it a range type instead of a state. I expect us to have more flexibility there.

I guess that works and the overall memory usage won't be too terrible, although it is annoying to have to duplicate the data in the cases when it's already available in memory at O(1), like it is right now.

True. We could provide a convenience function, but I'm worried that it'll be relied upon. :( Or, the package could handle the caching itself, but as you point out below...

I can see how that would be simple in the "load all events into memory" use case, but once we support seeking and incremental (re)parsing, the lack of weak maps will make it hard to drop stacks and the evTables they refer to.

Yep, that is a really good point and does make this format less useful. Long-term I don't think weak-keyed maps are off the table, and they would be a great for this, but it's not going to happen overnight. I dunno.

dominikh commented 8 months ago

No way to get the version of the trace. Since it's possible for trace versions to change semantics without adding new event types, it's useful for tools to know which version they're dealing with, even if most things are self-describing.

Huh, I figured this wouldn't be necessary with this new API. I tried to make it sufficiently general that it could accommodate most of not all future semantics changes. Could you describe in more detail what you had in mind? Do you mean the string names of ranges and such?

A concrete case: old traces will not have useful information on Ms, while new traces do. I'd like to detect that and not display per-M timelines.

Edit: That's no longer the case, so I'm back to no concrete reasons.

dominikh commented 8 months ago

Michael and I discussed the nature of Ps and came to the conclusion that the core of the API might want to focus on Gs and Ms instead, leaving Ps as an implementation detail that takes additional effort to get to.

The way Ps are currently modeled in the API, they're higher level than the way Ps actually behave in the runtime, by for example not modeling the _Psyscall state explicitly. At the same time, the API emits transitions for Ps that encode P stealing, which requires consumers to be aware of _Psyscall anyway, as that's when Ps can be stolen. They also have to be aware of P stealing because otherwise these transitions look illogical.

We could model Ps more accurately and add the _Psyscall state, but that's an implementation detail through and through, and it wouldn't simplify any of the details around P stealing, it would only make the state machine slightly more accurate and explicit.

Discussing this raised the point that the entire concept of Ps is an implementation detail, and what most users really care about are goroutines running on OS threads, not Ps, which in an abstract sense are just tokens, and in a concrete sense primarily matter to people working on the runtime. In fact, with the old tracer, tools already treated Ps as if they were kinda-sorta Ms, just abstracted away from OS state.

To make the API easier to use for the majority of users, we could remove Ps, and move events pertaining to Ps to Ms instead. That is, Ms would be available to run / be running goroutines.

We'd still provide access to the per-P events from the underlying trace data, via an implementation-specific layer, without the guarantee that this data's shape won't change between Go versions. StateTransitions would gain additional data that allows linking between the high-level information and implementation-specific data, e.g. via P IDs.

This would allow most people to focus entirely on Ms and Gs, without having to understand the behavior of Ps to reconstruct the behavior of Ms, while providing an escape hatch to those who actually do care about Ps in their whole glory, P stealing and all.

gopherbot commented 7 months ago

Change https://go.dev/cl/566076 mentions this issue: trace: regenerate experimental API from go@d892cb4

aclements commented 2 months ago

I've been writing some tooling on top of this, too. Overall I've been happy with most of the API except for the experimental batch API. It may be that we just don't make this part of the API public, since I'm not sure it only makes sense to use unless you're also hacking on the runtime.

My main issue with the experimental batch API is that an ExperimentalData is attached to every ExperimentalEvent; however, it has no particular relationship to the event. This ExperimentalData is simply all of the batches for that experimental that are in the same generation as the event. This means a consumer needs to check if it's already parsed this ExperimentalData on every event that it processes because the same ExperimentalData will probably be attached to many, many events. At the same time, the consumer doesn't know when a generation changes, so it always has to be ready for a new ExperimentalData to appear. I'm not actually sure if it's safe to throw out the previously parsed ExperimentalData when you see a new one, or if the old and new can be interleaved for a while. If this isn't safe, then the consumer has to keep around old, potentially very large, parsed data because it doesn't know when it can be discarded.

A related issue is that each experiment only gets one stream of experimental batches. It's possible to work around this by putting a header of your own on each experimental batch, but it's pretty annoying to need another layer of demultiplexing, and this interacts poorly with the way a consumer has to be ready to parse all of the batches the moment a new ExperimentalData shows up.

I think what I want instead is a way to attach a potentially large amount of byte data to a particular event, but it's okay if the constant space overhead of that is somewhat high, so you have to attach a decent number of bytes to amortize the cost of the reference from the event.

In my case, I'm recording metadata about each heap span during sweep termination, and then all of the pointers followed during GC scanning. I initially did this as a ton of small events, but that was pretty inefficient and also screwed up trace timing, so I switched to using experimental batches and emitting far fewer events. It would have been much nicer if I could emit one event for "here are all of the spans" (or maybe a few events to split things up) and a lower rate of "here's a bunch of scanned pointers" events.