golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: x/debug/trace: add package for parsing execution traces #62627

Open mknyszek opened 1 year ago

mknyszek commented 1 year ago

As part of #60773 (tracking issue #57175) I've been working on a new parser for the execution tracer, and to save work down the line I've also been trying to come up with a nice API that would work well.

The goals of such an API would be to:

Here are the requirements I have for such an API:

Any such API is likely to have a fairly large surface. Rather than try to list the full API and debate its merits, we should experiment.

https://go.dev/cl/527875 is my first attempt at such an experiment to support the work done for #60773. The implementation is not quite fully baked (there may be bugs, and there's still some validation missing) but it already seems to work for non-trivial traces produced by https://go.dev/cl/494187 (with GOEXPERIMENT=exectracer2). (I'll update this section when it's ready, including with instructions on how to try it out.)

I plan to take this API and vendor it and/or copy it into the standard library to work as the implementation for supporting new traces in internal/trace. That'll help move #60773 along but it'll also give us some experience with using the API.

Once I get this API landed in x/exp, and the first cut of the new tracer for #60773, I'd like to invite people to experiment to both shake out bugs but also to identify deficiencies in the API.

For the proposal committee: there's nothing actionable here yet, but I figured I'd create an issue to track work and start a discussion.

Related to #62461.

CC @prattmic @aclements @dominikh @felixge @nsrip-dd @bboreham @rhysh

gopherbot commented 1 year ago

Change https://go.dev/cl/544655 mentions this issue: trace: export trace parsing API from std

mknyszek commented 1 year ago

I uploaded a CL with a decent first cut to golang.org/x/exp: https://go.dev/cl/544655.

@felixge and I got some experience using this to port cmd/trace over to the new trace parser, and here are some of my notes so far:

gopherbot commented 11 months ago

Change https://go.dev/cl/546026 mentions this issue: doc: add release notes for the new execution tracer

gopherbot commented 11 months ago

Change https://go.dev/cl/546737 mentions this issue: trace: add script to generate experimental API from a Go checkout

mknyszek commented 11 months ago

The parsing API has now landed and will work with Go at tip. I'll update it when changes are made at tip.

dominikh commented 11 months ago

Overall I really like the API. Here are the things I've noticed are missing or confusing (in addition to the things we've discussed before):

dominikh commented 11 months ago

Stack.Frames doesn't follow the latest state of the rangefunc proposal. At a minimum, it should lose the bool return value so that it's actually a valid iterator. Also, the proposal seems to suggest that methods shouldn't themselves be iterators, but return iterators. That is, func (Stack) Frames() func(yield func(StackFrame) bool). (i.e. func (Stack) Frames() iter.Seq[StackFrame], but without depending on the iter package.)

I also don't think that having an iterator is sufficient API. In Gotraceui, we want to index into stacks, preferably without being O(n). Judging from the current data structures involved, it should be possible to add Stack.At and Stack.Len methods.
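A minimal sketch of the shape being suggested, using stand-in types since the real ones live in golang.org/x/exp/trace and their internals aren't shown here:

```go
package main

import "fmt"

// Stand-ins for trace.StackFrame and trace.Stack; the backing
// slice is a hypothetical representation for illustration only.
type StackFrame struct {
	Func string
}

type Stack struct {
	frames []StackFrame
}

// Frames returns an iterator rather than being one itself, matching
// the rangefunc proposal's recommended shape (equivalent to
// iter.Seq[StackFrame], without depending on the iter package).
func (s Stack) Frames() func(yield func(StackFrame) bool) {
	return func(yield func(StackFrame) bool) {
		for _, f := range s.frames {
			if !yield(f) {
				return
			}
		}
	}
}

// At and Len allow O(1) indexing, as requested for Gotraceui.
func (s Stack) At(i int) StackFrame { return s.frames[i] }
func (s Stack) Len() int            { return len(s.frames) }

func main() {
	s := Stack{frames: []StackFrame{{"main.main"}, {"runtime.goexit"}}}
	s.Frames()(func(f StackFrame) bool {
		fmt.Println(f.Func)
		return true
	})
	fmt.Println(s.Len(), s.At(0).Func)
}
```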

dominikh commented 11 months ago

I'm somewhat worried about Label, its current implementation, and its semantics.

It seems that in the general sense, labels are able to associate a string with a resource, which is valid from the point of the Label event until the next Label event.

However, its only current use is to replicate the old GoStartLabel event, which had much tighter semantics than GoStart + Label, and was easier to use, for the two following reasons:

dominikh commented 11 months ago

The range event names and state transition reasons are strings, so any analysis that is looking for something very particular will quickly end up depending on the strings. At the same time, the strings were chosen so that backwards compatibility is easier to maintain and we don't have to deprecate stale enum values or something in the future. Maybe it's fine, but I think the decision is worth revisiting.

Arguably having a list of enum values with documentation on which Go versions they are possible for is better for backwards compatibility than arbitrary strings. At the same time, strings provide for better forwards compatibility.

Not specific to the parser and more to tracing in general, but: can we rename traceBlockNet to traceBlockIO and use a reason that's not "network"? It's fairly confusing since the "netpoller" isn't limited to network I/O anymore. The runtime internal wait reason is called waitReasonIOWait

Syscalls should probably not be a state of a goroutine, but rather an EventRange set of events. Correct emission of syscall ranges really depended on treating them identically to EventRange ranges.

I think they make more sense as a state than a range. Syscalls are little different from blocking on other things, even if that might not be true for the implementation.

dominikh commented 11 months ago

I don't know if this is a quirk of the parser or the runtime, but this sequence of events is fairly unintuitive:

M=1125105 P=24 G=80 StateTransition Time=704285954576512 Resource=Goroutine(80) Reason="system goroutine wait" GoID=80 Running->Waiting
TransitionStack=
        runtime.gopark @ 0x43d8cd
                /home/dominikh/prj/go/src/runtime/proc.go:402
        runtime.gcBgMarkWorker @ 0x41e984
                /home/dominikh/prj/go/src/runtime/mgc.go:1310

Stack=
        runtime.gopark @ 0x43d8cd
                /home/dominikh/prj/go/src/runtime/proc.go:402
        runtime.gcBgMarkWorker @ 0x41e984
                /home/dominikh/prj/go/src/runtime/mgc.go:1310

M=1125128 P=8 G=-1 StateTransition Time=704285992235520 Resource=Proc(8) Reason="" ProcID=8 Running->Idle

M=1125124 P=8 G=80 StackSample Time=704285996324983
Stack=
        runtime.(*mspan).heapBitsSmallForAddr @ 0x417f1d
                /home/dominikh/prj/go/src/runtime/mbitmap_allocheaders.go:774
        runtime.(*mspan).typePointersOfUnchecked @ 0x4174d3
                /home/dominikh/prj/go/src/runtime/mbitmap_allocheaders.go:190
        runtime.scanobject @ 0x422e34
                /home/dominikh/prj/go/src/runtime/mgcmark.go:1446
        runtime.gcDrain @ 0x422793
                /home/dominikh/prj/go/src/runtime/mgcmark.go:1242
        runtime.gcDrainMarkWorkerDedicated @ 0x41ee04
                /home/dominikh/prj/go/src/runtime/mgcmark.go:1124
        runtime.gcBgMarkWorker.func2 @ 0x41edea
                /home/dominikh/prj/go/src/runtime/mgc.go:1387
        runtime.systemstack @ 0x472829
                /home/dominikh/prj/go/src/runtime/asm_amd64.s:509
        runtime.gcBgMarkWorker @ 0x41ea91
                /home/dominikh/prj/go/src/runtime/mgc.go:1370
        runtime.goexit @ 0x4747e0
                /home/dominikh/prj/go/src/runtime/asm_amd64.s:1695

Proc 8 and goroutine 80 are supposedly Idle and Waiting respectively when the stack sample is taken. Why does the event refer to them?

dominikh commented 11 months ago

I'm not sure if I should file issues for actual bugs, or keep them here so everything is in one place, but: the new parser crashes when it encounters a region end without a corresponding region begin. Such a trace can be constructed in the following way, which can probably happen in real code due to races, too:

package main

import (
    "context"
    "io"
    "os"
    "runtime/trace"
)

func main() {
    trace.Start(io.Discard)
    r := trace.StartRegion(context.Background(), "my region")
    trace.Stop()
    trace.Start(os.Stdout)
    r.End()
    trace.Stop()
}
panic: runtime error: index out of range [-1]

goroutine 1 [running]:
internal/trace/v2.(*gState).endRegion(0x851e60?, {0x0, {0xc000013200, 0x9}})
    /home/dominikh/prj/go/src/internal/trace/v2/order.go:920 +0x191
internal/trace/v2.(*ordering).advance(0xc000206050, 0xc0000382f0, 0xc0000e5900, 0x1a1a0c, 0x2)
    /home/dominikh/prj/go/src/internal/trace/v2/order.go:659 +0x6bd1
internal/trace/v2.(*Reader).ReadEvent(0xc000206000)
    /home/dominikh/prj/go/src/internal/trace/v2/reader.go:161 +0x74a
cmd/trace/v2.debugProcessedEvents({0xc20aa0?, 0xc0000b2960?})
    /home/dominikh/prj/go/src/cmd/trace/v2/main.go:166 +0xf8
cmd/trace/v2.Main({0x7fff3054ca86?, 0xc0000222b0?}, {0x8c5dcb, 0xb}, {0xc00010fe10?, 0x7fa1ac?}, 0x1)
    /home/dominikh/prj/go/src/cmd/trace/v2/main.go:34 +0x11f8
main.main()
    /home/dominikh/prj/go/src/cmd/trace/main.go:87 +0xa4e
mknyszek commented 11 months ago

It's hard to tell which Ranges are actually local to their resource. I've seen STW starts being attributed to goroutines, but STW is of course process-wide. We can map from their names, but that's not very self-describing.

STW starts being attributed to goroutines is a little unintuitive but it's intentional. Some goroutine is responsible for stopping the world and "owns" that STW. This is useful for attributing e.g. ReadMemStats calls to specific goroutines. It's up to the caller to understand that this means something globally. On the other hand, GC ranges are scoped globally because a different goroutine might start/stop the GC, so it's much less useful to do any attribution.

No way to get the version of the trace. Since it's possible for trace versions to change semantics without adding new event types, it's useful for tools to know which version they're dealing with, even if most things are self-describing.

Huh, I figured this wouldn't be necessary with this new API. I tried to make it sufficiently general that it could accommodate most if not all future semantics changes. Could you describe in more detail what you had in mind? Do you mean the string names of ranges and such?

Stack.Frames doesn't follow the latest state of the rangefunc proposal.

Oops. Thought I dropped that. Thanks, I'll fix it.

I also don't think that having an iterator is sufficient API. In Gotraceui, we want to index into stacks, preferably without being O(n).

The intent of the iterator API is for the caller to copy out the stacks if they want. The purpose of the iterator is that they can be computed lazily. If it returned a slice of stack frames or allowed arbitrary indexing, those implementations might be O(n) themselves, which I'd like to avoid.

The Stack API is also designed to allow for efficient caching of this copying. You can create something like map[trace.Stack][]trace.StackFrame as the cache (in fact, the new cmd/trace does exactly this for profile creation).
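A sketch of that caching pattern, with stand-in types standing in for trace.Stack and trace.StackFrame (the real Stack is a small comparable handle, which is what makes it usable as a map key):

```go
package main

import "fmt"

// Stand-ins for the x/exp/trace types; frameTable is a hypothetical
// lazy frame source keyed by stack ID, for illustration only.
type StackFrame struct{ Func string }
type Stack struct{ id int }

var frameTable = map[int][]StackFrame{
	1: {{"main.main"}, {"runtime.goexit"}},
}

// cache mirrors the map[trace.Stack][]trace.StackFrame pattern the new
// cmd/trace uses for profile creation: copy the frames out once, then
// reuse that copy for every event carrying the same stack.
var cache = map[Stack][]StackFrame{}

func framesOf(s Stack) []StackFrame {
	if fs, ok := cache[s]; ok {
		return fs // cache hit: no second copy
	}
	fs := append([]StackFrame(nil), frameTable[s.id]...)
	cache[s] = fs
	return fs
}

func main() {
	s := Stack{id: 1}
	fmt.Println(len(framesOf(s)), len(cache)) // copies on first use
	fmt.Println(len(framesOf(s)), len(cache)) // reuses on second
}
```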

I don't know if this is a quirk of the parser or the runtime, but this sequence of events is fairly unintuitive:

I agree it's unintuitive but it's hard to do anything else without even more complex synchronization involving signal handlers. :( EventStackSample events are documented as "best effort" and may not always line up with the states produced by the parser.

the new parser crashes when it encounters a region end without a corresponding region begin. Such a trace can be constructed in the following way, which can probably happen in real code due to races, too:

Ahhh, that's just a bug that's also present on tip. Can you please file a new issue for that so I can track it for the release? I'm a bit surprised that happens just because we do have a test for such a case, but it's probably just some simple error on my part.

dominikh commented 11 months ago

It's up to the caller to understand that this means something globally

But is that data the package should provide, instead of forcing everyone to implement it? Especially without the enums, one currently has to read the runtime source code to know of all the ranges that exist.

Huh, I figured this wouldn't be necessary with this new API. I tried to make it sufficiently general that it could accommodate most if not all future semantics changes. Could you describe in more detail what you had in mind? Do you mean the string names of ranges and such?

I might've jumped the gun here. We can certainly add it once it's needed. I was mostly thinking of the semantics of state transitions. It's true that the new API handles a lot of the state machine, but the different states still have meaning that tools might interpret, and these semantics can change. Of course I won't be able to provide an example until it happens.

The intent of the iterator API is for the caller to copy out the stacks if they want.

I guess that works and the overall memory usage won't be too terrible, although it is annoying to have to duplicate the data in the cases when it's already available in memory at O(1), like it is right now.

The Stack API is also designed to allow for efficient caching of this copying.

I can see how that would be simple in the "load all events into memory" use case, but once we support seeking and incremental (re)parsing, the lack of weak maps will make it hard to drop stacks and the evTables they refer to.

Can you please file a new issue for that so I can track it for the release?

Will do.

mknyszek commented 10 months ago

It's up to the caller to understand that this means something globally

But is that data the package should provide, instead of forcing everyone to implement it? Especially without the enums, one currently has to read the runtime source code to know of all the ranges that exist.

Yeah, good point. My thought was that by the time the range bubbled up to the user, the "stop-the-world" would be evident to a human. But you're right that if tools want to do something with it, it's much harder. I wonder if there's a nicer middle ground between enums and strings. Perhaps more structured strings, in the same vein as (but not the same format as) runtime/metrics?
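Purely illustrative, since no such format exists yet: a structured name in that spirit might carry a machine-readable scope prefix ahead of the human-readable name, so tools can classify ranges without matching on full strings. The "scope:name" format and the example names below are assumptions, not anything from the API.

```go
package main

import (
	"fmt"
	"strings"
)

// parseRangeName splits a hypothetical "scope:name" range name into a
// machine-readable scope and a human-readable name, falling back to
// "unknown" for legacy unstructured names. The format is invented here
// purely to illustrate a middle ground between enums and free strings.
func parseRangeName(name string) (scope, kind string) {
	scope, kind, ok := strings.Cut(name, ":")
	if !ok {
		return "unknown", name
	}
	return scope, kind
}

func main() {
	scope, kind := parseRangeName("global:stop-the-world (read mem stats)")
	fmt.Println(scope, "|", kind)
	scope, kind = parseRangeName("GC mark assist") // unstructured name
	fmt.Println(scope, "|", kind)
}
```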

I might've jumped the gun here. We can certainly add it once it's needed. I was mostly thinking of the semantics of state transitions. It's true that the new API handles a lot of the state machine, but the different states still have meaning that tools might interpret, and these semantics can change. Of course I won't be able to provide an example until it happens.

FWIW, I fully intended to commit to very specific semantics for Runnable/Running/Waiting. These are concepts that will continue to exist in perpetuity and I'm fine with forcing the parser to do any translation, if say we want to change the semantics of the low-level events for some performance optimization or something.

I see your point with something like Syscall, but that ties into why I want to make it a range type instead of a state. I expect us to have more flexibility there.

I guess that works and the overall memory usage won't be too terrible, although it is annoying to have to duplicate the data in the cases when it's already available in memory at O(1), like it is right now.

True. We could provide a convenience function, but I'm worried that it'll be relied upon. :( Or, the package could handle the caching itself, but as you point out below...

I can see how that would be simple in the "load all events into memory" use case, but once we support seeking and incremental (re)parsing, the lack of weak maps will make it hard to drop stacks and the evTables they refer to.

Yep, that is a really good point and does make this format less useful. Long-term I don't think weak-keyed maps are off the table, and they would be a great fit for this, but it's not going to happen overnight. I dunno.

dominikh commented 10 months ago

No way to get the version of the trace. Since it's possible for trace versions to change semantics without adding new event types, it's useful for tools to know which version they're dealing with, even if most things are self-describing.

Huh, I figured this wouldn't be necessary with this new API. I tried to make it sufficiently general that it could accommodate most if not all future semantics changes. Could you describe in more detail what you had in mind? Do you mean the string names of ranges and such?

A concrete case: old traces will not have useful information on Ms, while new traces do. I'd like to detect that and not display per-M timelines.

Edit: That's no longer the case, so I'm back to no concrete reasons.

dominikh commented 10 months ago

Michael and I discussed the nature of Ps and came to the conclusion that the core of the API might want to focus on Gs and Ms instead, leaving Ps as an implementation detail that takes additional effort to get to.

The way Ps are currently modeled in the API, they're higher level than the way Ps actually behave in the runtime, by for example not modeling the _Psyscall state explicitly. At the same time, the API emits transitions for Ps that encode P stealing, which requires consumers to be aware of _Psyscall anyway, as that's when Ps can be stolen. They also have to be aware of P stealing because otherwise these transitions look illogical.

We could model Ps more accurately and add the _Psyscall state, but that's an implementation detail through and through, and it wouldn't simplify any of the details around P stealing, it would only make the state machine slightly more accurate and explicit.

Discussing this raised the point that the entire concept of Ps is an implementation detail, and what most users really care about are goroutines running on OS threads, not Ps, which in an abstract sense are just tokens, and in a concrete sense primarily matter to people working on the runtime. In fact, with the old tracer, tools already treated Ps as if they were kinda-sorta Ms, just abstracted away from OS state.

To make the API easier to use for the majority of users, we could remove Ps, and move events pertaining to Ps to Ms instead. That is, Ms would be available to run / be running goroutines.

We'd still provide access to the per-P events from the underlying trace data, via an implementation-specific layer, without the guarantee that this data's shape won't change between Go versions. StateTransitions would gain additional data that allows linking between the high-level information and implementation-specific data, e.g. via P IDs.

This would allow most people to focus entirely on Ms and Gs, without having to understand the behavior of Ps to reconstruct the behavior of Ms, while providing an escape hatch to those who actually do care about Ps in their whole glory, P stealing and all.

gopherbot commented 9 months ago

Change https://go.dev/cl/566076 mentions this issue: trace: regenerate experimental API from go@d892cb4

aclements commented 4 months ago

I've been writing some tooling on top of this, too. Overall I've been happy with most of the API except for the experimental batch API. It may be that we just don't make this part of the API public, since it arguably only makes sense to use if you're also hacking on the runtime.

My main issue with the experimental batch API is that an ExperimentalData is attached to every ExperimentalEvent; however, it has no particular relationship to the event. This ExperimentalData is simply all of the batches for that experiment that are in the same generation as the event. This means a consumer needs to check if it's already parsed this ExperimentalData on every event that it processes, because the same ExperimentalData will probably be attached to many, many events. At the same time, the consumer doesn't know when a generation changes, so it always has to be ready for a new ExperimentalData to appear. I'm not actually sure if it's safe to throw out the previously parsed ExperimentalData when you see a new one, or if the old and new can be interleaved for a while. If this isn't safe, then the consumer has to keep around old, potentially very large, parsed data because it doesn't know when it can be discarded.
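The per-event bookkeeping that forces might look roughly like this, with stand-in types for the experimental API (field names are assumptions, not the real x/exp/trace definitions):

```go
package main

import "fmt"

// Stand-ins for the experimental trace API types.
type ExperimentalData struct{ Batches [][]byte }
type ExperimentalEvent struct {
	Name string
	Data *ExperimentalData // same pointer attached to many events in a generation
}

// consumer shows the check described above: because the same
// ExperimentalData is attached to many events, the consumer must
// notice when a new one appears and parse its batches exactly once.
type consumer struct {
	lastParsed *ExperimentalData
	parses     int
}

func (c *consumer) handle(ev ExperimentalEvent) {
	if ev.Data != nil && ev.Data != c.lastParsed {
		c.parses++ // parse the generation's batches here
		c.lastParsed = ev.Data
	}
	// ...process the event itself...
}

func main() {
	gen1 := &ExperimentalData{}
	gen2 := &ExperimentalData{}
	var c consumer
	for _, ev := range []ExperimentalEvent{
		{"a", gen1}, {"b", gen1}, {"c", gen2}, {"d", gen2},
	} {
		c.handle(ev)
	}
	fmt.Println(c.parses) // one parse per generation, not per event
}
```

Note this sketch assumes old and new ExperimentalData are never interleaved, which is exactly the property the comment says is unclear; if they can interleave, a single lastParsed pointer is not enough.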

A related issue is that each experiment only gets one stream of experimental batches. It's possible to work around this by putting a header of your own on each experimental batch, but it's pretty annoying to need another layer of demultiplexing, and this interacts poorly with the way a consumer has to be ready to parse all of the batches the moment a new ExperimentalData shows up.

I think what I want instead is a way to attach a potentially large amount of byte data to a particular event, but it's okay if the constant space overhead of that is somewhat high, so you have to attach a decent number of bytes to amortize the cost of the reference from the event.

In my case, I'm recording metadata about each heap span during sweep termination, and then all of the pointers followed during GC scanning. I initially did this as a ton of small events, but that was pretty inefficient and also screwed up trace timing, so I switched to using experimental batches and emitting far fewer events. It would have been much nicer if I could emit one event for "here are all of the spans" (or maybe a few events to split things up) and a lower rate of "here's a bunch of scanned pointers" events.