Open mknyszek opened 9 months ago
CC @aclements @prattmic @felixge @nsrip-dd @dominikh @bboreham @rhysh
I agree with all of the advantages of having flight recording in the runtime.
As for
A shortcoming of this document's proposal is that there's no way to trigger a snapshot explicitly against trace data, only program logic
I think that program logic will be the far more useful trigger for the majority of use cases, and certainly easier to make use of. Program logic can also make use of all the exposed runtime metrics, bringing it closer to the power of triggering on trace data.
On top of that, this flight recorder will be much more straightforward to integrate with existing telemetry solutions, compared to external solutions, which will either not be designed for that (Gotraceui) or will have to be written first.
It opens the door to taking trace snapshots before a program crashes
Just to say the most frequent "I wish I knew what happened just before" for me is OOM, which isn't (AFAIK) something you can trigger on once the kernel makes its decision. Maybe we could trigger on hitting GOMEMLIMIT?
Nice!
How can we get CPU profile samples to show up in the flight-recorder data stream? The existing APIs for execution traces and CPU profiles have a start time and an end time, so those two pair well together. For this it seems that we'd need a way to ask for SIGPROF
deliveries during the whole time the flight recorder is running, without preventing normal on-demand use of runtime.StartCPUProfile
.
The panics in WriteTo
look like they'd force those to coordinate with calls to Stop
(to avoid the "flight recorder isn't running" panic). Could they instead return errors if there's another WriteTo
call, and end early (with an error) if there's a concurrent Stop
call? Panic risk like this makes me nervous; I think instrumentation in general should Do No Harm.
consisting of short partitions corresponding approximately to the duration of the last
WriteTo
call
You've said that the data from WriteTo
calls won't overlap. But on the other side, do the partitions in the new data format include identifiers (partition sequence number $X from thread/M $Y, and a list of all Ms?) that allows stitching together traces that are known to have no gaps from a set of shorter WriteTo
calls?
Just to say the most frequent "I wish I knew what happened just before" for me is OOM, which isn't (AFAIK) something you can trigger on once the kernel makes its decision. Maybe we could trigger on hitting GOMEMLIMIT?
It turns out that's quite difficult to do because Linux provides no opportunity to dump any information when an OOM occurs; the OOM killer simply SIGKILLs processes. Unfortunately GOMEMLIMIT
also doesn't help, because GOGC=off GOMEMLIMIT=<value>
is a valid configuration. In that configuration, the program is always at GOMEMLIMIT
from the perspective of the operating system.
With that being said, there are still some best-effort things we might be able to do. Programs that don't use GOMEMLIMIT
, can look at their own runtime/metrics
stats and dump a trace when it total memory use (the same expression that the GOMEMLIMIT
machinery is using to account for total memory) exceeds some threshold. For programs that do use GOMEMLIMIT
, it's possible to use the new live heap metric to do something similiar. Something that doesn't account for newly allocated data, which will give an indication of how close the live memory of the program is to the limit.
Either way though, it's still best-effort. Unless the snapshotting fully stops the world, the program may continue executing and OOM before the trace actually gets fully dumped. (Even then, it's still possible (but less likely) that it OOMs before the runtime successfully stops every running goroutine.)
Once upon a time there was a patch proposed to Linux to allow for halting a process when it hit container limits, so that another process in a tiny memory reservation could inspect it. One could imagine that if this other process created a core dump of the halted process, we could write some code to extract any active trace buffers from the core dump into a trace.
I'm not sure where this leaves us. Perhaps it suggests that WriteTo
should really stop-the-world, and do all the copying during the stop-the-world, so that these difficult cases are more deterministic. But that hurts other cases like "I just want to know why my request is taking a long time sometimes," since it has a bigger run-time performance impact, especially if your intent is to capture multiple such cases (i.e. leave it running in production for a while) and aggregate them. And in the end, it's also just a patch on the broader problem which is that it's surprisingly hard to just get information about an OOM on Linux to begin with.
(Sorry, mis-click.)
Nice!
How can we get CPU profile samples to show up in the flight-recorder data stream? The existing APIs for execution traces and CPU profiles have a start time and an end time, so those two pair well together. For this it seems that we'd need a way to ask for
SIGPROF
deliveries during the whole time the flight recorder is running, without preventing normal on-demand use ofruntime.StartCPUProfile
.
If you happen to have CPU profiling running, it'll just work, but you make a good point that there's no good way to have it included all the time (i.e. no intention of producing a CPU profile). It seems to me like that should maybe be another option, either on the FlightRecorder
, or on some new CPU profile configuration value. I'm not sure where it would be better to add this so that it composes well. My gut tells me it should go on the CPU profile API. (One could imagine that CPU profiling could have a similar mode, where the profiling data is just kept in the buffer and snapshotted every once in a while. That would mirror flight recording, just like trace.Start mirrors StartCPUProfile (as you point out).)
The panics in
WriteTo
look like they'd force those to coordinate with calls toStop
(to avoid the "flight recorder isn't running" panic). Could they instead return errors if there's anotherWriteTo
call, and end early (with an error) if there's a concurrentStop
call? Panic risk like this makes me nervous; I think instrumentation in general should Do No Harm.
Hm, that's a good point. I'll update the API to return an error on Stop
and update the reasons WriteTo
could fail.
consisting of short partitions corresponding approximately to the duration of the last
WriteTo
callYou've said that the data from
WriteTo
calls won't overlap. But on the other side, do the partitions in the new data format include identifiers (partition sequence number $X from thread/M $Y, and a list of all Ms?) that allows stitching together traces that are known to have no gaps from a set of shorterWriteTo
calls?
There's currently no list of all the Ms per generation because we don't have M events, but yes, everything within a partition is namespaced by that partition's "generation number." The proposed trace parsing API exposes partition changes as a Sync
event, so it'll be possible to identify when it happens. The number won't be exposed, but users of the API can assign their own identifiers to each partition. (If you're parsing the trace directly, then yeah the generation number will be available.)
FWIW, the trace parsing API already does this exact kind of "stitching." Every partition is an entirely self-contained trace, which means all goroutines (and Ps) and their statuses get named in every partition. The trace parser uses this information to validate the stitching: new partitions' goroutine statuses need to match where that goroutine left off in the previous partition.
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
I think this is waiting on an implementation and experience from using that implementation.
That's correct. I plan to have one in golang.org/x/exp/trace soon.
I made a mistake in the commit message so gopherbot couldn't connect the CL to this issue, but the experimental API has now landed via https://go.dev/cl/550257. It is available in golang.org/x/exp/trace for Go 1.22 only.
There are a few caveats with this implementation that will not be true with a real runtime implementation.
WriteTo
) is higher because the implementation actually needs to make 2 consecutive global buffer flushes (the results of both appear in the output, so in practice it shouldn't make a difference for analysis). This is because the end of a trace generation isn't explicitly signaled but derived from new trace data. This was perhaps a mistake in the format that can be fixed in a backwards-compatible way, but the runtime implementation has complete knowledge of when a generation starts and ends, so this won't be a problem in the future.I don't think any of these are significant enough to detract from the usefulness of the experiment, but I wanted to bring it up in case one of these does become an issue. We can also explore ways to improve the experiment to make it more representative, if one of them is indeed a problem.
Please give it a try!
The x/exp/trace
flight recorder seems to work nicely for me so far. Thanks for adding it!
One limitation I encountered is that the runtime/trace
package does not allow multiple trace.Start calls at the same time, so I needed to disable capturing runtime traces in the DataDog library with DD_PROFILING_EXECUTION_TRACE_ENABLED=false
env variable. I wonder if we could enable multiple consumers of the trace data at the same time. cc @felixge
As for preserving traces after OOM, one possible approach could be to use a separate agent process with a shared memory buffer to write the traces into. That agent could then detect that the main process crashed and write the contents of the memory buffer into a file or other storage. However, this would require keeping all the flight recorder data in a single pre-allocated buffer.
Thanks for trying it out!
One limitation I encountered is that the
runtime/trace
package does not allow multiple trace.Start calls at the same time, so I needed to disable capturing runtime traces in the DataDog library withDD_PROFILING_EXECUTION_TRACE_ENABLED=false
env variable. I wonder if we could enable multiple consumers of the trace data at the same time.
Yeah, that's a limitation of the experiment which lives out-of-tree for convenience, but it technically shouldn't be necessary if it's implemented in the runtime. I'd hoped to allow one of each type of consumer (one active flight recorder, one active trace.Start
).
I wonder if we could enable multiple consumers of the trace data at the same time.
@mknyszek , we've discussed allowing the caller to control the set of events that the trace includes. That would enable runtime maintainers to use the execution trace to grant access to events that are niche and expensive, or in the other direction would enable low-overhead access to a subset of events.
Maybe this is how we build/improve delta profiles for #67942, and how we give access to goroutine tags/labels in monotonic profiles (heap, mutex, etc) for #23458.
If that's the direction we go, I think we'll want either a/ multiple trace.Start
consumers (trace.(*Config).Start
?), each with their own set of events or b/ a way to change the set of events without restarting the trace, so a user-space package like x/exp/trace can be the single subscriber on behalf of multiple others within the program.
Multiple trace.Start
consumers isn't too hard to do (each new trace.Start
registers another place for the trace reader goroutine to write to, and calls traceAdvance
if the trace is already running to get a consistent starting point). But customizing each one is harder.
The only way I can imagine this working efficiently is via something like https://go.dev/cl/594595. TL;DR: We have some light infrastructure for emitting "experimental" events now, and this CL formalizes that a bit better by allowing us to split the event stream. It's at least a step toward that. If we have separate buffer streams, you could imagine a new registration consists of an io.Writer
and a bitmask indicating which streams to accept data from (internally).
It's certainly not off the table, but I think we're a little ways off still.
It turns out that's quite difficult to do because Linux provides no opportunity to dump any information when an OOM occurs; the OOM killer simply SIGKILLs processes. Unfortunately GOMEMLIMIT also doesn't help, because GOGC=off GOMEMLIMIT=
is a valid configuration. In that configuration, the program is always at GOMEMLIMIT from the perspective of the operating system.
I'd like to point out that in Linux it's now possible to be actively informed by the kernel through an epoll event if a control group is under memory pressure through the PSI subsystem.
https://systemd.io/MEMORY_PRESSURE/ https://docs.kernel.org/accounting/psi.html#monitoring-for-pressure-thresholds
This could be used as a signal to dump a trace and/or to trigger GC proactively.
Proposal: runtime/trace flight recording
Updated: 23 September 2023
Background
"Flight recording" is a technique in which trace data is kept in a conceptual circular buffer, flushed upon request. The purpose of this technique is to capture traces of interesting program behavior, even when one does not know ahead of time when that will happen. For instance, if the web service fails a health check, or the web service takes an abnormally long time to handle a request. Specifically, the web service can identify such conditions when they happen, but the programmer setting up the environment can't predict when exactly they will occur. Starting tracing after something interesting happens also tends not to be useful, because the program has already executed the interesting part.
The Java ecosystem has had this for years through Java's flight recorder. Once the JVM's flight recorder is enabled, the JVM can obtain a trace representing the last few seconds of time. This trace can come from triggers set up in JMX, or by passing a flag to the JVM that dumps a trace on exit.
With the implementation of #60773 now approaching a stable point, hopefully in Go 1.22 we'll have all traces become series of self-contained partitions. This implementation change presents an opportunity to easily add something similar to the Go execution tracer by always retaining at least one partition that can be snapshotted at any time.
This is also enabled by work in the Go 1.21 release to make traces dramatically cheaper. Because flight recording relies on waiting until something interesting happens, tracing needs to be enabled for a much longer period of time. Enabling flight recording across, for example, a small portion of a production fleet, becomes much more palatable when the tracing itself isn't too expensive.
Design
The core of the design is a new API in the
runtime/trace
package to enable flight recording. This means that programs can be instrumented with their own triggers.Implementation
Because #60773 already partitions traces, most of the implementation work is already done. There are just a few additional changes that need to be made.
First, the values from
SetPeriod
andSetSize
need to adjust runtime-internal variables when(*FlightRecorder).Start
is called. The period already exists, but the size in bytes does not. The tracer will need to track a partition's size and the background goroutine that creates new partitions will need to be woken when the partition's size exceeds that value.Next, the trace reader goroutine puts buffers it takes off the queue onto a free list when it's done with them. When flight recording is enabled simultaneously with a
trace.Start
, the trace reader will need to put these buffers on some other list in case the flight recorder needs them. The task of putting those buffers on the reuse list will be handled by whichever goroutine creates a new partition, and it will only place the oldest partition's buffers on the reuse list. This way, at least one full partition will always be available for the flight recorder.Lastly, when
(*FlightRecorder).WriteTo
is called, it will need to create a new partition. Then, it will need to take ownership of any outstanding trace buffers by setting a flag on each buffer preventing it from going on the reuse list. This will likely require a brief stop-the-world, but the actual copying will not happen while the world is stopped. Once copying is done,(*FlightRecorder).WriteTo
will wait until the trace reader is done with the partitions for the buffers it was using (if a trace reader is active) and then place all the buffers it marked on the reuse list.Discussion
SetPeriod
andSetSize
could give more rigid guarantees, but it's both complex to implement and not terribly useful. The primary use-case forSetPeriod
is to allow users to ask for longer traces (for example, if a web service's "long request" means something much longer than a handful of seconds). Meanwhile the primary use-case ofSetSize
is to control memory overheads and limit the damage caused by a largeSetPeriod
.WriteTo
could allow multiple goroutines to call it since it could easily serialize them internally. However, this can create some bad situations. For instance, consider some snapshot trigger condition that causes multiple goroutines to callWriteTo
. The call is heavyweight and they'll queue up behind each other; the longest one will likely take quite a while to resolve, and the application will be significantly disrupted. It'll also produce traces that aren't very useful (consisting of short partitions corresponding approximately to the duration of the lastWriteTo
call) unless we also allow for multiple goroutines to read the same partition's buffers. However, that's going to be fairly complicated to implement, and also doesn't really add much value either, since it's just duplicate data. The current design encourages callers reduces the risk of run-time panics while also side-stepping these issues by returning an error in this case.Alternatives considered
External circular buffer
@dominikh has suggested adding a similar feature to gotraceui. Because the partitioning is actually baked into the trace's structure, it's possible for trace consumers to implement something similar themselves. The only thing missing is a standard streaming endpoint for execution traces (see follow-up work).
However, there are a few advantages to adding support directly to the standard library and runtime.
runtime/trace
package's annotations, but can instead decide when to grab a snapshot directly.SetPeriod
andSetSize
). Any external solution will be at the mercy of the runtime's defaults.Despite these advantages, it's likely worth pursuing support for such a use-case even if the API described in this proposal is made available. A shortcoming of this document's proposal is that there's no way to trigger a snapshot explicitly against trace data, only program logic. Handling traces externally also means the ability to perform ad-hoc analyses without the need for additional instrumentation.
Follow-up work
Add support for trace streaming endpoints to
net/http/pprof
As stated in the discussion of the "external circular buffer" alternative, we could support that alternative easily and well by just adding a standard debugging endpoint for streaming trace data. It probably makes the most sense to just add new query parameters to the existing
trace
endpoint; the details of that can be hashed out elsewhere.