Attendees: @mknyszek @aclements @prattmic @felixge @nsrip-dd @rhysh
Notes:
I'll miss the Dec 22nd meetup because I'm traveling for the holidays. That being said, if I find time I might also look into https://github.com/golang/go/issues/57159 . Getting a proof of concept for Perfetto UI integration (ideally using their protocol buffer format) is probably more important than the gentraceback refactoring at this point. I just tried to work with a 300 MB trace (15s of prod activity) yesterday, and it was a real eye-opener to how much the current UI struggles.
I don't know if it's relevant (probably nothing new for the folks on this thread), but I had similar problems with the go tool trace viewer, where it would freeze on me all the time, especially in the per-goroutine view (/trace?goid=N). I figured out you can download perfetto-compatible JSON data from /jsontrace?goid=N (/jsontrace gives the default view). This can then be uploaded to ui.perfetto.dev. This doesn't show all the information in the trace, so it's not as great, but I was glad to have something that worked.
would the pprof labels also show up in goroutine traces?
I'm working on a PoC that improves native stack unwinding on Windows by adding additional information to the PE file. This will help debugging with WinDbg and profiling with Windows Performance Analyzer. Would this work fit into the effort tracked by this issue?
@thediveo I think that might be a good question for #56295, or you could file another issue. Off the top of my head, that doesn't sound like it would be too difficult to do.
@qmuntal Oh neat! That's awesome. I think it's a little tangential to the work we're proposing here, unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback). Then again, if one of the goals is better integration with the Windows Performance Analyzer, that's certainly more in the same spirit. Do you have an issue for tracking that already?
Do you have an issue for tracking that already?
I still have to prepare the proposal, I plan to submit it next week.
unless you also plan to do anything with the runtime's unwinder (i.e. gentraceback).
Not for now, but once I finish this I want to investigate how feasible it is to unwind native code and merge it with the Go unwinding, in case the exception happens in a non-Go module.
Do you have an issue for tracking that already?
I do now #57302 😄
Change https://go.dev/cl/459095 mentions this issue: sweet: add support for execution traces and measuring trace overhead
Attendees: @mknyszek @aclements @prattmic @bboreham @rhysh @dominikh
name old time/op new time/op delta
BiogoIgor 17.7s ± 3% 17.5s ± 4% ~ (p=0.190 n=10+10)
BiogoKrishna 15.1s ± 4% 15.1s ± 4% ~ (p=0.739 n=10+10)
BleveIndexBatch100 5.78s ± 7% 5.76s ±11% ~ (p=0.853 n=10+10)
BleveQuery 2.37s ± 0% 2.37s ± 0% -0.26% (p=0.016 n=8+10)
FoglemanFauxGLRenderRotateBoat 16.9s ± 9% 16.9s ± 7% ~ (p=0.796 n=10+10)
FoglemanPathTraceRenderGopherIter1 36.7s ± 1% 44.4s ± 2% +21.01% (p=0.000 n=10+10)
GoBuildKubelet 47.0s ± 2% 48.8s ± 3% +3.72% (p=0.000 n=10+10)
GoBuildKubeletLink 8.89s ± 2% 8.88s ± 4% ~ (p=0.720 n=10+9)
GoBuildIstioctl 45.9s ± 1% 47.8s ± 2% +4.09% (p=0.000 n=10+10)
GoBuildIstioctlLink 9.07s ± 2% 8.99s ± 2% ~ (p=0.095 n=10+9)
GoBuildFrontend 15.7s ± 4% 16.1s ± 2% +2.45% (p=0.043 n=10+10)
GoBuildFrontendLink 1.38s ± 2% 1.37s ± 3% ~ (p=0.529 n=10+10)
GopherLuaKNucleotide 27.9s ± 0% 27.9s ± 1% ~ (p=0.853 n=10+10)
MarkdownRenderXHTML 256ms ± 2% 256ms ± 2% ~ (p=1.000 n=9+9)
Tile38WithinCircle100kmRequest 618µs ± 7% 657µs ±10% +6.30% (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest 722µs ± 6% 773µs ± 4% +6.96% (p=0.000 n=10+9)
Tile38KNearestLimit100Request 508µs ± 3% 532µs ± 3% +4.73% (p=0.000 n=10+10)
name old average-RSS-bytes new average-RSS-bytes delta
BiogoIgor 68.8MB ± 2% 71.8MB ± 4% +4.40% (p=0.000 n=10+10)
BiogoKrishna 4.42GB ± 0% 4.42GB ± 0% ~ (p=0.739 n=10+10)
BleveIndexBatch100 194MB ± 2% 198MB ± 3% +1.91% (p=0.008 n=9+10)
BleveQuery 536MB ± 0% 537MB ± 1% ~ (p=0.190 n=10+10)
FoglemanFauxGLRenderRotateBoat 444MB ± 1% 446MB ± 0% +0.41% (p=0.035 n=10+9)
FoglemanPathTraceRenderGopherIter1 132MB ± 1% 142MB ± 4% +7.61% (p=0.000 n=10+10)
GoBuildKubelet 1.75GB ± 1% 1.85GB ± 1% +5.51% (p=0.000 n=10+10)
GoBuildIstioctl 1.35GB ± 1% 1.42GB ± 1% +5.49% (p=0.000 n=10+9)
GoBuildFrontend 511MB ± 2% 543MB ± 1% +6.31% (p=0.000 n=10+9)
GopherLuaKNucleotide 37.0MB ± 1% 40.4MB ± 2% +9.24% (p=0.000 n=9+10)
MarkdownRenderXHTML 21.8MB ± 3% 24.0MB ± 3% +10.14% (p=0.000 n=9+8)
Tile38WithinCircle100kmRequest 5.40GB ± 1% 5.38GB ± 1% ~ (p=0.315 n=10+10)
Tile38IntersectsCircle100kmRequest 5.72GB ± 1% 5.71GB ± 1% ~ (p=0.971 n=10+10)
Tile38KNearestLimit100Request 7.26GB ± 0% 7.25GB ± 0% ~ (p=0.739 n=10+10)

name old peak-RSS-bytes new peak-RSS-bytes delta
BiogoIgor 95.9MB ± 4% 98.5MB ± 3% +2.70% (p=0.030 n=10+10)
BiogoKrishna 4.49GB ± 0% 4.49GB ± 0% ~ (p=0.356 n=9+10)
BleveIndexBatch100 282MB ± 3% 284MB ± 4% ~ (p=0.436 n=10+10)
BleveQuery 537MB ± 0% 538MB ± 1% ~ (p=0.579 n=10+10)
FoglemanFauxGLRenderRotateBoat 485MB ± 1% 483MB ± 0% ~ (p=0.388 n=10+9)
FoglemanPathTraceRenderGopherIter1 180MB ± 2% 193MB ± 3% +7.19% (p=0.000 n=10+10)
GopherLuaKNucleotide 39.8MB ± 3% 46.0MB ±20% +15.56% (p=0.000 n=9+10)
MarkdownRenderXHTML 22.1MB ± 3% 25.5MB ± 7% +15.45% (p=0.000 n=9+10)
Tile38WithinCircle100kmRequest 5.70GB ± 1% 5.68GB ± 1% -0.45% (p=0.023 n=10+10)
Tile38IntersectsCircle100kmRequest 5.93GB ± 1% 5.91GB ± 2% ~ (p=0.631 n=10+10)
Tile38KNearestLimit100Request 7.47GB ± 1% 7.46GB ± 0% ~ (p=0.579 n=10+10)

name old peak-VM-bytes new peak-VM-bytes delta
BiogoIgor 802MB ± 0% 803MB ± 0% +0.11% (p=0.000 n=10+10)
BiogoKrishna 5.24GB ± 0% 5.24GB ± 0% +0.01% (p=0.001 n=10+10)
BleveIndexBatch100 1.79GB ± 0% 1.79GB ± 0% +0.05% (p=0.000 n=8+8)
BleveQuery 3.53GB ± 1% 3.53GB ± 1% ~ (p=0.237 n=10+10)
FoglemanFauxGLRenderRotateBoat 1.21GB ± 0% 1.16GB ± 4% ~ (p=0.163 n=8+10)
FoglemanPathTraceRenderGopherIter1 875MB ± 0% 884MB ± 0% +1.02% (p=0.000 n=10+10)
GopherLuaKNucleotide 733MB ± 0% 734MB ± 0% +0.11% (p=0.000 n=9+10)
MarkdownRenderXHTML 733MB ± 0% 734MB ± 0% +0.10% (p=0.000 n=10+9)
Tile38WithinCircle100kmRequest 6.42GB ± 0% 6.39GB ± 1% ~ (p=0.086 n=8+10)
Tile38IntersectsCircle100kmRequest 6.62GB ± 1% 6.61GB ± 2% ~ (p=0.927 n=10+10)
Tile38KNearestLimit100Request 8.16GB ± 1% 8.18GB ± 0% ~ (p=0.649 n=10+8)

name old p50-latency-ns new p50-latency-ns delta
Tile38WithinCircle100kmRequest 144k ± 3% 159k ± 3% +10.56% (p=0.000 n=9+9)
Tile38IntersectsCircle100kmRequest 215k ± 1% 232k ± 2% +7.91% (p=0.000 n=9+10)
Tile38KNearestLimit100Request 347k ± 2% 373k ± 1% +7.21% (p=0.000 n=10+10)

name old p90-latency-ns new p90-latency-ns delta
Tile38WithinCircle100kmRequest 908k ± 6% 956k ± 9% +5.22% (p=0.043 n=10+10)
Tile38IntersectsCircle100kmRequest 1.07M ± 4% 1.11M ± 5% +4.33% (p=0.001 n=10+10)
Tile38KNearestLimit100Request 1.03M ± 3% 1.05M ± 4% +2.64% (p=0.011 n=10+10)

name old p99-latency-ns new p99-latency-ns delta
Tile38WithinCircle100kmRequest 7.55M ± 9% 7.93M ±13% ~ (p=0.089 n=10+10)
Tile38IntersectsCircle100kmRequest 7.81M ± 8% 8.39M ± 2% +7.36% (p=0.000 n=10+8)
Tile38KNearestLimit100Request 2.03M ± 4% 2.08M ± 5% +2.52% (p=0.019 n=10+10)

name old ops/s new ops/s delta
Tile38WithinCircle100kmRequest 9.73k ± 7% 9.16k ±11% -5.83% (p=0.015 n=10+10)
Tile38IntersectsCircle100kmRequest 8.31k ± 6% 7.77k ± 4% -6.55% (p=0.000 n=10+9)
Tile38KNearestLimit100Request 11.8k ± 3% 11.3k ± 3% -4.51% (p=0.000 n=10+10)
* Introduction: Bryan Boreham, Grafana Labs
* Questions within the team about whether useful information has been derived from Go execution traces.
* Phlare: continuous profiling. Interested in linking together various signals (distributed tracing, profiling)
* Michael K: Interesting data point about usability.
* Michael P: Hard to link application behavior to trace.
* Bryan: Example: channels. Still don't really know where to find that data.
* Dominik: One of the reasons I started on gotraceui was to surface more information and do more automatic inference and analysis of the data.
* Rhys: Execution trace technique: get data out of them to find the interesting traces. Try to extract features that would be interesting up-front.
* Starts with internal trace parser. Have code to find start and end of HTTP requests, DNS lookups, etc.
* Tooling on the way to get open sourced.
* Heap analysis plan (#57447)
* Austin: Additional context is we're confident in the API we're planning to export, as opposed to tracing which we have nothing for yet.
* https://go.dev/issue/57307 proposal: cmd/trace: visualize time taken by syscall
* Austin: Does Perfetto do better with instantaneous events?
* Michael P: Yes, there's a 20px wide arrow but we have so many.
* Rhys: Hold shift, draw a box. If you aim well, you get what you want.
* Rhys: Why is there only one timestamp on some events?
* Austin: We can add another timestamp.
* Michael P: Syscall fast path does a lot less.
* pprof labels in traces
* Michael K: I think I've unblocked Nick. Michael and I are reviewing.
* `runtime.gentraceback` cleanup
* Austin: Back and forth on the issue about making it an iterator, sent out CLs, not tested yet.
* Next meeting: Jan 5th, Michael P and Michael K won't be here, so Austin will run it.
* Action items:
* We're slowing down for the holidays, so no strong expectations
* Michael K:
* Try to land execution trace benchmarking.
* Might look into heap analysis stuff.
* After break, might want to start working on trace format more seriously.
* Happy holidays!
Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham vnedkov @dashpole
Attendees: @aclements @felixge @nsrip-dd @rhysh @bboreham @mknyszek @prattmic @dominikh @dashpole
go tool trace (bugs), e.g. timelines not named correctly, events not connected correctly.
Attendees: @aclements @felixge @nsrip-dd @thepudds @bboreham @dashpole @mknyszek @prattmic
Felix: The prototype is missing inline expansion, support for SetCgoTraceback (Go -> C -> Go), and there are dragons in the compiler where the FP isn't on the stack when it should be. The previous implementation hit this, and I suspect I hit this as well.
FYI: #57302 is hitting this as well, as I'm implementing SEH unwinding using the frame pointer. Whatever the fix for that ends up being, it would be good to take SEH into account as well.
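For readers less familiar with the technique under discussion, here is a minimal sketch of what frame-pointer unwinding looks like on amd64. The function name and layout assumptions are mine, not the prototype's: each frame's saved frame pointer links to the caller's frame, with the return address stored one word above it.

```go
package fpunwind

import "unsafe"

// fpUnwind is a rough illustration (not the actual prototype): starting from a
// frame pointer, it walks the chain of saved frame pointers, collecting the
// return address stored one word above each saved pointer. A real unwinder
// also has to handle inlined frames, cgo transitions (SetCgoTraceback), and
// frames where the FP convention isn't honored, which is where the "dragons"
// mentioned above live.
func fpUnwind(fp uintptr, pcs []uintptr) int {
	n := 0
	for fp != 0 && n < len(pcs) {
		// The return address sits just above the saved frame pointer.
		pc := *(*uintptr)(unsafe.Pointer(fp + unsafe.Sizeof(fp)))
		if pc == 0 {
			break
		}
		pcs[n] = pc
		n++
		// Follow the saved frame pointer to the caller's frame.
		fp = *(*uintptr)(unsafe.Pointer(fp))
	}
	return n
}
```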
Attendees: @mknyszek @aclements @felixge @nsrip-dd @prattmic @dominikh @thepudds @pmbauer @dashpole @rhysh
$ benchstat -col '.name@(CPUTicks Nanotime)' /tmp/bench
goos: linux
goarch: amd64
pkg: runtime
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
│ CPUTicks │ Nanotime │
│ sec/op │ sec/op vs base │
*-8 10.75n ± 0% 16.11n ± 0% +49.88% (p=0.000 n=20)
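For context, a comparison along these lines can be reproduced outside the runtime with go:linkname pulls of the internal clock functions. The sketch below is an assumption about the setup, not the benchmark actually used; pull-style linkname needs an empty .s file in the package, and newer Go releases may require relaxing the linkname checks (e.g. -ldflags=-checklinkname=0).

```go
package clockbench

import (
	"testing"
	_ "unsafe" // required for go:linkname
)

// Hypothetical shims pulling in the runtime's internal clocks.
// A (possibly empty) .s file must exist in the package so the compiler
// accepts these bodyless declarations.
//
//go:linkname nanotime runtime.nanotime
func nanotime() int64

//go:linkname cputicks runtime.cputicks
func cputicks() int64

var sink int64

func BenchmarkCPUTicks(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = cputicks()
	}
}

func BenchmarkNanotime(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = nanotime()
	}
}
```

Running `go test -bench . -count 20` and feeding the output to `benchstat -col '.name@(CPUTicks Nanotime)'` produces a comparison in the shape shown above.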
Attendees: @mknyszek @prattmic @felixge @nsrip-dd @aclements @thepudds @rhysh @bboreham
Michael K: I'm 70% of the way to a trace v2 (producer, consumer, trace format), and 40% of the way to writing it up.
[Michael K proceeds to go into way too much detail about this. Highlights below. A public document will follow.]
Let's use the system clock (e.g. clock_gettime) instead of RDTSC (for a number of reasons).
There are a very small number of places where you really need to understand the exact order of events. The current tracer takes advantage of that and I believe we need to retain this. Timestamps aren't enough.
Attach traces to Ms, not Ps. There’s a lot of complexity around GoSysExit racing with trace start. Thinking about ragged start and making the parser robust to that.
Trace binary format ended up being more about consumer efficiency than producer efficiency, but still more efficient on both sides.
Use /proc/<pid>/maps for binary/symbol information.
Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings
Does this include the current state of all (relevant) goroutines? The current parser is essentially a state machine and we need to see all previous events to reconstruct a global timeline. I don't see that going away with the new format.
Michael K: I also have a rougher draft of a trace parser API, with input from Michael Pratt.
I'd encourage you to take a look at https://github.com/dominikh/gotraceui/blob/04107aeaa72e30c50bb6d10e9f2b6ca384fafc3d/trace/parser.go#L18-L77 for the data layout I've chosen in gotraceui. It's nothing groundbreaking, but it highlights the need to avoid the use of pointers.
Traces will be partitioned for streaming. Each partition is fully self-contained with a set of stacks and strings
Does this include the current state of all (relevant) goroutines? The current parser is essentially a state machine and we need to see all previous events to reconstruct a global timeline. I don't see that going away with the new format.
It does not. It only cares about the initial state of all Ms (including goroutines running on them), and generally only mentions goroutines that actually emit events. For goroutines that aren't running, there are only two cases where we actually care about the initial state of a goroutine: whether it was blocked, or whether it was waiting. In both cases it's straightforward to infer the state of the goroutine from the events that must happen to transition goroutines out of these states: unblocking and starting to run.
The trace still needs to indicate if a goroutine (and M) is in a syscall or if it's running. In the new design, this information is emitted together at the first call into the tracer by that M for that partition. The timestamp needs to be back-dated to the start of the partition. There's some imprecision with this back-dating, but it's only relevant at the very start of a trace. The worst case is that a goroutine may appear to have been running or in a syscall at the start of a trace for longer than it actually was. The amount of imprecision here is bounded by the time delta between the global (serialized) declaration of a new partition and when an M has its buffer flushed and/or is notified (via an atomic) that tracing has started, which I expect in general to be very short and non-blocking. (We can also explicitly bound the time by telling the M what time it was contacted for a new partition.)
Note that the details above imply that when a new partition starts, a running M may have been in a tight loop and so hasn't emitted any events for the last partition, in which case we need to preempt it to have it dump its initial state. Generally, moving partitions forward doesn't have to even involve preemption.
Michael K: I also have a rougher draft of a trace parser API, with input from Michael Pratt.
I'd encourage you to take a look at https://github.com/dominikh/gotraceui/blob/04107aeaa72e30c50bb6d10e9f2b6ca384fafc3d/trace/parser.go#L18-L77 for the data layout I've chosen in gotraceui. It's nothing groundbreaking, but it highlights the need to avoid the use of pointers.
That seems useful for the current trace format, thanks. For the new format, I don't expect to expand the trace events out of their encoded form at all, but rather decode them lazily (either copy them out wholesale or just point into the encoded trace data in the input buffer, both of which are cheap from the perspective of the GC).
In both cases it's straightforward to infer the state of the goroutine from the events that must happen to transition goroutines out of these states: unblocking and starting to run.
That has two implications, however:
I realize that with self-contained partitions it isn't feasible to include the state of all goroutines in all partitions, but maybe it should optionally be possible to dump complete state in the first partition, for users who want a complete view? However that wouldn't really fit into an M-centric format…
That seems useful for the current trace format, thanks. For the new format, I don't expect to expand the trace events out of their encoded form at all, but rather decode them lazily (either copy them out wholesale or just point into the encoded trace data in the input buffer, both of which are cheap from the perspective of the GC).
I feel like the current parser + its types and the new approach you describe are at two different layers of abstraction. The current parser isn't exposing raw events. Instead it is doing a fair bit of processing of arguments, and it populates Link fields, which point to related events. Your approach sounds a lot closer to just casting from []byte to a type describing the raw events. And there'll still need to be a layer of abstraction on top of that that can be consumed by users (unless you expect them to build their own, which would work for me, but be a barrier to entry for people less familiar with the underlying file format.)
That has two implications, however:
- goroutines that don't unblock during the trace will be unaccounted for
- the states of all goroutines can't be determined without looking at the entire trace
I realize that with self-contained partitions it isn't feasible to include the state of all goroutines in all partitions, but maybe it should optionally be possible to dump complete state in the first partition, for users who want a complete view?
Both of those things are good points.
Dumping the state of the world at the start is one option but I'm also reluctant to do anything around this because it adds a lot of overhead. Interrogating every goroutine can take a while, and the world needs to be effectively stopped while it happens (or the synchronization will get really complicated). At the end of the day, my gut feeling is that the execution trace should focus solely on what's necessary for tracing execution, not what could execute.
However, I can definitely see that getting the information you describe has utility and we don't want to lose that. In the last meeting we discussed how goroutine profiles could be used to fill this gap. As a baseline, it should be fairly straightforward to correlate a goroutine profile's STW timestamp with a STW event in the trace. Taking that one step further, we could explicitly mention that the STW was for a goroutine profile in the trace. (In theory we could also dump the goroutine profile into the trace, like we do with CPU samples. I am not opposed to this, but I probably wouldn't do it to start with.)
You should be able to get a close approximation to the current behavior by starting a trace and then immediately grabbing a goroutine profile. Does that sound reasonable? Perhaps there's some use-case I've totally missed. FTR, I fully recognize that we're losing something here in the trace, but I'd argue the net benefit is worth that cost.
Also I just want to disclaim the design details in the last paragraph: subject to change in the first document draft. :) That's just where my head's at right now. It may turn out that the per-M synchronization I have in mind is too complex.
However that wouldn't really fit into an M-centric format…
I think it works fine if, like I mention above, we're willing to give a little bit of leeway. Maybe you don't have a snapshot of the state of all goroutines at the moment the trace starts, but you have one from very soon after the trace starts, which is probably good enough?
I feel like the current parser + its types and the new approach you describe are at two different layers of abstraction. The current parser isn't exposing raw events. Instead it is doing a fair bit of processing of arguments, and it populates Link fields, which point to related events. Your approach sounds a lot closer to just casting from []byte to a type describing the raw events. And there'll still need to be a layer of abstraction on top of that that can be consumed by users (unless you expect them to build their own, which would work for me, but be a barrier to entry for people less familiar with the underlying file format.)
That's another good point. To be clear, I do plan to have an API with some level of abstraction and not quite just []byte-to-type. :) Events will be opaque and fields will be accessed through methods, so we have a lot of wiggle room. However, something like the Link field I think requires keeping the whole trace in memory, because you never know when someone might want to access an event from a long long time ago (though I haven't thought this through). In theory an accessor can be arbitrarily complicated and even re-parse the trace to find the event, I suppose. :P
My general hope and expectation is that the vast majority of users should never have to look at the API at all, and instead rely on tools built with it. And those that do use the API don't need to understand the file format, just the execution model it presents (which I think is somewhat unavoidable).
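To make the "opaque events with accessor methods" idea concrete, here is a hypothetical shape such an API could take. Every name here (Kind, Event, Time, partitionTable) is an assumption for illustration, not the proposed API, and the byte layout is invented.

```go
package traceparse

import "encoding/binary"

// Kind identifies what an event represents.
type Kind uint8

const (
	KindGoCreate Kind = iota
	KindGoStart
	KindGoBlock
	// ...
)

// Event is an opaque handle over encoded trace data. Accessors decode lazily
// from the underlying buffer, so holding an Event doesn't require expanding
// the whole trace into memory, and the data it points into is GC-cheap.
type Event struct {
	table *partitionTable // shared string/stack tables for this partition
	data  []byte          // slice into the encoded trace buffer
}

// Kind reads the event's tag byte.
func (e Event) Kind() Kind { return Kind(e.data[0]) }

// Time decodes the varint timestamp that (in this sketch) follows the tag.
func (e Event) Time() int64 {
	ts, _ := binary.Uvarint(e.data[1:])
	return int64(ts)
}

// partitionTable holds per-partition string and stack tables, so each
// partition stays self-contained.
type partitionTable struct {
	strings []string
	stacks  map[uint64][]uint64
}
```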
Dumping the state of the world at the start is one option but I'm also reluctant to do anything around this because it adds a lot of overhead. Interrogating every goroutine can take a while, and the world needs to be effectively stopped while it happens (or the synchronization will get really complicated).
I think not having to STW and enumerate all goroutines was one of the design goals, as it didn't scale well. I take it the ragged barrier approach didn't pan out?
At the end of the day, my gut feeling is that the execution trace should focus solely on what's necessary for tracing execution, not what could execute.
One use case of looking at execution traces as they are now is debugging synchronization issues. Imagine having an N:M producer/consumer model using goroutines and channels, and we're debugging why producers are blocking. The reason might be that all of the consumers are stuck, which is only evident if we can see them be stuck. If they're already stuck at the beginning of the trace then they would be invisible in the new implementation.
More generally speaking, a lot of users aren't interested in the per-P or per-M views and instead want to see what each goroutine is doing (see also the per-goroutine timelines in gotraceui.) It turns out that per-G views are useful for debugging correctness and performance issues in user code and that traces aren't only useful for debugging the runtime.
You should be able to get a close approximation to the current behavior by starting a trace and then immediately grabbing a goroutine profile. Does that sound reasonable?
In theory that sounds fine, assuming goroutine profiles are proper STW snapshots? Otherwise it would probably be difficult to synchronize the trace and the profile.
At least this would give people the choice if they want to tolerate STW for more detailed traces.
However that wouldn't really fit into an M-centric format…
I think it works fine if, like I mention above, we're willing to give a little bit of leeway. Maybe you don't have a snapshot of the state of all goroutines at the moment the trace starts, but you have one from very soon after the trace starts, which is probably good enough?
Probably, yeah.
I think not having to STW and enumerate all goroutines was one of the design goals, as it didn't scale well. I take it the ragged barrier approach didn't pan out?
It's not quite that it didn't pan out and more that it just doesn't work with a per-M approach given other design constraints.
The ragged barrier I mentioned in an earlier design sketch is the forEachP one, which is ultimately still P-focused. Part of the reason I want to switch to a per-M approach is to remove the GoSysExit complexity that comes from the fact that goroutines can in fact run without Ps sometimes. That complexity is part of the event's semantics, so it tends to leak everywhere.
A per-M approach can side-step a lot of that complexity, but it means we need a way to synchronize all Ms that doesn't involve waiting until the M gets back into the scheduler. What I wrote above is a rough sketch of a proposed lightweight synchronization mechanism that most of the time doesn't require preemption. I think that in general we can't require preemption in a per-M approach if we want to be able to simplify the no-P edge cases and also get events out of e.g. sysmon, which always runs without a P. (EDIT: D'oh. I keep forgetting that the current tracer can indeed emit events without a P. So it's really more that we don't currently have a great way of tracking Ms in general. I would like to add more explicit M-related events. The GoSysExit point still stands because it races with a trace stop-the-world, which is the main source of complexity. If we synchronize via Ms, that goes away.)
(In effect, I am proposing to shift the GoSysExit complexity somewhere else, but I hope that in the end it will be less complexity overall, because the M synchronization details can probably be written in a way such that the details don't leak as much.)
An aside that might steer you closer to a per-M approach: I tried adding per-M timelines to gotraceui using the current format and found it impossible due to the current event sorting logic. I ran into scenarios where a P would start on an M while the M was still blocked in a syscall.
In theory that sounds fine, assuming goroutine profiles are proper STW snapshots? Otherwise it would probably be difficult to synchronize the trace and the profile.
Yes, goroutine profiles are STW snapshots, but the duration of the STW pause does not vary based on the number of goroutines. Go 1.19 includes https://go.dev/cl/387415, which says "... do only a fixed amount of bookkeeping while the world is stopped. Install a barrier so the scheduler confirms that a goroutine appears in the profile, with its stack recorded exactly as it was during the stop-the-world pause, before it allows that goroutine to execute."
We'd want something like a "STW" event in the execution trace that we could tie back to the particular goroutine profile (maybe record the goroutine ID of the initiator and either a call stack or the reason string); although a protobuf-formatted goroutine profile will include a timestamp, it's determined by the runtime/pprof package after all of the stacks have been collected, rather than by the runtime package during the STW that the snapshot represents.
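As a concrete illustration of the workaround being discussed (start a trace, then immediately grab a goroutine profile so the two can be correlated), something like the following works with today's APIs; the file names and debug level are arbitrary choices, not part of any proposal.

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"runtime/trace"
)

func main() {
	tf, err := os.Create("exec.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer tf.Close()
	if err := trace.Start(tf); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// Snapshot all goroutine stacks right after tracing begins.
	// debug=0 writes the protobuf format; debug=2 would write a full
	// text dump of every goroutine's stack.
	gf, err := os.Create("goroutines.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer gf.Close()
	if err := pprof.Lookup("goroutine").WriteTo(gf, 0); err != nil {
		log.Fatal(err)
	}

	// ... run the workload being traced ...
}
```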
goroutines can in fact run without Ps sometimes
This goes for g0 as well: CPUSample events include the G and P (when they're available), but an M can run without either, such as when "spinning" near the bottom of findRunnable. That event should have included a reference to the M. As it is, it's tricky to attribute the on-CPU cost of spinning Ms, and to find the application behaviors that result in higher/lower costs there.
Having a tracing buffer available to the M could also simplify the way we get information from the SIGPROF handler into the trace.
Attendees: @mknyszek @felixge @bboreham @nsrip-dd @dominikh @rhysh @thepudds
Attendees: @prattmic @mknyszek @felixge @nsrip-dd @bboreham @thepudds
Attendees: @nsrip-dd @bboreham @thepudds @prattmic @mknyszek
name time/op
Varint-48 13.9ns ± 0%
Word-48 3.20ns ± 1%
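For reference, the comparison here is essentially LEB128-style varint encoding versus writing a fixed-width word. A minimal benchmark in that spirit (my own construction, not the one that produced the numbers above) might look like:

```go
package encbench

import (
	"encoding/binary"
	"testing"
)

var buf [10]byte // large enough for a max-length uvarint
var sinkN int

// BenchmarkVarint appends a uint64 using LEB128-style varint encoding.
func BenchmarkVarint(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sinkN = binary.PutUvarint(buf[:], uint64(i)|1<<40)
	}
}

// BenchmarkWord writes the same value as a fixed-width little-endian word.
func BenchmarkWord(b *testing.B) {
	for i := 0; i < b.N; i++ {
		binary.LittleEndian.PutUint64(buf[:8], uint64(i)|1<<40)
		sinkN = 8
	}
}
```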
@mknyszek @thepudds interesting overhead discussion ❤️! FWIW I don't think we'll be able to get overhead < 1% for pathological workloads (e.g. channel ping pong). Our internal target is < 1% overhead for 95% of applications. From what I've seen, I think we'll hit that with frame pointer unwinding.
@mknyszek thanks for all the experiments. The varint benchmark sounds promising! But we're still struggling to confirm it as a significant overhead source on our end. Looking forward to discussing it more at the next sync.
Sorry for missing the meeting, I'm on vacation right now 🏖️.
FWIW I don't think we'll be able to get overhead < 1% for pathological workloads (e.g. channel ping pong). Our internal target is < 1% overhead for 95% of applications. From what I've seen, I think we'll hit that with frame pointer unwinding.
I'd like to learn more about what the <1% looks like (is that while the tracer is enabled continuously?). But when you get back. :)
thanks for all the experiments. The varint benchmark sounds promising! But we're still struggling to confirm it as a significant overhead source on our end. Looking forward to discussing it more at the next sync.
That's fair, and I'm considering putting it to rest for now and just proceeding with everything else. These experiments were easy to run in the background since I was on rotation last week, but in hindsight I think I should've checked a couple of other things first. Specifically, I should've checked whether it's worth changing the encoding solely for the encoder-side costs at all (e.g. compared to the cost of nanotime?). I'll run these before the next meeting and I should have the answers I need on whether to pursue this further in the near-term.
Thinking more long-term, LEB128 is well-known for being relatively slow to decode, and I suspect in-process decode performance may become a bottleneck for some use-cases. I think there are still a lot of questions there but I also want to leave open as many doors as possible. I'll add a line to the agenda for next time about this.
(Since we're already overhauling the trace format, I'd like to make sure these kinds of use-cases are considered to avoid making more breaking changes than necessary down the line.)
Sorry for missing the meeting, I'm on vacation right now 🏖️.
No worries at all; enjoy your vacation! :)
I'd like to learn more about what the <1% looks like (is that while the tracer is enabled continuously?). But when you get back. :)
We're still working out the details, but yeah, ideally the overhead is <1% (cpu, memory, latency) for 95% of applications while the tracer is recording.
Thinking more long-term, LEB128 is well-known for being relatively slow to decode, and I suspect in-process decode performance may become a bottleneck for some use-cases.
Yeah, I think there are many good reasons for moving away from LEB128 in the long run. In-process decoding is an excellent reason. If we could find an approach that doesn't increase data volumes or encoding overhead, it'd be an absolute no-brainer. But if that's not possible we just need to be careful that we're hitting a good balance.
Attendees: @mknyszek @thepudds @nsrip-dd @felixge @prattmic @bboreham @cagedmantis @irfansharif
Regarding representative traces, I attach two traces from the Open Source monitoring system Prometheus: prom.trace1.gz prom.trace2.gz
Both cover 5 seconds; the first one is from a small Prometheus and the second from a much bigger instance. Let me know if something longer/bigger/different would help more.
Attendees: @mknyszek @rhysh @bboreham @thepudds @aclements @nsrip-dd @dominikh @prattmic
Attendees: @mknyszek @thepudds @aclements
Before the notes, I'd like to note that this sync barely happened. I showed up 15 minutes late (very sorry about that), though it seems like many of us were unavailable for the sync anyway, so perhaps it was just not meant to be. For anyone that showed up in those first 15 minutes but left confused, I'm very sorry about that; that's a mistake on my part. Austin and I joined late and briefly discussed some topics with thepudds, but the sync did not run very long at all.
Attendees: @mknyszek @prattmic @bboreham @leitzler @nsrip-dd @rhysh @felixge @irfansharif
Attendees: @aclements @mknyszek @prattmic @nsrip-dd @rhysh @felixge
go tool trace. Does that work? go tool pprof does have a -raw flag (and a -traces flag?) for producing a text format.
With the go tool trace UI it would currently be really onerous, even for little things like just the latency histograms. Text format would help.

Attendees: @mknyszek @aclements @prattmic @felixge @nsrip-dd @bboreham @rhysh
comparable).
I'm leaving this here as it's somewhat related and possibly of interest: user_events: Enable user processes to create and write to trace events
@mknyszek regarding our earlier conversation about EvGoSysCall and cgo calls emitting those events, I've noticed that in Go 1.21, calls to runtime.gcStart also emit EvGoSysCall.
Maybe it's best if we clean that up after all and make sure only actual syscalls emit that event. If we later decide that Cgo calls deserve to be included, too, they could get their own event type.
I've noticed that in Go 1.21, calls to runtime.gcStart also emit EvGoSysCall.
Huh... Thanks for pointing that out. That's a bit surprising to me; maybe a bug? Can you share the runtime part of the stack trace for the event? I'm curious as to where it's getting emitted. (Sounds benign, but maybe not!)
+1 to cgo calls just having their own event type, though. Alternatively, the syscall event could become a more general "we're leaving Go" event with a subtype byte (since otherwise we'd have to copy the complexity of the syscall event's sequencing into the cgo events).
runtime.gcStart
/home/dominikh/prj/go/src/runtime/mgc.go:667
runtime.mallocgc
/home/dominikh/prj/go/src/runtime/malloc.go:1242
runtime.newobject
/home/dominikh/prj/go/src/runtime/malloc.go:1324
runtime/trace.StartRegion
/home/dominikh/prj/go/src/runtime/trace/annotation.go:158
main.fn1
/home/dominikh/prj/src/example.com/foo.go:20
Where foo.go:20 is rg := trace.StartRegion(context.Background(), "foo region").
The Go version was go version devel go1.21-e6ec2a34dc Thu Jul 6 20:35:47 2023 +0000 linux/amd64
Thanks! The source of that is the notetsleepg in gcBgMarkStartWorkers, which calls entersyscallblock. This case is kind of interesting, because it is a blocking syscall (the thread blocks in FUTEX_WAIT, but the goroutine gets descheduled and its P is released). It's just one invoked by the runtime itself.
I'd argue maybe this one is working as intended? I see two possible problems here; one is how the stack gets cut off at the runtime's syscall entry: that makes sense for something like the syscall package, but it's not right for this case, since the stack trace makes it appear as if the GC somehow represents a syscall when it's just a side-effect.
Maybe we capture more stack frames? That makes noisiness potentially worse, but in this case I think it would help a lot to identify "OK, it's just the GC spinning up workers and waiting for them to start" from the extra stack frames. TBH, hiding the extra stack frames and/or reducing noise from runtime-specific information seems more like a presentation problem, and I'd be inclined to hide fewer frames in the trace itself (and leave it up to the UI to hide things), but I'm biased since those frames are ~always useful to me personally. :)
(From a performance perspective, I doubt including those frames would hurt much. Stack traces are likely to be just as dedup-able and the extra few frames would only increase trace size slightly.)
Based on your analysis I agree that this is working as intended. I'm also in favor of capturing more stack frames. With a proper stack, this event might even be useful to users, since it's a possible source of latency. And if it's too noisy, the UI should allow filtering it away.
TBH hiding the extra stack frames and/or reducing noise from runtime-specific information seems more like a presentation problem
It wouldn't be the first time that Gotraceui analyses stack traces, either. I already do so to walk backwards out of the runtime for events such as blocking channel sends (the user doesn't care that this happens in runtime.chansend1).
Attendees: @mknyszek @prattmic @rhysh @bboreham @dominikh
Just a small comment,
Rhys: One problem is that they look like they're reentrant until they're not. A useful tool would be to identify if you're assuming the lock is reentrant.
https://github.com/sasha-s/go-deadlock does this (at runtime) by recording and comparing the stack frames seen for each (manually instrumented) mutex (very expensive, we only use it in nightly stress testing). Of course this requires observing a re-entrant lock acquisition. It seems difficult to do this sort of thing statically, but even doing it cheaply enough to have it always on would be helpful.
@tbg We discussed detection briefly. It is unclear to me how to do robust deadlock detection given that lock and unlock are not required to be on the same goroutine. e.g., consider:
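The concrete steps aren't reproduced above, so here is a plausible sketch of the kind of sequence being described (my reconstruction, not the original example): goroutine 1 locks a mutex and then blocks trying to lock it again, while goroutine 2 is the one responsible for unlocking it.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.Mutex

	mu.Lock() // step 1: goroutine 1 acquires the lock

	go func() {
		// step 2: goroutine 2 releases the lock on goroutine 1's behalf
		// (sync.Mutex explicitly allows unlocking from a different goroutine).
		mu.Unlock()
	}()

	// step 3: goroutine 1 blocks here until goroutine 2 runs. A checker that
	// assumes lock and unlock happen on the same goroutine would see this as
	// a re-entrant acquisition and report a deadlock, but it is not one as
	// long as goroutine 2 is guaranteed to make progress.
	mu.Lock()
	fmt.Println("acquired again")
	mu.Unlock()
}
```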
Step 3 naively looks like a deadlock, but as long as goroutine 2 is guaranteed to make progress it is not. How does go-deadlock handle this situation?
It gives a false positive:
It's difficult to handle locks that pass between goroutines. Which is a shame, because it's also pretty rare.
Attendees: @mknyszek @rhysh @nsrip-dd @felixge @bboreham @dominikh
Attendees: @mknyszek @aclements @felixge @nsrip-dd @rhysh @dominikh
Attendees: @mknyszek @rhysh @bboreham
As the Go user base grows, more and more Go developers are seeking to understand the performance of their programs and reduce resource costs. However, they are locked into the relatively limited diagnostic tools we provide today. Some teams build their own tools, but right now that requires a large investment. This issue extends to the Go team as well, where we often put significant effort into ad-hoc performance tooling to analyze the performance of Go itself.
This issue is a tracking issue for improving the state of Go runtime diagnostics and its tooling, focusing primarily on runtime/trace traces and heap analysis tooling.

To do this work, we the Go team are collaborating with @felixge and @nsrip-dd and with input from others in the Go community. We currently have a virtual sync every 2 weeks (starting 2022-12-07), Thursdays at 11 AM NYC time. Please ping me at mknyszek -- at -- golang.org for an invite if you're interested in attending. This issue will be updated regularly with meeting notes from those meetings.

Below is what we currently plan to work on and explore, organized by broader effort and roughly prioritized. Note that this almost certainly will change as work progresses and more may be added.

Runtime tracing
* Tracing usability
* Tracing performance
* gentraceback refactoring (#54466) (CC @felixge, @aclements)

Heap analysis (see #57447)
* Get viewcore's internal core file libraries (gocore and core) to work with Go at tip.
* Ensure gocore and core are well-tested, and tested at tip.
* Make gocore and core externally-visible APIs, allowing Go developers to build on top of it.

CC @aclements @prattmic @felixge @nsrip-dd @rhysh @dominikh