golang / go

The Go programming language
https://go.dev

cmd/compile: -bench should correct for GC #17434

Open mdempsky opened 8 years ago

mdempsky commented 8 years ago

Currently -bench output is very sensitive to GC effects. For example:

  1. Changing allocations in phase A might cause a GC cycle to shift from phase B to phase C, which can look like an improvement for phase B and a regression for phase C.
  2. Reducing long-lived memory pressure from earlier phases gets credited to later phases, as the later phases benefit from reduced GC costs.

This makes it hard to isolate the performance impact of frontend changes from that of backend changes.

I'm considering a few possible improvements to -bench:

  1. Record GC pause times, and subtract them from phase times.
  2. Record allocation stats for each phase (see the sketch after this list).
  3. Explicit GC cycle between FE and BE so we can measure how much live memory the FE has left for the BE to work with.
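
To make (2) concrete, here is a rough sketch of what per-phase allocation stats might look like; the measurePhase helper and names are hypothetical, not the existing -bench code:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// measurePhase is a hypothetical helper: it runs fn and reports the phase's
// wallclock time plus the bytes and objects allocated while fn ran, by
// diffing runtime.MemStats snapshots taken before and after.
func measurePhase(name string, fn func()) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	start := time.Now()

	fn()

	elapsed := time.Since(start)
	runtime.ReadMemStats(&after)
	fmt.Printf("%s: %v, %d bytes allocated, %d allocs\n",
		name, elapsed,
		after.TotalAlloc-before.TotalAlloc,
		after.Mallocs-before.Mallocs)
}

func main() {
	// Stand-in workload; the real compiler would run a phase here.
	measurePhase("example", func() {
		var s []int
		for i := 0; i < 1e6; i++ {
			s = append(s, i)
		}
		_ = s
	})
}
```

One caveat: runtime.ReadMemStats briefly stops the world to get a consistent snapshot, so the instrumentation itself has a small cost.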

Any other suggestions and/or implementation advice?

/cc @griesemer @rsc @aclements

aclements commented 8 years ago

> Record GC pause times, and subtract them from phase times.

I don't see why this would help. GC pause times are close to 0 and getting closer. It's not the pauses that are the problem, it's the CPU taken away from the compiler during the concurrent phase.

> Record allocation stats for each phase.

This seems like a good idea in general.

> Explicit GC cycle between FE and BE so we can measure how much live memory the FE has left for the BE to work with.

Adding explicit GCs between phases seems necessary if you're going to isolate the performance of the different phases. I'm not sure what you mean by "live memory the FE has left" since live memory isn't something that's left, but I don't think this is about measurement anyway. Doing an explicit GC between phases resets the pacing, so the scheduling of GCs within each phase becomes much closer to independent of the other phases (not exactly independent, since a change in the live memory remaining after an earlier phase can still affect the GC scheduling in a later phase, but you'll be much closer).

mdempsky commented 8 years ago

> I don't see why this would help. GC pause times are close to 0 and getting closer. It's not the pauses that are the problem, it's the CPU taken away from the compiler during the concurrent phase.

I see. I said "GC pause times" just because that's the only time duration that I could see in runtime.MemStats or runtime/debug.GCStats, and I naively assumed it somehow represented total GC overhead. I guess it actually means only STW time?

Is there a good way to measure CPU cost from the concurrent phase? Also, currently I think we only measure per-phase wallclock time. I wonder if we need to measure per-phase CPU-seconds instead, since the GC is concurrent (and possibly the compiler itself will be too, in the future).

> I'm not sure what you mean by "live memory the FE has left" since live memory isn't something that's left, but I don't think this is about measurement anyway.

I meant (for example) to make an explicit runtime.GC() call at the end of the frontend phases and record the runtime.MemStats.Heap{Alloc,Objects} values. The hypothesis being that 1) they represent how much data the FE has allocated that will continue to remain live throughout the BE phases, and 2) improving those numbers should reduce the amount of GC work necessary during the BE phases. Is that sound, or is my model of GC effects too naive?
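
Concretely, the measurement described above might look something like the following sketch, where runFrontend and runBackend are placeholder names rather than real compiler entry points:

```go
package main

import (
	"fmt"
	"runtime"
)

func runFrontend() { /* placeholder: parsing, typechecking, ... */ }
func runBackend()  { /* placeholder: SSA, codegen, ... */ }

func main() {
	runFrontend()

	// Collect the FE's garbage so the stats below reflect only memory
	// that stays live into the backend phases.
	runtime.GC()

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("live after FE: %d bytes in %d objects\n",
		ms.HeapAlloc, ms.HeapObjects)

	runBackend()
}
```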

aclements commented 8 years ago

> I see. I said "GC pause times" just because that's the only time duration that I could see in runtime.MemStats or runtime/debug.GCStats, and I naively assumed it somehow represented total GC overhead. I guess it actually means only STW time?

Right. The only thing in MemStats that accounts for concurrent GC time is GCCPUFraction, but I don't think that would help here.

> Also, currently I think we only measure per-phase wallclock time. I wonder if we need to measure per-phase CPU-seconds instead, since the GC is concurrent (and possibly the compiler itself will be too, in the future).

I'm not so sure. What people generally care about when they're running the compiler is how long it took, not how many CPU-seconds it took.

> I meant (for example) to make an explicit runtime.GC() call at the end of the frontend phases and record the runtime.MemStats.Heap{Alloc,Objects} values. The hypothesis being that 1) they represent how much data the FE has allocated that will continue to remain live throughout the BE phases, and 2) improving those numbers should reduce the amount of GC work necessary during the BE phases. Is that sound, or is my model of GC effects too naive?

I think that's a good thing to measure; however, the effect is somewhat secondary to just how much the BE itself allocates. To first order, if the FE retained set doubles, each GC during the BE will cost twice as much, but they will happen half as often, so the total cost doesn't change. (It does matter at second order, since longer GCs are less efficient because of write barrier overheads and floating garbage.)

However, my point about running a GC between phases just to reset the GC pacing still stands. Imagine the GC runs exactly 1 second, 2 seconds, etc. after the process starts; if you change how long some phase takes, all of the later phases will line up with those GC ticks differently, causing fluctuations in their measured performance. runtime.GC() lets you reset that clock, so if you do it at the beginning of each compiler phase, only that phase's own behavior matters for its own measurement. (The GC actually paces itself in logical "heap time" rather than wallclock time, but the analogy is quite close.)

quentinmit commented 8 years ago

It seems like -bench should turn on an explicit GC at the end of each phase, counted against that phase's timing.
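
A rough sketch of that shape, with hypothetical names rather than the actual -bench code; because every phase then starts right after a forced collection, this also gives the pacing reset discussed above:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// timePhase runs fn, then forces a GC that is charged to fn's phase, so the
// cost of cleaning up a phase's garbage shows up in that phase's own time
// and the next phase starts with freshly reset GC pacing.
func timePhase(name string, fn func()) {
	start := time.Now()
	fn()
	runtime.GC()
	fmt.Printf("%s: %v\n", name, time.Since(start))
}

func main() {
	timePhase("frontend", func() { /* parse, typecheck */ })
	timePhase("backend", func() { /* ssa, codegen */ })
}
```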