golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: runtime/pprof: add data-type profiling #69699

Open florianl opened 1 month ago

florianl commented 1 month ago

Proposal Details

Static analysis can already help improve the memory layout of Go structs by reordering fields and reducing padding. This can make struct field access more efficient, because the fields within the struct are better aligned. Combined with dead-code analysis, static analysis can also identify unused struct fields and so help reduce the size of structs.
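
To make the padding effect concrete, here is a minimal sketch. The types `padded` and `packed` and the helper `layoutSizes` are hypothetical names made up for this illustration, and the sizes in the comments assume a 64-bit platform such as linux/amd64.

```go
package main

import (
	"fmt"
	"unsafe"
)

// padded interleaves small and large fields, so the compiler has to
// insert padding to keep b aligned (sizes assume a 64-bit platform).
type padded struct {
	a bool   // 1 byte, followed by padding before b
	b uint64 // 8 bytes
	c bool   // 1 byte, followed by trailing padding
}

// packed holds the same fields, ordered largest first.
type packed struct {
	b uint64 // 8 bytes
	a bool   // 1 byte
	c bool   // 1 byte, followed by trailing padding
}

// layoutSizes returns the sizes of padded and packed in bytes.
func layoutSizes() (uintptr, uintptr) {
	return unsafe.Sizeof(padded{}), unsafe.Sizeof(packed{})
}

func main() {
	p, q := layoutSizes()
	fmt.Printf("padded: %d bytes, packed: %d bytes\n", p, q)
	fmt.Printf("offset of padded.c: %d, offset of packed.c: %d\n",
		unsafe.Offsetof(padded{}.c), unsafe.Offsetof(packed{}.c))
}
```

Static analysis can find the smaller layout, but it cannot tell which fields are accessed together at run time. That is the gap a data-type profile would fill.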

This proposal aims to bring the ideas from Data-type profiling for perf to Go's pprof ecosystem as a Go-native approach. It is already possible today to do data-type profiling with perf on Unix systems, reorder structs accordingly, and benefit from the resulting performance improvements.

Introduce a new runtime/pprof Profile that tracks the number of read/write accesses to fields within a Go struct.

The report of this new runtime/pprof Profile should enable users to identify frequently accessed fields within a struct, so that they can reorder struct fields to improve the memory efficiency of their application.

Example report for a Go struct, generated by the approach described in Data-type profiling for perf:

Annotate type: 'struct runtime.mspan' (654 samples)
Percent     Offset       Size  Field
 100.00          0        160  struct runtime.mspan {
   0.00          0          0      internal/runtime/sys.NotInHeap   _ {
   0.00          0          0          internal/runtime/sys.nih     _;
                                   };
   1.05          0          8      runtime.mspan*   next;
   0.00          8          8      runtime.mspan*   prev;
   0.23         16          8      runtime.mSpanList*       list;
  41.18         24          8      uintptr  startAddr;
   2.30         32          8      uintptr  npages;
   0.19         40          8      runtime.gclinkptr        manualFreeList;
   1.74         48          2      uint16   freeindex;
   1.57         50          2      uint16   nelems;
   0.23         52          2      uint16   freeIndexForScan;
   1.82         56          8      uint64   allocCache;
   1.56         64          8      runtime.gcBits*  allocBits;
   5.51         72          8      runtime.gcBits*  gcmarkBits;
   0.42         80          8      runtime.gcBits*  pinnerBits;
   1.54         88          4      uint32   sweepgen;
   4.58         92          4      uint32   divMul;
   2.70         96          2      uint16   allocCount;
  12.49         98          1      runtime.spanClass        spanclass;
   0.00         99          1      runtime.mSpanStateBox    state {
   0.00         99          1          internal/runtime/atomic.Uint8        s {
   0.00         99          0              internal/runtime/atomic.noCopy   noCopy;
   0.00         99          1              uint8    value;
                                       };
                                   };
   1.69        100          1      uint8    needzero;
   0.11        101          1      bool     isUserArenaChunk;
   0.23        102          2      uint16   allocCountBeforeCache;
  18.64        104          8      uintptr  elemsize;
   0.00        112          8      uintptr  limit;
   0.00        120          8      runtime.mutex    speciallock {
   0.00        120          0          runtime.lockRankStruct       lockRankStruct;
   0.00        120          8          uintptr      key;
                                   };
   0.22        128          8      runtime.special* specials;
   0.00        136         16      runtime.addrRange        userArenaChunkFree {
   0.00        136          8          runtime.offAddr      base {
   0.00        136          8              uintptr  a;
                                       };
   0.00        144          8          runtime.offAddr      limit {
   0.00        144          8              uintptr  a;
                                       };
                                   };
   0.00        152          8      internal/abi.Type*       largeType;
                               };

The example above reports field accesses of the Go-internal struct mspan while running the benchmarks in net/http with go version devel go1.24-eb6f2c24cd Sat Sep 28 01:07:09 2024 +0000 linux/amd64.
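
As a rough sketch of how such a report could be assembled, the program below attributes sampled byte offsets to the top-level fields of a struct using reflect. The `span` type, the helpers `fieldAt` and `attribute`, and the sample offsets are all made up for illustration. In a real profile the offsets would come from sampled memory accesses; this is not how perf or any existing Go tool is implemented.

```go
package main

import (
	"fmt"
	"reflect"
)

// span is a hypothetical, much reduced stand-in for a profiled struct.
type span struct {
	next      *span
	startAddr uintptr
	npages    uintptr
	freeindex uint16
	nelems    uint16
	elemsize  uintptr
}

// spanType is the reflected layout of span.
var spanType = reflect.TypeOf(span{})

// fieldAt returns the name of the top-level field of t that covers the
// byte offset off, or "" if off falls into padding or outside the struct.
func fieldAt(t reflect.Type, off uintptr) string {
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if off >= f.Offset && off < f.Offset+f.Type.Size() {
			return f.Name
		}
	}
	return ""
}

// attribute counts how many of the sampled offsets hit each field.
func attribute(t reflect.Type, offsets []uintptr) map[string]int {
	counts := make(map[string]int)
	for _, off := range offsets {
		counts[fieldAt(t, off)]++
	}
	return counts
}

func main() {
	start := spanType.Field(1).Offset // startAddr
	elem := spanType.Field(5).Offset  // elemsize
	// Made-up samples: three hits on startAddr, one on elemsize.
	samples := []uintptr{start, start, start, elem}
	counts := attribute(spanType, samples)

	fmt.Printf("%7s %7s %5s  %s\n", "Percent", "Offset", "Size", "Field")
	for i := 0; i < spanType.NumField(); i++ {
		f := spanType.Field(i)
		pct := 100 * float64(counts[f.Name]) / float64(len(samples))
		fmt.Printf("%7.2f %7d %5d  %s\n", pct, f.Offset, f.Type.Size(), f.Name)
	}
}
```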

Alternative

Instead of introducing a new runtime/pprof Profile, an approach similar to go build -cover could be used. At build time, accesses to fields in Go structs could be instrumented, and a report could be generated when the resulting Go binary is executed. go tool cover could then use that report to show how many times each field in a struct was accessed.
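
To illustrate what such instrumentation might amount to, here is a hand-written sketch. The counters, the field index constants, and the `report` helper are all hypothetical. No such instrumentation exists in the Go toolchain today, and a real implementation would generate this code rather than have the user write it.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// request is a hypothetical struct whose field accesses are counted.
type request struct {
	id   uint64
	path string
}

// Field indexes for request. An instrumenting build would generate these.
const (
	fieldID = iota
	fieldPath
	numFields
)

// requestFieldNames maps field indexes to field names for reporting.
var requestFieldNames = [numFields]string{"id", "path"}

// requestAccess holds one access counter per field of request.
var requestAccess [numFields]atomic.Uint64

// handle reads r.path and r.id once each. The counter increments stand in
// for what build-time instrumentation would insert before every access.
func handle(r *request) int {
	requestAccess[fieldPath].Add(1)
	n := len(r.path)
	requestAccess[fieldID].Add(1)
	if r.id%2 == 0 {
		n *= 2
	}
	return n
}

// report returns the current access count for each field, keyed by name.
func report() map[string]uint64 {
	out := make(map[string]uint64, numFields)
	for i := range requestAccess {
		out[requestFieldNames[i]] = requestAccess[i].Load()
	}
	return out
}

func main() {
	handle(&request{id: 1, path: "/a"})
	handle(&request{id: 2, path: "/bc"})
	fmt.Println(report())
}
```

Unlike a sampled PMU-based profile, counting every access this way gives exact numbers, at the price of extra work on each instrumented field access.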

Question

I lack the knowledge of Go runtime internals needed to provide a proof of concept with this proposal.

prattmic commented 1 month ago

This is a very intriguing type of profile I’ve never heard of before.

Do you intend that this profile would work the same way as the linked perf profile type? That is, using a precise “memory access” (or memory load) hardware PMU metric.

Aside: I found the patch message to be the most straightforward and concise summary of how that profile works: https://lwn.net/Articles/954938/

Along those lines, do you know if the existing perf profile works on Go programs? I don’t see fundamental reasons it shouldn’t, but we may be missing some DWARF. So even if we don’t add a profile to runtime/pprof, fixing up problems with perf profiles may be doable.

prattmic commented 1 month ago

cc @golang/runtime

florianl commented 1 month ago

> Do you intend that this profile would work the same way as the linked perf profile type? That is, using a precise “memory access” (or memory load) hardware PMU metric.

I think basing this new profile on PMU metrics would improve its accuracy. I lack the knowledge of Go runtime internals to tell whether it could be implemented without PMU metrics.

> Along those lines, do you know if the existing perf profile works on Go programs?

I use perf whenever it is available, and so far I haven't run into issues or missed any information when profiling Go executables. The example of struct runtime.mspan in the initial post of this proposal was generated by perf. To my knowledge, perf is not available on every OS; for example, I'm not aware of perf on Windows. perf is also often not deployed to production systems. The Go ecosystem would therefore benefit from the insights of this new profile if it were integrated natively.

prattmic commented 1 month ago

It's great to hear that the perf tool seems to work well.

#36821 and #53286 cover providing PMU-based profiles in Go, though those are targeted at the more typical profiles (cycles, instructions, etc.). Cross-platform support is discussed there as well. I believe the summary is that Linux of course has the perf events API; Windows has an API, though none of us are familiar with it; and macOS does not seem to have a (public) API at all.