llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.04k stars 11.58k forks source link

perf2bolt maps linux perf samples incorrectly for small binaries #109384

Open kbeyls opened 3 hours ago

kbeyls commented 3 hours ago

When experimenting with perf2bolt on small programs, it looks like at least sometimes it does not map samples to the correct instructions.

The root cause seems to be a failing heuristic in perf2bolt; more precisely the heuristic used to map instruction addresses in perf samples (i.e. addresses as they appear in running processes) to addresses of instructions in the ELF binary.

Detailed description of experiment and observations

I've observed this on at least almabench from the llvm test-suite.

I observed this on an AWS Graviton4 machine running Ubuntu 22.04.1; building the test-suite with the following optimization flags -O3 -flto=thin -g -fuse-ld=lld -Wl,--emit-relocs.

Perf samples were collected as follows:

$ cd SingleSource/Benchmarks/CoyoteBench
$ perf record -e cycles:u  ./almabench

perf2bolt was invoked as follows:

$ ~/dev/llvm/build_rel/bin/perf2bolt -p perf.data -o almabench.perf.perf2bolt.fdata -nl ./almabench
BOLT-INFO: shared object or position-independent executable detected
PERF2BOLT: Starting data aggregation job for perf.data
PERF2BOLT: spawning perf job to read events without LBR
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: aarch64
BOLT-INFO: BOLT version: 86e5c5468ae3fcd65b23fd7b3cb0182e676829bd
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x200000, offset 0x200000
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling strict relocation mode for aggregation purposes
BOLT-INFO: pre-processing profile using perf data aggregator
BOLT-WARNING: build-id will not be checked because we could not read one from input binary
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 1 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parsing basic events (without LBR)...
PERF2BOLT: waiting for perf mem events collection to finish...
PERF2BOLT: processing basic events (without LBR)...
PERF2BOLT: read 13429 samples
PERF2BOLT: out of range samples recorded in unknown regions: 13429 (100.0%)

 !! WARNING !! This high mismatch ratio indicates the input binary is probably not the same binary used during profiling collection. The generated data may b
e ineffective for improving performance.

PERF2BOLT: wrote 0 objects and 0 memory objects to almabench.perf.perf2bolt.fdata
BOLT-INFO: 0 out of 11 functions in the binary (0.0%) have non-empty execution profile

Comparing with what perf script shows

$ perf script --show-mmap-events | wc -l
13434
$ perf script --show-mmap-events
       almabench 1356276 446583.793818: PERF_RECORD_MMAP2 1356276/1356276: [0xaaaaaaab1000(0x2000) @ 0x1000 103:01 4733784 3684065051]: r-xp /home/kribey01/d
ev/llvm/build_test_suite-LTOthin-O3-bolt/SingleSource/Benchmarks/CoyoteBench/almabench
       almabench 1356276 446583.793832: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7fc2000(0x3e000) @ 0 103:01 21002 578735524]: r-xp /usr/lib/aarch64-linux-
gnu/ld-linux-aarch64.so.1
       almabench 1356276 446583.793844: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7ffb000(0x1000) @ 0 00:00 0 0]: r-xp [vdso]
       almabench 1356276 446583.793873:          1 cycles:u:  ffffdb66ec538f08 [unknown] ([unknown])
       almabench 1356276 446583.793895:          1 cycles:u:      fffff7fd9c40 _start+0x0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.793912:          1 cycles:u:      fffff7fda610 _dl_start+0x0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.793929:          3 cycles:u:  ffffdb66ec5389d8 [unknown] ([unknown])
       almabench 1356276 446583.793947:          9 cycles:u:  ffffdb66ec5389d8 [unknown] ([unknown])
       almabench 1356276 446583.793966:         25 cycles:u:      fffff7fda648 _dl_start+0x38 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.793985:         66 cycles:u:      fffff7fdaa5c _dl_start+0x44c (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794002:        167 cycles:u:      fffff7fda6d0 _dl_start+0xc0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794020:        414 cycles:u:  ffffdb66ec5389d8 [unknown] ([unknown])
       almabench 1356276 446583.794040:       1145 cycles:u:      fffff7fda8f8 _dl_start+0x2e8 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794056:       2965 cycles:u:      fffff7fd5a40 __GI___tunables_init+0xe0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794078:       7286 cycles:u:      fffff7fdb978 dl_main+0x9f8 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794121: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7f10000(0x97000) @ 0 103:01 21020 4096449059]: r-xp /usr/lib/aarch64-linux-gnu/libm.so.6
       almabench 1356276 446583.794150:      19937 cycles:u:      fffff7fcd6cc _dl_name_match_p+0x2c (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794167: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7d60000(0x1a9000) @ 0 103:01 21006 1342113607]: r-xp /usr/lib/aarch64-linux-gnu/libc.so.6
       almabench 1356276 446583.794228:      45966 cycles:u:      fffff7fd0e80 _dl_relocate_object+0x660 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794305:      60218 cycles:u:      fffff7f426a8 __atan2_finite@GLIBC_2.17+0x3d8 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794348:      76663 cycles:u:      fffff7f1f340 __atan2+0x40 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794399:      98506 cycles:u:      fffff7f1f340 __atan2+0x40 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794471:     157477 cycles:u:      fffff7f487d8 __cos+0x8 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794571:     234290 cycles:u:      fffff7f4249c __atan2_finite@GLIBC_2.17+0x1cc (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794696:     306140 cycles:u:      fffff7f1f5b0 __fmod+0x30 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794842:     364129 cycles:u:      fffff7f48ba0 __cos+0x3d0 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795004:     409635 cycles:u:      fffff7f48808 __cos+0x38 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795179:     446157 cycles:u:      fffff7f1f324 __atan2+0x24 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795365:     476437 cycles:u:      fffff7f48280 __sin+0x1e0 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795588:     501967 cycles:u:      fffff7f48a54 __cos+0x284 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795777:     483010 cycles:u:      fffff7f1f340 __atan2+0x40 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795968:     490104 cycles:u:      aaaaaaab2788 planetpv+0x934 (/home/kribey01/dev/llvm/build_test_suite-LTOthin-O3-bolt/SingleSource/Benchmarks/CoyoteBench/almabench)
       almabench 1356276 446583.796166:     510048 cycles:u:      fffff7f1f33c __atan2+0x3c (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.796371:     529819 cycles:u:      fffff7f4275c __atan2_finite@GLIBC_2.17+0x48c (/usr/lib/aarch64-linux-gnu/libm.so.6)
...

When inspecting the list of events seen manually, it does seem like most samples are in sin, cos, atan2 and similar functions in the libm library. It is expected that perf2bolt wouldn't be able to map these samples back to the almabench binary, as they are in a dynamically loaded library. However, there are at least some samples on instructions that are located in the almabench main code section, for example the 3rd line from the bottom above. It's a sample seen on address aaaaaaab2788, and perf script says this is 0x934 bytes after the start of function planetpv in the binary. This sample and other samples like it should be mappable by perf2bolt, but somehow they're not.

After digging in using print-style debugging, it seems to me that the root cause is the code in BinaryContext::getBaseAddressForMapping. The gist of that code is as follows:

BinaryContext::getBaseAddressForMapping(uint64_t MMapAddress,
                                        uint64_t FileOffset) const {
  // Find a segment with a matching file offset.
  for (auto &KV : SegmentMapInfo) {
    const SegmentInfo &SegInfo = KV.second;
    if (alignDown(SegInfo.FileOffset, SegInfo.Alignment) ==
        alignDown(FileOffset, SegInfo.Alignment)) {
      return MMapAddress - (SegInfo.Address - SegInfo.FileOffset + FileOffset);
    }
  }
  return std::nullopt;
}

In other words the heuristic used to try and map the start address of a segment in a running process MMapAddress, and the associated FileOffset for it, seems to be checking whether a segment in the ElfFile has the following properties:

alignDown(SegInfo.FileOffset, SegInfo.Alignment) ==
alignDown(FileOffset, SegInfo.Alignment)

In the AArch64 running example, SegInfo.Alignment is 0x10000. FileOffset is

  1. It seems this heuristic ends up working for all 4 segments in the almabench binary, rather than just 1. Presumably because this is a small rather than a very large binary, where multiple segments fit within the same "alignment granule" of 0x10000?
gBAFM(MMapAddres aaaaaaab1000, FileOffset 1000):
.. MapStartAddress 0
.. alignDown(SegInfo.FileOffset 0, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0
.. MapStartAddress 11d40
.. alignDown(SegInfo.FileOffset 1d40, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0
.. MapStartAddress 22f20
.. alignDown(SegInfo.FileOffset 2f20, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0
.. MapStartAddress 33110
.. alignDown(SegInfo.FileOffset 3110, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0
.. Found 4 matching segments:
   0. SegmentInfo { Address: 0x0, Size: 0x1d1c, FileOffset: 0x0, FileSize: 0x1d1c, Alignment: 0x10000}
   1. SegmentInfo { Address: 0x11d40, Size: 0x11e0, FileOffset: 0x1d40, FileSize: 0x11e0, Alignment: 0x10000}
   2. SegmentInfo { Address: 0x22f20, Size: 0x10e0, FileOffset: 0x2f20, FileSize: 0x1f0, Alignment: 0x10000}
   3. SegmentInfo { Address: 0x33110, Size: 0x89, FileOffset: 0x3110, FileSize: 0x88, Alignment: 0x10000}
.. gBAFM(MMapAddress aaaaaaab1000, FileOffset 1000):
.... result = MMapAddress(aaaaaaab1000) - (SegInfo.Address(0) - SegInfo.FileOffset(0) + FileOffset(1000)) -> aaaaaaab0000

The code selects the first segment, which is not the text segment (the second segment is the text segment).

This heuristic should be improved. Even better, could this be made deterministic rather than based on a heuristic? How does linux perf compute this (presumably in a non-heuristic way)?

It seems perf computes this based on the name (and/or buildid?) of the binary, see roughly: https://github.com/torvalds/linux/blob/839c4f596f898edc424070dc8b517381572f8502/tools/perf/util/map.c#L166

Would it be OK to use filename and/or buildid too in perf2bolt? Presumably this should work unless the binary file moves around between profiling and bolt optimization?

https://github.com/llvm/llvm-project/blob/main/bolt/README.md#step-2-convert-profile-to-bolt-format documents that perf2bolt is expected to work when provided "a copy of the binary that was running". So matching based on filename of the binary isn't going to work well in general. https://github.com/torvalds/linux/blob/839c4f596f898edc424070dc8b517381572f8502/tools/perf/util/event.c#L337 indicates that perf can print out the buildid on a PERF_RECORD_MMAP2 event, but only when available. When not available, it seems it prints out the inode of the binary file.

In my setup, the PERF_RECORD_MMAP2 records don't contain build ids. Therefore it seems that just relying on build ids might not solve this issue "out of the box" for most users?

Another idea to improve the heuristic is that segments in ELF files contain flags to indicate whether they are (amongst others) executable. In the running example, I see objdump -a ./almabench contains:

...
LOAD off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**16
     filesz 0x0000000000001d18 memsz 0x0000000000001d18 flags r--
LOAD off    0x0000000000001d40 vaddr 0x0000000000011d40 paddr 0x0000000000011d40 align 2**16
     filesz 0x00000000000011e0 memsz 0x00000000000011e0 flags r-x
LOAD off    0x0000000000002f20 vaddr 0x0000000000022f20 paddr 0x0000000000022f20 align 2**16
     filesz 0x00000000000001f0 memsz 0x00000000000010e0 flags rw-
LOAD off    0x0000000000003110 vaddr 0x0000000000033110 paddr 0x0000000000033110 align 2**16
     filesz 0x0000000000000088 memsz 0x0000000000000089 flags rw-
...

So, if we'd require the segment that matches to be an executable segment, in at least this example, the correct segment would be picked.

I'll be creating a PR that implements this shortly.

A further improvement that could be made is for perf2bolt to produce a warning when its heuristic works for multiple segments in the ELF file, rather than finding just one unique segment to match with the segment seen in process memory during profiling.

llvmbot commented 3 hours ago

@llvm/issue-subscribers-bolt

Author: Kristof Beyls (kbeyls)

When experimenting with `perf2bolt` on small programs, it looks like at least sometimes it does not map samples to the correct instructions. The root cause seems to be a failing heuristic in `perf2bolt`; more precisely the heuristic used to map instruction addresses in perf samples (i.e. addresses as they appear in running processes) to addresses of instructions in the ELF binary. ### Detailed description of experiment and observations I've observed this on at least [almabench](https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/CoyoteBench/almabench.c) from the [llvm test-suite](https://github.com/llvm/llvm-test-suite/). I observed this on an AWS Graviton4 machine running Ubuntu 22.04.1; building the test-suite with the following optimization flags `-O3 -flto=thin -g -fuse-ld=lld -Wl,--emit-relocs`. Perf samples were collected as follows: ``` $ cd SingleSource/Benchmarks/CoyoteBench $ perf record -e cycles:u ./almabench ``` `perf2bolt` was invoked as follows: ``` $ ~/dev/llvm/build_rel/bin/perf2bolt -p perf.data -o almabench.perf.perf2bolt.fdata -nl ./almabench BOLT-INFO: shared object or position-independent executable detected PERF2BOLT: Starting data aggregation job for perf.data PERF2BOLT: spawning perf job to read events without LBR PERF2BOLT: spawning perf job to read mem events PERF2BOLT: spawning perf job to read process events PERF2BOLT: spawning perf job to read task events BOLT-INFO: Target architecture: aarch64 BOLT-INFO: BOLT version: 86e5c5468ae3fcd65b23fd7b3cb0182e676829bd BOLT-INFO: first alloc address is 0x0 BOLT-INFO: creating new program header table at address 0x200000, offset 0x200000 BOLT-INFO: enabling relocation mode BOLT-INFO: enabling strict relocation mode for aggregation purposes BOLT-INFO: pre-processing profile using perf data aggregator BOLT-WARNING: build-id will not be checked because we could not read one from input binary PERF2BOLT: waiting for perf mmap events collection to finish... PERF2BOLT: parsing perf-script mmap events output PERF2BOLT: waiting for perf task events collection to finish... PERF2BOLT: parsing perf-script task events output PERF2BOLT: input binary is associated with 1 PID(s) PERF2BOLT: waiting for perf events collection to finish... PERF2BOLT: parsing basic events (without LBR)... PERF2BOLT: waiting for perf mem events collection to finish... PERF2BOLT: processing basic events (without LBR)... PERF2BOLT: read 13429 samples PERF2BOLT: out of range samples recorded in unknown regions: 13429 (100.0%) !! WARNING !! This high mismatch ratio indicates the input binary is probably not the same binary used during profiling collection. The generated data may b e ineffective for improving performance. PERF2BOLT: wrote 0 objects and 0 memory objects to almabench.perf.perf2bolt.fdata BOLT-INFO: 0 out of 11 functions in the binary (0.0%) have non-empty execution profile ``` #### Comparing with what `perf script` shows ``` $ perf script --show-mmap-events | wc -l 13434 $ perf script --show-mmap-events almabench 1356276 446583.793818: PERF_RECORD_MMAP2 1356276/1356276: [0xaaaaaaab1000(0x2000) @ 0x1000 103:01 4733784 3684065051]: r-xp /home/kribey01/d ev/llvm/build_test_suite-LTOthin-O3-bolt/SingleSource/Benchmarks/CoyoteBench/almabench almabench 1356276 446583.793832: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7fc2000(0x3e000) @ 0 103:01 21002 578735524]: r-xp /usr/lib/aarch64-linux- gnu/ld-linux-aarch64.so.1 almabench 1356276 446583.793844: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7ffb000(0x1000) @ 0 00:00 0 0]: r-xp [vdso] almabench 1356276 446583.793873: 1 cycles:u: ffffdb66ec538f08 [unknown] ([unknown]) almabench 1356276 446583.793895: 1 cycles:u: fffff7fd9c40 _start+0x0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1) almabench 1356276 446583.793912: 1 cycles:u: fffff7fda610 _dl_start+0x0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1) almabench 1356276 446583.793929: 3 cycles:u: ffffdb66ec5389d8 [unknown] ([unknown]) almabench 1356276 446583.793947: 9 cycles:u: ffffdb66ec5389d8 [unknown] ([unknown]) almabench 1356276 446583.793966: 25 cycles:u: fffff7fda648 _dl_start+0x38 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1) almabench 1356276 446583.793985: 66 cycles:u: fffff7fdaa5c _dl_start+0x44c (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1) almabench 1356276 446583.794002: 167 cycles:u: fffff7fda6d0 _dl_start+0xc0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1) almabench 1356276 446583.794020: 414 cycles:u: ffffdb66ec5389d8 [unknown] ([unknown]) almabench 1356276 446583.794040: 1145 cycles:u: fffff7fda8f8 _dl_start+0x2e8 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1) almabench 1356276 446583.794056: 2965 cycles:u: fffff7fd5a40 __GI___tunables_init+0xe0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1) almabench 1356276 446583.794078: 7286 cycles:u: fffff7fdb978 dl_main+0x9f8 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1) almabench 1356276 446583.794121: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7f10000(0x97000) @ 0 103:01 21020 4096449059]: r-xp /usr/lib/aarch64-linux-gnu/libm.so.6 almabench 1356276 446583.794150: 19937 cycles:u: fffff7fcd6cc _dl_name_match_p+0x2c (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1) almabench 1356276 446583.794167: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7d60000(0x1a9000) @ 0 103:01 21006 1342113607]: r-xp /usr/lib/aarch64-linux-gnu/libc.so.6 almabench 1356276 446583.794228: 45966 cycles:u: fffff7fd0e80 _dl_relocate_object+0x660 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1) almabench 1356276 446583.794305: 60218 cycles:u: fffff7f426a8 __atan2_finite@GLIBC_2.17+0x3d8 (/usr/lib/aarch64-linux-gnu/libm.so.6) almabench 1356276 446583.794348: 76663 cycles:u: fffff7f1f340 __atan2+0x40 (/usr/lib/aarch64-linux-gnu/libm.so.6) almabench 1356276 446583.794399: 98506 cycles:u: fffff7f1f340 __atan2+0x40 (/usr/lib/aarch64-linux-gnu/libm.so.6) almabench 1356276 446583.794471: 157477 cycles:u: fffff7f487d8 __cos+0x8 (/usr/lib/aarch64-linux-gnu/libm.so.6) almabench 1356276 446583.794571: 234290 cycles:u: fffff7f4249c __atan2_finite@GLIBC_2.17+0x1cc (/usr/lib/aarch64-linux-gnu/libm.so.6) almabench 1356276 446583.794696: 306140 cycles:u: fffff7f1f5b0 __fmod+0x30 (/usr/lib/aarch64-linux-gnu/libm.so.6) almabench 1356276 446583.794842: 364129 cycles:u: fffff7f48ba0 __cos+0x3d0 (/usr/lib/aarch64-linux-gnu/libm.so.6) almabench 1356276 446583.795004: 409635 cycles:u: fffff7f48808 __cos+0x38 (/usr/lib/aarch64-linux-gnu/libm.so.6) almabench 1356276 446583.795179: 446157 cycles:u: fffff7f1f324 __atan2+0x24 (/usr/lib/aarch64-linux-gnu/libm.so.6) almabench 1356276 446583.795365: 476437 cycles:u: fffff7f48280 __sin+0x1e0 (/usr/lib/aarch64-linux-gnu/libm.so.6) almabench 1356276 446583.795588: 501967 cycles:u: fffff7f48a54 __cos+0x284 (/usr/lib/aarch64-linux-gnu/libm.so.6) almabench 1356276 446583.795777: 483010 cycles:u: fffff7f1f340 __atan2+0x40 (/usr/lib/aarch64-linux-gnu/libm.so.6) almabench 1356276 446583.795968: 490104 cycles:u: aaaaaaab2788 planetpv+0x934 (/home/kribey01/dev/llvm/build_test_suite-LTOthin-O3-bolt/SingleSource/Benchmarks/CoyoteBench/almabench) almabench 1356276 446583.796166: 510048 cycles:u: fffff7f1f33c __atan2+0x3c (/usr/lib/aarch64-linux-gnu/libm.so.6) almabench 1356276 446583.796371: 529819 cycles:u: fffff7f4275c __atan2_finite@GLIBC_2.17+0x48c (/usr/lib/aarch64-linux-gnu/libm.so.6) ... ``` When inspecting the list of events seen manually, it does seem like most samples are in `sin`, `cos`, `atan2` and similar functions in the `libm` library. It is expected that `perf2bolt` wouldn't be able to map these samples back to the `almabench` binary, as they are in a dynamically loaded library. However, there are at least some samples on instructions that are located in the `almabench` main code section, for example the 3rd line from the bottom above. It's a sample seen on address `aaaaaaab2788`, and `perf script` says this is `0x934` bytes after the start of function `planetpv` in the binary. This sample and other samples like it should be mappable by `perf2bolt`, but somehow they're not. After digging in using print-style debugging, it seems to me that the root cause is the code in [`BinaryContext::getBaseAddressForMapping`](https://github.com/llvm/llvm-project/blob/3e32e455913f0fe56f53f2f5711ea0130473d99a/bolt/lib/Core/BinaryContext.cpp#L2019). The gist of that code is as follows: ``` BinaryContext::getBaseAddressForMapping(uint64_t MMapAddress, uint64_t FileOffset) const { // Find a segment with a matching file offset. for (auto &KV : SegmentMapInfo) { const SegmentInfo &SegInfo = KV.second; if (alignDown(SegInfo.FileOffset, SegInfo.Alignment) == alignDown(FileOffset, SegInfo.Alignment)) { return MMapAddress - (SegInfo.Address - SegInfo.FileOffset + FileOffset); } } return std::nullopt; } ``` In other words the heuristic used to try and map the start address of a segment in a running process `MMapAddress`, and the associated `FileOffset` for it, seems to be checking whether a segment in the ElfFile has the following properties: ``` alignDown(SegInfo.FileOffset, SegInfo.Alignment) == alignDown(FileOffset, SegInfo.Alignment) ``` In the AArch64 running example, `SegInfo.Alignment` is 0x10000. `FileOffset` is 1000. It seems this heuristic ends up working for all 4 segments in the `almabench` binary, rather than just 1. Presumably because this is a small rather than a very large binary, where multiple segments fit within the same "alignment granule" of `0x10000`? ``` gBAFM(MMapAddres aaaaaaab1000, FileOffset 1000): .. MapStartAddress 0 .. alignDown(SegInfo.FileOffset 0, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0 .. MapStartAddress 11d40 .. alignDown(SegInfo.FileOffset 1d40, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0 .. MapStartAddress 22f20 .. alignDown(SegInfo.FileOffset 2f20, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0 .. MapStartAddress 33110 .. alignDown(SegInfo.FileOffset 3110, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0 .. Found 4 matching segments: 0. SegmentInfo { Address: 0x0, Size: 0x1d1c, FileOffset: 0x0, FileSize: 0x1d1c, Alignment: 0x10000} 1. SegmentInfo { Address: 0x11d40, Size: 0x11e0, FileOffset: 0x1d40, FileSize: 0x11e0, Alignment: 0x10000} 2. SegmentInfo { Address: 0x22f20, Size: 0x10e0, FileOffset: 0x2f20, FileSize: 0x1f0, Alignment: 0x10000} 3. SegmentInfo { Address: 0x33110, Size: 0x89, FileOffset: 0x3110, FileSize: 0x88, Alignment: 0x10000} .. gBAFM(MMapAddress aaaaaaab1000, FileOffset 1000): .... result = MMapAddress(aaaaaaab1000) - (SegInfo.Address(0) - SegInfo.FileOffset(0) + FileOffset(1000)) -> aaaaaaab0000 ``` The code selects the first segment, which is not the text segment (the second segment is the text segment). This heuristic should be improved. Even better, could this be made deterministic rather than based on a heuristic? How does linux perf compute this (presumably in a non-heuristic way)? It seems perf computes this based on the name (and/or buildid?) of the binary, see roughly: https://github.com/torvalds/linux/blob/839c4f596f898edc424070dc8b517381572f8502/tools/perf/util/map.c#L166 Would it be OK to use filename and/or buildid too in perf2bolt? Presumably this should work unless the binary file moves around between profiling and bolt optimization? * https://github.com/llvm/llvm-project/blob/main/bolt/README.md#step-2-convert-profile-to-bolt-format documents that perf2bolt is expected to work when provided "a copy of the binary that was running". So matching based on filename of the binary isn't going to work well in general. * https://github.com/torvalds/linux/blob/839c4f596f898edc424070dc8b517381572f8502/tools/perf/util/event.c#L337 indicates that perf can print out the buildid on a PERF_RECORD_MMAP2 event, but only when available. When not available, it seems it prints out the inode of the binary file. In my setup, the PERF_RECORD_MMAP2 records don't contain build ids. Therefore it seems that just relying on build ids might not solve this issue "out of the box" for most users? Another idea to improve the heuristic is that segments in ELF files contain flags to indicate whether they are (amongst others) executable. In the running example, I see `objdump -a ./almabench` contains: ``` ... LOAD off 0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**16 filesz 0x0000000000001d18 memsz 0x0000000000001d18 flags r-- LOAD off 0x0000000000001d40 vaddr 0x0000000000011d40 paddr 0x0000000000011d40 align 2**16 filesz 0x00000000000011e0 memsz 0x00000000000011e0 flags r-x LOAD off 0x0000000000002f20 vaddr 0x0000000000022f20 paddr 0x0000000000022f20 align 2**16 filesz 0x00000000000001f0 memsz 0x00000000000010e0 flags rw- LOAD off 0x0000000000003110 vaddr 0x0000000000033110 paddr 0x0000000000033110 align 2**16 filesz 0x0000000000000088 memsz 0x0000000000000089 flags rw- ... ``` So, if we'd require the segment that matches to be an executable segment, in at least this example, the correct segment would be picked. I'll be creating a PR that implements this shortly. A further improvement that could be made is for `perf2bolt` to produce a warning when its heuristic works for multiple segments in the ELF file, rather than finding just one unique segment to match with the segment seen in process memory during profiling.