foxtran opened this issue 1 year ago
Unfortunately, I reproduced that result on an AMD Ryzen 7 7700.
BASELINE:
Performance counter stats for 'bash -c ninja clang && ninja clean' (5 runs):
30,795,048,649,078 instructions # 0.78 insn per cycle ( +- 0.00% ) (75.02%)
39,645,798,620,353 cycles ( +- 0.01% ) (75.02%)
82,058,692,208 L1-icache-misses ( +- 0.01% ) (75.02%)
48,600,346,966 iTLB-misses ( +- 0.03% ) (75.03%)
512.3711 +- 0.0459 seconds time elapsed ( +- 0.01% )
PROPELLER:
Performance counter stats for 'bash -c ninja clang && ninja clean' (5 runs):
31,738,767,572,617 instructions # 0.62 insn per cycle ( +- 0.00% ) (75.02%)
51,114,393,709,732 cycles ( +- 0.02% ) (75.02%)
93,464,053,895 L1-icache-misses ( +- 0.01% ) (75.02%)
66,714,207,526 iTLB-misses ( +- 0.02% ) (75.01%)
654.6605 +- 0.0541 seconds time elapsed ( +- 0.01% )
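For reference, a plausible reconstruction of the measurement command, inferred from the quoted command line and the counter list above (the exact perf event spellings are an assumption):

```sh
# Five averaged runs over a full clang build; the event names are
# inferred from the counters reported above.
perf stat -r 5 \
  -e instructions,cycles,L1-icache-misses,iTLB-misses \
  -- bash -c 'ninja clang && ninja clean'
```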
Processor:
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
BIOS Vendor ID: Advanced Micro Devices, Inc.
Model name: AMD Ryzen 7 7700 8-Core Processor
Used OS:
# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux trixie/sid"
NAME="Debian GNU/Linux"
VERSION_CODENAME=trixie
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Kernel:
Linux XXX.XXX.XXX.XXX 6.5.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.5.6-2.1 (2023-10-08) x86_64 GNU/Linux
Perf version:
# perf --version
perf version 6.5.6
llvm18 uses fixed MBB IDs (https://github.com/llvm/llvm-project/commit/3d6841b2b1a138b5b19fd57950079822565a3786), while autofdo has reverted the code that supported fixed MBB IDs (https://github.com/google/autofdo/commit/ad3e924f907b17c5ec7ab1ffe1849d9ce2d4b45f). I think that may be the reason.
Has this issue been resolved? I built with trunk (changing two places to fix the coredump in #190) and reproduced the results on an Intel machine. Even after setting
PATH_TO_TRUNK_LLVM_INSTALL=llvm17
it is still slow. Does autofdo also need to switch to llvm17? @lifengxiang1025
BASELINE:
Performance counter stats for 'bash -c ninja -j48 clang && ninja clean' (5 runs):
31,416,027,077,187 instructions # 0.71 insn per cycle ( +- 0.03% ) (95.99%)
44,129,300,535,322 cycles ( +- 0.03% ) (95.89%)
2,417,617,615,381 L1-icache-misses ( +- 0.06% ) (95.80%)
18,663,312,138 iTLB-misses ( +- 0.05% ) (95.69%)
353.482 +- 0.258 seconds time elapsed ( +- 0.07% )
PROPELLER:
32,310,032,286,299 instructions # 0.63 insn per cycle ( +- 0.02% ) (96.32%)
51,216,471,339,669 cycles ( +- 0.04% ) (96.23%)
3,250,199,372,974 L1-icache-misses ( +- 0.05% ) (96.14%)
20,012,345,482 iTLB-misses ( +- 0.05% ) (96.05%)
406.699 +- 0.304 seconds time elapsed ( +- 0.07% )
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
I used this code snapshot (https://github.com/llvm/llvm-project/commit/3d6841b2b1a138b5b19fd57950079822565a3786) and Propeller seems to work well with llvm16 (I think llvm17 is OK too). The reason is as above: llvm18 uses fixed MBB IDs, while autofdo has reverted the code that supported them.
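If it helps with debugging, one way to inspect the BB address map that the compiler emits and autofdo consumes (assuming the labeled binary was built with -fbasic-block-sections=labels; ./clang-labeled below is a placeholder name):

```sh
# Dump the per-block IDs and offsets recorded in the binary.
llvm-readobj --bb-addr-map ./clang-labeled | head -n 40
# Or just confirm the section is present.
readelf -S ./clang-labeled | grep -i bb_addr_map
```

If the MBB IDs dumped here don't match what the autofdo revision expects, the converted profile will attribute counts to the wrong blocks.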
Yes, my experimental results also suggest the root cause is the MBB ID. The original control flow graph of the tested function is shown below.
[CFG image: original control flow graph of the tested function]
Using autofdo with llvm19, the CFG became:
[CFG image: CFG after optimization with llvm19]
We can see that some basic blocks are split (e.g., BB2 became BB2-1 and BB2-2), which is unnecessary because the instruction before the unconditional branch in BB2-1 is a call and there is no C++ exception-related code in BB2.
In addition, some basic blocks are marked hot although their only predecessor is not. For example, BB17 is identified as a hot BB but its predecessor BB16 is not; BB34 and BB41 show the same pattern. Conversely, some basic blocks are hot but their successor is not: BB37 and BB44 are identified as hot but their successor BB45 is not. Again, there is no C++ exception code involved.
Then I switched to llvm17. However, only the function reordering seems functional. I also verified llvm17 with a hand-written C++ example: the hot/cold split is not functional there, while llvm19 worked for this small example.
Finally, I used llvm17 to generate the profile data and llvm19 to build the optimized binary. That works.
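For anyone else hitting this, a minimal sketch of that mixed-toolchain pipeline; the binary names, paths, and exact flag set here are my assumptions, not the precise commands I ran:

```sh
# 1) Build a labeled binary with llvm17 and collect an LBR profile.
clang++-17 -O2 -fbasic-block-sections=labels -o app.labeled app.cpp
perf record -e cycles:u -b -- ./app.labeled

# 2) Convert the perf data into Propeller profiles with autofdo.
create_llvm_prof --format=propeller --binary=app.labeled \
  --profile=perf.data --out=cluster.txt \
  --propeller_symorder=symorder.txt

# 3) Rebuild the optimized binary with llvm19 using the generated layout.
clang++-19 -O2 -fbasic-block-sections=list=cluster.txt \
  -fuse-ld=lld -Wl,--symbol-ordering-file=symorder.txt \
  -o app.propeller app.cpp
```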
I haven't checked whether llvm17 contains the hot/cold split code; I will check later.
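In case it is useful, this is the shape of the hand-written test I mean (a hypothetical reconstruction, not my exact code): the guarded error path should land in a cold/split section after the Propeller rebuild.

```sh
cat > hotcold.cpp <<'EOF'
#include <cstdio>
#include <cstdlib>

// Cold error path: essentially never executed at run time.
__attribute__((noinline)) void cold_path() {
  std::fprintf(stderr, "unexpected\n");
  std::abort();
}

int main(int argc, char **) {
  long sum = 0;
  for (long i = 0; i < 200000000; ++i) {
    if (__builtin_expect(argc > 64, 0))  // never true in practice
      cold_path();
    sum += i ^ argc;
  }
  std::printf("%ld\n", sum);
}
EOF
# Run the profile/rebuild pipeline above on hotcold.cpp, then check
# which text sections the blocks ended up in.
readelf -S hotcold.propeller | grep -i text
```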
I have tried to reproduce the optimization of clang with Propeller.
After all the modifications described in #179 and #180, the modified https://github.com/google/autofdo/blob/master/propeller_optimize_clang.sh started to work on my machine.
Unfortunately, the results look very strange: applying Propeller to clang slows it down by about 20%.
I used numactl to pin threads to hardware cores. When I disabled the pinning, the results improved slightly, but the gap between baseline and Propeller remained significant.
In the case of pinned threads, Propeller slightly decreased iTLB misses, while L1-icache misses increased by about 1.5x.
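The pinned runs used numactl along the lines of the following; the exact CPU and node arguments here are placeholders, and the real invocations are in the gists below:

```sh
# Pin the build to one set of hardware cores and local memory.
numactl --physcpubind=0-7 --membind=0 -- bash -c 'ninja clang && ninja clean'
```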
Tested in a RAM disk.
[lscpu output, OS, and kernel details omitted]
Gists:
with numactl: https://gist.github.com/foxtran/b7fedfbb0bd036629448ce62d18bd7a6
without numactl: https://gist.github.com/foxtran/fdc4abf8e2de127800f670b9edeeb9f2
Applied patches (on top of #179, #180):
for numactl: [patch contents omitted]
without numactl: [patch contents omitted]