llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.33k stars 11.7k forks source link

[AMDGPU] With Clang>17, -amdgpu-early-inline-all=true consumes 8x more memory #86332

Open AngryLoki opened 6 months ago

AngryLoki commented 6 months ago

There is some kind of regression in -amdgpu-early-inline-all=true option, which is set for every HIP application in hipcc.

While this option makes no significant performance/memory impact in Clang 17, attempt to migrate to Clang 18.1.0 or nightly Clang 19 build consumes 8x more memory, which makes Clang unusable for HIP (i. e. when multiple compile units consume 10GB each in parallel, there is just not enough RAM eventually, even when compiling for single target GPU arch).

Environment:

/usr/lib/llvm/17/bin/clang-17 --version | grep version
clang version 17.0.6

/usr/lib/llvm/18/bin/clang-18 --version | grep version
clang version 18.1.0

/usr/lib/llvm/19/bin/clang-19 --version | grep version
clang version 19.0.0git6d3cec01

Common flags (verbose output of composable-kernel-6.0.2):

export FLAGS="-cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-pc-linux-gnu -emit-obj -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name device_batchnorm_forward_f32_instance.cpp -mrelocation-model pic -pic-level 2 -fhalf-no-semantic-interposition -mframe-pointer=none -fno-rounding-math -mconstructor-aliases -aux-target-cpu x86-64 -fcuda-is-device -mllvm -amdgpu-internalize-symbols -fcuda-allow-variadic-functions -fvisibility=hidden -fapply-global-visibility-to-externs -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/hip.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ocml.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ockl.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_daz_opt_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_finite_only_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_wavefrontsize64_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_isa_version_1030.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_abi_version_400.bc -target-cpu gfx1030 -debugger-tuning=gdb -fdebug-compilation-dir=/var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build -resource-dir /usr/lib/clang/17 -dependency-file library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o.d -MT library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o -sys-header-deps -internal-isystem /usr/lib/clang/17/include/cuda_wrappers -idirafter /usr/local/include -include __clang_hip_runtime_wrapper.h -include /usr/include/gentoo/fortify.h -include /usr/include/gentoo/maybe-stddefs.h -D CK_ENABLE_BF16 -D CK_ENABLE_BF8 -D CK_ENABLE_FP16 -D CK_ENABLE_FP32 -D CK_ENABLE_FP64 -D CK_ENABLE_FP8 -D CK_ENABLE_INT8 -D USE_PROF_API=1 -D __HIP_PLATFORM_AMD__=1 -D __HIP_PLATFORM_HCC__=1 -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build/include -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -std=c++17 -fdeprecated-macro -fno-autolink -ferror-limit 19 -fmessage-length=173 -fhip-new-launch-api -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -fcolor-diagnostics -vectorize-loops -vectorize-slp -mllvm -amdgpu-function-calls=false -cuid=aa0b75146f478e4b -fcuda-allow-variadic-functions -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o /tmp/device_batchnorm_forward_f32_instance-gfx1030-437c24.o -x hip /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/src/tensor_operation_instance/gpu/batchnorm/device_batchnorm_forward_f32_instance.cpp"

Without -amdgpu-early-inline-all=true everything is fine:

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS
Memory: 818272 KB, Time: 0:20.62

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS
Memory: 830300 KB, Time: 0:18.28

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS
Memory: 861772 KB, Time: 0:22.69

With -amdgpu-early-inline-all=true Clang 18 and 19 are hungry and slow:

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 818240 KB, Time: 0:20.80

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 6402824 KB, Time: 1:02.50

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 6343976 KB, Time: 1:12.43

I don't provide preprocessed version of device_batchnorm_forward_f32_instance.cpp, because for some reason I can't rebuild it after preprocessing (complaints about constexprs). However if you need it or some other dumps, please ask and I will attach.

llvmbot commented 6 months ago

@llvm/issue-subscribers-backend-amdgpu

Author: None (AngryLoki)

There is some kind of regression in `-amdgpu-early-inline-all=true` option, which is set for every HIP application in hipcc. While this option makes no significant performance/memory impact in Clang 17, attempt to migrate to Clang 18.1.0 or nightly Clang 19 build consumes 8x more memory, which makes Clang unusable for HIP (i. e. when multiple compile units consume 10GB each in parallel, there is just not enough RAM eventually, even when compiling for single target GPU arch). Environment: ``` /usr/lib/llvm/17/bin/clang-17 --version | grep version clang version 17.0.6 /usr/lib/llvm/18/bin/clang-18 --version | grep version clang version 18.1.0 /usr/lib/llvm/19/bin/clang-19 --version | grep version clang version 19.0.0git6d3cec01 ``` Common flags (verbose output of [composable-kernel-6.0.2](https://github.com/ROCm/composable_kernel/tree/rocm-6.0.2)): ``` export FLAGS="-cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-pc-linux-gnu -emit-obj -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name device_batchnorm_forward_f32_instance.cpp -mrelocation-model pic -pic-level 2 -fhalf-no-semantic-interposition -mframe-pointer=none -fno-rounding-math -mconstructor-aliases -aux-target-cpu x86-64 -fcuda-is-device -mllvm -amdgpu-internalize-symbols -fcuda-allow-variadic-functions -fvisibility=hidden -fapply-global-visibility-to-externs -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/hip.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ocml.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ockl.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_daz_opt_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_finite_only_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_wavefrontsize64_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_isa_version_1030.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_abi_version_400.bc -target-cpu gfx1030 -debugger-tuning=gdb -fdebug-compilation-dir=/var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build -resource-dir /usr/lib/clang/17 -dependency-file library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o.d -MT library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o -sys-header-deps -internal-isystem /usr/lib/clang/17/include/cuda_wrappers -idirafter /usr/local/include -include __clang_hip_runtime_wrapper.h -include /usr/include/gentoo/fortify.h -include /usr/include/gentoo/maybe-stddefs.h -D CK_ENABLE_BF16 -D CK_ENABLE_BF8 -D CK_ENABLE_FP16 -D CK_ENABLE_FP32 -D CK_ENABLE_FP64 -D CK_ENABLE_FP8 -D CK_ENABLE_INT8 -D USE_PROF_API=1 -D __HIP_PLATFORM_AMD__=1 -D __HIP_PLATFORM_HCC__=1 -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build/include -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -std=c++17 -fdeprecated-macro -fno-autolink -ferror-limit 19 -fmessage-length=173 -fhip-new-launch-api -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -fcolor-diagnostics -vectorize-loops -vectorize-slp -mllvm -amdgpu-function-calls=false -cuid=aa0b75146f478e4b -fcuda-allow-variadic-functions -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o /tmp/device_batchnorm_forward_f32_instance-gfx1030-437c24.o -x hip /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/src/tensor_operation_instance/gpu/batchnorm/device_batchnorm_forward_f32_instance.cpp" ``` Without `-amdgpu-early-inline-all=true` everything is fine: ``` /usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS Memory: 818272 KB, Time: 0:20.62 /usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS Memory: 830300 KB, Time: 0:18.28 /usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS Memory: 861772 KB, Time: 0:22.69 ``` With `-amdgpu-early-inline-all=true` Clang 18 and 19 are hungry and slow: ``` /usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS -mllvm -amdgpu-early-inline-all=true Memory: 818240 KB, Time: 0:20.80 /usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS -mllvm -amdgpu-early-inline-all=true Memory: 6402824 KB, Time: 1:02.50 /usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS -mllvm -amdgpu-early-inline-all=true Memory: 6343976 KB, Time: 1:12.43 ``` I don't provide preprocessed version of [device_batchnorm_forward_f32_instance.cpp](https://github.com/ROCm/composable_kernel/blob/rocm-6.0.2/library/src/tensor_operation_instance/gpu/batchnorm/device_batchnorm_forward_f32_instance.cpp), because for some reason I can't rebuild it after preprocessing (complaints about constexprs). However if you need it or some other dumps, please ask and I will attach.
Artem-B commented 6 months ago

How large is the IR we end up trying to compile? Indiscriminately inlining everything may result in the code size explosion. The source file looks template-heavy, and it's possible that we may be inlining way too much because the user requested it. It's quite possible that it's not a regression, but, rather that we have actually fixed the behavior of -amdgpu-early-inline-all=true

@yxsamliu Sam, do you know what's the story with -amdgpu-early-inline-all=true ? It looks like something that can easily backfire on large/complicated enough code.

bcahoon commented 6 months ago

Looks like this is related to https://github.com/llvm/llvm-project/issues/59126, though that issue is about the target independent always-inline pass. That issue contains a couple of test cases too.

AngryLoki commented 6 months ago

Yes, very likely to be related, but it looks like #59126 does not fully reflect all the changes. It says that in Nov 22, 2022 after changes that were pushed before LLVM-14 release, users experienced time explosion with Alwaysinliner.

However for my case everything is ok before LLVM-18 release. Maybe before LLVM-18 -mllvm -amdgpu-early-inline-all=true was nonfunctional/placebo in clang, because

/usr/lib/llvm/17/bin/clang $FLAGS -S -emit-llvm -o /dev/stdout | md5sum
dfac0099986317d8731012f8d6e7a11c  - # 15M .ll file

/usr/lib/llvm/17/bin/clang $FLAGS  -mllvm -amdgpu-early-inline-all=true -S -emit-llvm -o /dev/stdout | md5sum
dfac0099986317d8731012f8d6e7a11c  - # 15M .ll file

/usr/lib/llvm/18/bin/clang $FLAGS -S -emit-llvm -o /dev/stdout | md5sum
5f8fb8b9c7b1a25f2669de75587845a3  - # 13M .ll file

/usr/lib/llvm/18/bin/clang $FLAGS -mllvm -amdgpu-early-inline-all=true -S -emit-llvm -o /dev/stdout | md5sum
a60a9a166226cf36898c8c470ef4be0f  - # 12M .ll file
bcahoon commented 6 months ago

The initial commit mention in https://github.com/llvm/llvm-project/issues/59126 was reverted and then it re-landed on Oct 29, 2023 https://github.com/llvm/llvm-project/commit/1a2e77cf9e11dbf56b5720c607313a566eebb16e. Commenting out the code that adds AlwaysInlinerPass https://github.com/llvm/llvm-project/blob/9f0321ccf118b37e5cb93cabd2acbf600c36b6ee/llvm/lib/Passes/PassBuilderPipelines.cpp#L1168 does reduce the compile-time and memory usage. But, yes, there is an interaction between the amdgpu-early-inilne-all flag and the addition of AlwaysInlinerPass so early in the optimization pipeline. It's interesting that when those passes run later in the pipeline, there isn't an issue.

yxsamliu commented 6 months ago

@arsenm @scchan Any insights? Thanks

arsenm commented 6 months ago

We should just delete the flag, and fully delete AMDGPUAlwaysInlinePass. These are vestiges from before function calls were supported. Forcibly inlining everything is going to make every function bigger and slower to compile. I don't know what to do other than general large function compile time improvements.

JonChesterfield commented 5 months ago

Deleting the always inline pass sounds sensible to me. If that's a horrendous regression for someone maybe we can add a clang flag that tags everything with attribute(always_inline) instead - that should be similar in effect to the custom pass, plausibly useful on some other targets, still allow us to delete that pass.

arsenm commented 5 months ago

Looks like this is related to #59126, though that issue is about the target independent always-inline pass. That issue contains a couple of test cases too.

But that's the same thing - all this pass is tag every function with alwaysinline and the regular AwaysInline pass does the actual work

dfukalov commented 1 month ago

Hi @AngryLoki would you please that PR #96958 fixes the issue?

AngryLoki commented 6 days ago

Hi, this PR is released in 19.1.0, so I checked it:

# Without -amdgpu-early-inline-all=true
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS
Memory: 827740 KB, Time: 0:18.21

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS
Memory: 830096 KB, Time: 0:18.53

# With -amdgpu-early-inline-all=true
/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 6411340 KB, Time: 1:05.20

/usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS -mllvm -amdgpu-early-inline-all=true
Memory: 3623372 KB, Time: 1:03.93

clang-19.1 now consumes 2x less memory, however is it still 4x more then clang-17. Also it is still as slow as clang-18. Is it possible to improve it?

llvmbot commented 6 days ago

@llvm/issue-subscribers-backend-amdgpu

Author: None (AngryLoki)

There is some kind of regression in `-amdgpu-early-inline-all=true` option, which is set for every HIP application in hipcc. While this option makes no significant performance/memory impact in Clang 17, attempt to migrate to Clang 18.1.0 or nightly Clang 19 build consumes 8x more memory, which makes Clang unusable for HIP (i. e. when multiple compile units consume 10GB each in parallel, there is just not enough RAM eventually, even when compiling for single target GPU arch). Environment: ``` /usr/lib/llvm/17/bin/clang-17 --version | grep version clang version 17.0.6 /usr/lib/llvm/18/bin/clang-18 --version | grep version clang version 18.1.0 /usr/lib/llvm/19/bin/clang-19 --version | grep version clang version 19.0.0git6d3cec01 ``` Common flags (verbose output of [composable-kernel-6.0.2](https://github.com/ROCm/composable_kernel/tree/rocm-6.0.2)): ``` export FLAGS="-cc1 -triple amdgcn-amd-amdhsa -aux-triple x86_64-pc-linux-gnu -emit-obj -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name device_batchnorm_forward_f32_instance.cpp -mrelocation-model pic -pic-level 2 -fhalf-no-semantic-interposition -mframe-pointer=none -fno-rounding-math -mconstructor-aliases -aux-target-cpu x86-64 -fcuda-is-device -mllvm -amdgpu-internalize-symbols -fcuda-allow-variadic-functions -fvisibility=hidden -fapply-global-visibility-to-externs -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/hip.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ocml.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/ockl.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_daz_opt_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_unsafe_math_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_finite_only_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_wavefrontsize64_off.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_isa_version_1030.bc -mlink-builtin-bitcode /usr/lib/amdgcn/bitcode/oclc_abi_version_400.bc -target-cpu gfx1030 -debugger-tuning=gdb -fdebug-compilation-dir=/var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build -resource-dir /usr/lib/clang/17 -dependency-file library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o.d -MT library/src/tensor_operation_instance/gpu/batchnorm/CMakeFiles/device_batchnorm_instance.dir/device_batchnorm_forward_f32_instance.cpp.o -sys-header-deps -internal-isystem /usr/lib/clang/17/include/cuda_wrappers -idirafter /usr/local/include -include __clang_hip_runtime_wrapper.h -include /usr/include/gentoo/fortify.h -include /usr/include/gentoo/maybe-stddefs.h -D CK_ENABLE_BF16 -D CK_ENABLE_BF8 -D CK_ENABLE_FP16 -D CK_ENABLE_FP32 -D CK_ENABLE_FP64 -D CK_ENABLE_FP8 -D CK_ENABLE_INT8 -D USE_PROF_API=1 -D __HIP_PLATFORM_AMD__=1 -D __HIP_PLATFORM_HCC__=1 -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/include -I /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2_build/include -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/x86_64-pc-linux-gnu -internal-isystem /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/backward -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /usr/lib/clang/17/include -internal-isystem /usr/local/include -internal-isystem /usr/x86_64-pc-linux-gnu/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -std=c++17 -fdeprecated-macro -fno-autolink -ferror-limit 19 -fmessage-length=173 -fhip-new-launch-api -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -fcolor-diagnostics -vectorize-loops -vectorize-slp -mllvm -amdgpu-function-calls=false -cuid=aa0b75146f478e4b -fcuda-allow-variadic-functions -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o /tmp/device_batchnorm_forward_f32_instance-gfx1030-437c24.o -x hip /var/tmp/portage/sci-libs/composable-kernel-6.0.2/work/composable_kernel-rocm-6.0.2/library/src/tensor_operation_instance/gpu/batchnorm/device_batchnorm_forward_f32_instance.cpp" ``` Without `-amdgpu-early-inline-all=true` everything is fine: ``` /usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS Memory: 818272 KB, Time: 0:20.62 /usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS Memory: 830300 KB, Time: 0:18.28 /usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS Memory: 861772 KB, Time: 0:22.69 ``` With `-amdgpu-early-inline-all=true` Clang 18 and 19 are hungry and slow: ``` /usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/17/bin/clang-17 $FLAGS -mllvm -amdgpu-early-inline-all=true Memory: 818240 KB, Time: 0:20.80 /usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/18/bin/clang-18 $FLAGS -mllvm -amdgpu-early-inline-all=true Memory: 6402824 KB, Time: 1:02.50 /usr/bin/time -f 'Memory: %M KB, Time: %E' /usr/lib/llvm/19/bin/clang-19 $FLAGS -mllvm -amdgpu-early-inline-all=true Memory: 6343976 KB, Time: 1:12.43 ``` I don't provide preprocessed version of [device_batchnorm_forward_f32_instance.cpp](https://github.com/ROCm/composable_kernel/blob/rocm-6.0.2/library/src/tensor_operation_instance/gpu/batchnorm/device_batchnorm_forward_f32_instance.cpp), because for some reason I can't rebuild it after preprocessing (complaints about constexprs). However if you need it or some other dumps, please ask and I will attach.