Open nkoukpaizan opened 7 months ago
Tracked the issue down to an architecture-specific portion of the PETSc code. The Spack environment has target: [zen3], such that it appends -march=znver3 -mtune=znver3 to the compiler flags. With these flags, the compiler throws the error on a portion of code guarded by #if defined(__AVX2__) && defined(__FMA__) ...; the code compiles fine if that section is not compiled (-march=x86-64 -mtune=generic when target: [x86-64] in Spack).
I am now able to reproduce the issue outside of my Spack environment by adding --CFLAGS="-march=znver3 -mtune=znver3" to the PETSc configuration line. My reproducer compiles with amdclang from amd/5.2 through amd/5.5.1, but not with amd/5.6 and amd/5.7. That tells me it's a regression in the compiler.
I'll simplify the reproducer so that I can file a bug report with OLCF and AMD. I am also rebuilding our software stack with target: [x86-64], though that may have a negative impact on performance.
CC: @pelesh @cameronrutherford
Huh. Fascinating. Hopefully we can fix this in the newest version of that compiler toolchain. cc @balay
cc: @jczhang07
@balay it failed on code not related to GPU. As mentioned above, it seems like a compiler bug. I am not sure what we can do on the PETSc side to work around it (with --CFLAGS="-march=znver3 -mtune=znver3" and amdclang 5.6+).
fatal error: error in backend: Instruction Combining seems stuck in an infinite loop after 1000 iterations. PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script. Stack dump:
- Program arguments: /opt/rocm-5.6.0/llvm/bin/clang -march=znver3 -mtune=znver3 -I/lustre/orion/csc359/proj-shared/nkouk/spack-cache/build-stage/spack-stage-petsc-3.19.6-a75cj3sn6y5jpcseh5zemwmih53g2oto/spack-src/include -I/lustre/orion/csc359/proj-shared/nkouk/spack-cache/build-stage/spack-stage-petsc-3.19.6-a75cj3sn6y5jpcseh5zemwmih53g2oto/spack-src/arch-linux-c-opt/include -I/opt/rocm-5.6.0/include -I/opt/cray/pe/mpich/8.1.25/ofi/gnu/9.1/include -I/lustre/orion/csc359/proj-shared/nkouk/spack-install/linux-sles15-zen3/gcc-12.2.0-mixed/openblas-0.3.20-7hydqmqje2llj2tehcwgr55bhtp5bul2/include -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/llvm/include -I/opt/cray/pe/mpich/8.1.25/ofi/gnu/9.1/include -c -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -fstack-protector -Qunused-arguments -fvisibility=hidden -g -O3 -MMD -MP /lustre/orion/csc359/proj-shared/nkouk/spack-cache/build-stage/spack-stage-petsc-3.19.6-a75cj3sn6y5jpcseh5zemwmih53g2oto/spack-src/src/mat/impls/baij/seq/baij2.c -o arch-linux-c-opt/obj/mat/impls/baij/seq/baij2.o
Summary
While attempting to upgrade to ROCm 5.6 on Frontier (see nicholson/frontier-rocm5.6), PETSc fails to build.
The error is an ICE (Internal Compiler Error):
>> 2973 fatal error: error in backend: Instruction Combining seems stuck in an infinite loop after 1000 iterations.
>> 3021 clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Full log: spack-build-out.txt
I'll try a few more things. Building PETSc from source outside of the Spack environment seems to work fine with ROCm 5.6.