pnnl / ExaGO

High-performance power grid optimization for stochastic, security-constrained, and multi-period ACOPF problems.

Spack environment with ROCm 5.6 fails to build PETSc #65

Open nkoukpaizan opened 7 months ago

nkoukpaizan commented 7 months ago

Summary

While attempting to upgrade to ROCm 5.6 on Frontier (see nicholson/frontier-rocm5.6), PETSc fails to build.

The error is an ICE (Internal Compiler Error):

    >> 2973    fatal error: error in backend: Instruction Combining seems stuck in an infinite loop after 1000 iterations.
    >> 3021    clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)

Full log: spack-build-out.txt

I'll try a few more things. Building PETSc from source outside of the Spack environment seems to work fine with ROCm 5.6.

nkoukpaizan commented 7 months ago

Tracked the issue down to an architecture-specific portion of the PETSc code. The Spack environment sets target: [zen3], so Spack appends -march=znver3 -mtune=znver3 to the compiler flags. With these flags, the compiler throws the error on a portion of code guarded by #if defined(__AVX2__) && defined(__FMA__) .... The code compiles fine when that section is excluded from compilation, which is the case with target: [x86-64] in Spack (-march=x86-64 -mtune=generic).

I am now able to reproduce the issue outside of my Spack environment by adding --CFLAGS="-march=znver3 -mtune=znver3" to the PETSc configuration line. My reproducer compiles with amdclang from amd/5.2 through amd/5.5.1, but not with amd/5.6 and amd/5.7. That tells me it's a regression in the compiler.

I'll simplify the reproducer so that I can file a bug report with OLCF and AMD. I am also rebuilding our software stack with target: [x86-64], though that may have a negative impact on performance.

CC: @pelesh @cameronrutherford

cameronrutherford commented 7 months ago

Huh. Fascinating. Hopefully we can fix this in the newest version of that compiler toolchain. cc @balay

balay commented 7 months ago

cc: @jczhang07

jczhang07 commented 7 months ago

@balay it failed on code not related to GPU. As mentioned above, it seems like a compiler bug. I am not sure what we can do on the PETSc side to work around it (with --CFLAGS="-march=znver3 -mtune=znver3" and amdclang 5.6+).

fatal error: error in backend: Instruction Combining seems stuck in an infinite loop after 1000 iterations.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:

  1. Program arguments: /opt/rocm-5.6.0/llvm/bin/clang -march=znver3 -mtune=znver3 -I/lustre/orion/csc359/proj-shared/nkouk/spack-cache/build-stage/spack-stage-petsc-3.19.6-a75cj3sn6y5jpcseh5zemwmih53g2oto/spack-src/include -I/lustre/orion/csc359/proj-shared/nkouk/spack-cache/build-stage/spack-stage-petsc-3.19.6-a75cj3sn6y5jpcseh5zemwmih53g2oto/spack-src/arch-linux-c-opt/include -I/opt/rocm-5.6.0/include -I/opt/cray/pe/mpich/8.1.25/ofi/gnu/9.1/include -I/lustre/orion/csc359/proj-shared/nkouk/spack-install/linux-sles15-zen3/gcc-12.2.0-mixed/openblas-0.3.20-7hydqmqje2llj2tehcwgr55bhtp5bul2/include -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/llvm/include -I/opt/cray/pe/mpich/8.1.25/ofi/gnu/9.1/include -c -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -fstack-protector -Qunused-arguments -fvisibility=hidden -g -O3 -MMD -MP /lustre/orion/csc359/proj-shared/nkouk/spack-cache/build-stage/spack-stage-petsc-3.19.6-a75cj3sn6y5jpcseh5zemwmih53g2oto/spack-src/src/mat/impls/baij/seq/baij2.c -o arch-linux-c-opt/obj/mat/impls/baij/seq/baij2.o