[OpenMP] Different fail modes for memory_manager.cpp OpenMP test

doru1004 commented 10 months ago

I investigated the memory_manager.cpp test in OpenMP and it looks like, on AMD GPUs, it fails in different ways for different optimization levels.

With -O2 and -O3 the test passes consistently. With -O1, or with no optimization level specified, it fails occasionally with a GPU memory error. With -O0 it doesn't even compile; the trace is below:

clang-linker-wrapper: /home/dobercea/upstream/llvm-project/llvm/lib/Target/AMDGPU/AMDGPUResourceUsageAnalysis.cpp:154: virtual bool llvm::AMDGPUResourceUsageAnalysis::runOnModule(llvm::Module&): Assertion `MF && "function must have been generated already"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: /home/dobercea/upstream/llvm-project/build/./bin/clang-linker-wrapper --opt-level=O0 --host-triple=x86_64-unknown-linux-gnu --linker-path=/home/dobercea/upstream/llvm-project/build/./bin/ld.lld -- -pie -z relro --hash-style=gnu --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o /home/dobercea/upstream/llvm-project/build/runtimes/runtimes-bins/openmp/libomptarget/test/amdgcn-amd-amdhsa/offloading/Output/memory_manager.cpp.tmp /lib/x86_64-linux-gnu/Scrt1.o /lib/x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/9/crtbeginS.o -L/home/dobercea/upstream/llvm-project/build/runtimes/runtimes-bins/openmp/libomptarget -L/home/dobercea/upstream/llvm-project/build/./lib -L/home/dobercea/upstream/llvm-project/build/runtimes/runtimes-bins/openmp/runtime/src -L/home/dobercea/upstream/llvm-project/build/lib/clang/18/lib/x86_64-unknown-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/9 -L/usr/lib/gcc/x86_64-linux-gnu/9/../../../../lib64 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/lib -L/usr/lib -rpath /home/dobercea/upstream/llvm-project/build/runtimes/runtimes-bins/openmp/libomptarget -rpath /home/dobercea/upstream/llvm-project/build/runtimes/runtimes-bins/openmp/runtime/src -rpath /home/dobercea/upstream/llvm-project/build/./lib /tmp/memory_manager-1120de.o -lstdc++ -lm -lomp -lomptarget -lomptarget.devicertl -L/home/dobercea/upstream/llvm-project/build/lib -lgcc_s -lgcc -lpthread -lc -lgcc_s -lgcc /usr/lib/gcc/x86_64-linux-gnu/9/crtendS.o /lib/x86_64-linux-gnu/crtn.o
1.      Running pass 'Function register usage analysis' on module 'ld-temp.o'.
 #0 0x00005606ddf20d54 PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x00005606ddf1e584 SignalHandler(int) Signals.cpp:0:0
 #2 0x00007fd634757420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
 #3 0x00007fd6341f400b raise /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:51:1
 #4 0x00007fd6341d3859 abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:81:7
 #5 0x00007fd6341d3729 get_sysdep_segment_value /build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c:509:8
 #6 0x00007fd6341d3729 _nl_load_domain /build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c:970:34
 #7 0x00007fd6341e4fd6 (/lib/x86_64-linux-gnu/libc.so.6+0x33fd6)
 #8 0x00005606dd257095 (/home/dobercea/upstream/llvm-project/build/./bin/clang-linker-wrapper+0x75e095)
 #9 0x00005606dd8a9252 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/home/dobercea/upstream/llvm-project/build/./bin/clang-linker-wrapper+0xdb0252)
#10 0x00005606de4d83a5 codegen(llvm::lto::Config const&, llvm::TargetMachine*, std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, unsigned int, llvm::Module&, llvm::ModuleSummaryIndex const&) LTOBackend.cpp:0:0
#11 0x00005606de4d897d llvm::lto::backend(llvm::lto::Config const&, std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, unsigned int, llvm::Module&, llvm::ModuleSummaryIndex&) (/home/dobercea/upstream/llvm-project/build/./bin/clang-linker-wrapper+0x19df97d)
#12 0x00005606de4cf004 llvm::lto::LTO::runRegularLTO(std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>) (/home/dobercea/upstream/llvm-project/build/./bin/clang-linker-wrapper+0x19d6004)
#13 0x00005606de4cf678 llvm::lto::LTO::run(std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, std::function<llvm::Expected<std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>> (unsigned int, llvm::StringRef, llvm::Twine const&)>) (/home/dobercea/upstream/llvm-project/build/./bin/clang-linker-wrapper+0x19d6678)
#14 0x00005606dce28708 (anonymous namespace)::linkBitcodeFiles(llvm::SmallVectorImpl<llvm::object::OffloadFile>&, llvm::SmallVectorImpl<llvm::StringRef>&, llvm::opt::ArgList const&) (.constprop.0) ClangLinkerWrapper.cpp:0:0
#15 0x00005606dce2f35a llvm::Error (anonymous namespace)::linkAndWrapDeviceFiles(llvm::SmallVectorImpl<llvm::object::OffloadFile>&, llvm::opt::InputArgList const&, char**, int)::'lambda'(auto&)::operator()<llvm::SmallVector<llvm::object::OffloadFile, 3u>>(auto&) const ClangLinkerWrapper.cpp:0:0
#16 0x00005606dce357a6 (anonymous namespace)::linkAndWrapDeviceFiles(llvm::SmallVectorImpl<llvm::object::OffloadFile>&, llvm::opt::InputArgList const&, char**, int) ClangLinkerWrapper.cpp:0:0
#17 0x00005606dcd7d709 main (/home/dobercea/upstream/llvm-project/build/./bin/clang-linker-wrapper+0x284709)
#18 0x00007fd6341d5083 __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:342:3
#19 0x00005606dce17eee _start (/home/dobercea/upstream/llvm-project/build/./bin/clang-linker-wrapper+0x31eeee)

llvmbot commented 10 months ago

@llvm/issue-subscribers-openmp

shiltian commented 10 months ago

This doesn't look like an OpenMP issue. It looks like an AMDGPU backend issue instead.

llvmbot commented 10 months ago

@llvm/issue-subscribers-backend-amdgpu

doru1004 commented 10 months ago

> This doesn't look like an OpenMP issue. It looks like an AMDGPU backend issue instead.

It may be an OpenMP issue for -O1 and the default optimization level.

jdoerfert commented 10 months ago

> > This doesn't look like an OpenMP issue. It looks like an AMDGPU backend issue instead.
>
> It may be an OpenMP issue for -O1 and the default optimization level.

It might be. These look like three different problems. Let's fix them one by one. The crash in the backend is the easiest. Once -O0 compiles, we should figure out why passing no -O flag is not equivalent to -O0. Then we can dive into the -O1 case and try to understand why it fails.

doru1004 commented 10 months ago

Update: adding device(0) clauses fixes the intermittent failures for -O1. With no optimization level specified, the intermittent failures drop from 4 to 5 per 100 runs to 1 to 2 per 500 runs. This suggests that more than one issue causes the intermittent failures and that at least one remains. The compilation failure at -O0 is unchanged.
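
For readers unfamiliar with the clause, here is a minimal sketch of what adding device(0) to a target region looks like. It is not taken from memory_manager.cpp, whose source is not shown in this thread; the helper name fill_on_device and the loop body are hypothetical and chosen only for illustration.

```cpp
#include <vector>

// Hypothetical helper for illustration only; not code from memory_manager.cpp.
// Fills a buffer with 0..n-1 inside an OpenMP target region.
std::vector<int> fill_on_device(int n) {
  std::vector<int> data(n, 0);
  int *p = data.data();

  // device(0) requests device number 0 explicitly instead of relying on
  // the default device. map(tofrom: ...) copies the buffer to the device
  // before the region and back to the host afterwards.
#pragma omp target teams distribute parallel for device(0) map(tofrom: p[0:n])
  for (int i = 0; i < n; ++i)
    p[i] = i;

  return data;
}
```

If the file is built without OpenMP offloading enabled, the compiler ignores the pragma and the loop runs on the host, so the function returns the same values either way.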

llvm / llvm-project, issue #65077