llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Abort/segfault when exiting programs using mesa and llvm version >= 15 #60361

Open jonemil opened 1 year ago

jonemil commented 1 year ago

I debugged a crash I experienced; it has also been reported in multiple places, including here: https://github.com/ValveSoftware/steam-for-linux/issues/8853

The hardware I'm running on is: Vendor: AMD (0x1002) Device: AMD Radeon RX 590 Series (polaris10, LLVM 15.0.7, DRM 3.49, 6.1.7-200.fc37.x86_64) (0x67df)

The issue is that a program called gldriverquery, presumably supplied by Valve, crashes on exit. Corrupted rendering was also seen, which has been debugged here: https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/issues/1542 From what I can tell, the graphics corruption is believed to be unrelated; the reports state that it was bisected in Mesa and resolved on that side.

The SHA1 checksum of the file is: d31a82db7ede30e1e32849b614aaf04d263dc642 .var/app/com.valvesoftware.Steam/.local/share/Steam/ubuntu12_64/gldriverquery It is part of the Flatpak Steam release version 1.0.0.75.

The problem was discovered when running the Flatpak Steam version (and other programs) using Mesa within Flatpak org.freedesktop.Platform.GL.default version 22.3.3. This version was updated to use LLVM 15; the previous platform version, 22.3.2, was built using LLVM 14.

I assumed in my bisection that the problem I was looking for was on the LLVM side, which I see is also mentioned in the Valve ticket linked above.

I bisected from llvmorg-14.0.6 with mesa-22.3.3, rebuilding and cleaning after each step, and ran the gldriverquery program to determine whether each revision was good or bad. LLVM was built with the following commands:

cmake -S llvm -B build64 -G Ninja -DCMAKE_BUILD_TYPE=Debug -DLLVM_USE_LINKER=lld -DLLVM_ENABLE_RTTI=ON -DCMAKE_INSTALL_PREFIX=$HOME/mesa -DLLVM_LIBDIR_SUFFIX=64 -DLLVM_TARGETS_TO_BUILD="AMDGPU" -DLLVM_OPTIMIZED_TABLEGEN=ON -DLLVM_BUILD_LLVM_DYLIB=ON -DLLVM_LINK_LLVM_DYLIB=ON -DLLVM_INCLUDE_EXAMPLES=OFF -DLLVM_ENABLE_PROJECTS=clang
ninja -C build64 -j 12 install

Mesa was built with the following:

meson build64 --libdir lib64 --prefix $HOME/mesa -Ddri-drivers= -Dgallium-drivers=radeonsi,swrast,zink -Dvulkan-drivers=amd -Dgallium-nine=true -Dosmesa=false -Dbuildtype=debug --native-file=my-llvm-x64
ninja -C build64 -j 5 install

my-llvm-x64:

[binaries]
llvm-config = '/home/jon/mesa/bin/llvm-config'

[cmake]
CMAKE_MODULE_PATH = '/home/jon/lib/cmake/clang'

The first bad commit I found was e6f1f062457c928c18a88c612f39d9e168f65a85. I then narrowed this down to the exact changes that seem to need reverting to avoid the abort/segfault. These are the reversions I made:

From 092b255d296df4784890f9e66ddf0bb56b452e9b Mon Sep 17 00:00:00 2001
From: Jon Emil Jahren <jonemilj@gmail.com>
Date: Sun, 29 Jan 2023 02:42:48 +0100
Subject: [PATCH] Revert changes causing ctor/dtor issues

Partially revert e6f1f062457c928c18a88c612f39d9e168f65a85
---
 llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp | 15 ++++++++-------
 llvm/lib/IR/PassRegistry.cpp                   | 10 ++++++++--
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index 195c0e6a836f..2bbd2bf762e0 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -61,6 +61,7 @@
 #include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/KnownBits.h"
 #include "llvm/Support/MachineValueType.h"
+#include "llvm/Support/ManagedStatic.h"
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Support/Mutex.h"
 #include "llvm/Support/raw_ostream.h"
@@ -10832,19 +10833,19 @@ namespace {

 } // end anonymous namespace

+static ManagedStatic<std::set<EVT, EVT::compareRawBits>> EVTs;
+static ManagedStatic<EVTArray> SimpleVTArray;
+static ManagedStatic<sys::SmartMutex<true>> VTMutex;
+
 /// getValueTypeList - Return a pointer to the specified value type.
 ///
 const EVT *SDNode::getValueTypeList(EVT VT) {
-  static std::set<EVT, EVT::compareRawBits> EVTs;
-  static EVTArray SimpleVTArray;
-  static sys::SmartMutex<true> VTMutex;
-
   if (VT.isExtended()) {
-    sys::SmartScopedLock<true> Lock(VTMutex);
-    return &(*EVTs.insert(VT).first);
+    sys::SmartScopedLock<true> Lock(*VTMutex);
+    return &(*EVTs->insert(VT).first);
   }
   assert(VT.getSimpleVT() < MVT::VALUETYPE_SIZE && "Value type out of range!");
-  return &SimpleVTArray.VTs[VT.getSimpleVT().SimpleTy];
+  return &SimpleVTArray->VTs[VT.getSimpleVT().SimpleTy];
 }

 /// hasNUsesOfValue - Return true if there are exactly NUSES uses of the
diff --git a/llvm/lib/IR/PassRegistry.cpp b/llvm/lib/IR/PassRegistry.cpp
index 6c22fcd34769..94f607afec47 100644
--- a/llvm/lib/IR/PassRegistry.cpp
+++ b/llvm/lib/IR/PassRegistry.cpp
@@ -15,15 +15,21 @@
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/Pass.h"
 #include "llvm/PassInfo.h"
+#include "llvm/Support/ManagedStatic.h"
 #include <cassert>
 #include <memory>
 #include <utility>

 using namespace llvm;

+// FIXME: We use ManagedStatic to erase the pass registrar on shutdown.
+// Unfortunately, passes are registered with static ctors, and having
+// llvm_shutdown clear this map prevents successful resurrection after
+// llvm_shutdown is run.  Ideally we should find a solution so that we don't
+// leak the map, AND can still resurrect after shutdown.
+static ManagedStatic<PassRegistry> PassRegistryObj;
 PassRegistry *PassRegistry::getPassRegistry() {
-  static PassRegistry PassRegistryObj;
-  return &PassRegistryObj;
+  return &*PassRegistryObj;
 }

 //===----------------------------------------------------------------------===//
-- 
2.39.1
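
For readers less familiar with the two lifetime models the patch switches between, here is a minimal sketch of the difference. It does not use LLVM headers: `LazyManaged`, `Registry`, `eventLog`, and the helper functions are hypothetical names made up for illustration, and `LazyManaged` is only a simplified stand-in for the idea behind `ManagedStatic` (created on first use, destroyed by an explicit shutdown call), not the real class, which is also thread-safe and is torn down by `llvm_shutdown()`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical event log. It is leaked on purpose so that it stays valid
// even while static destructors are running at process exit.
std::vector<std::string> &eventLog() {
  static auto *Log = new std::vector<std::string>();
  return *Log;
}

// Stand-in for an object like PassRegistry; it only records its lifetime.
struct Registry {
  Registry() { eventLog().push_back("ctor"); }
  ~Registry() { eventLog().push_back("dtor"); }
};

// Model 1: function-local static. Constructed on first call and destroyed
// automatically during exit, in reverse order of construction.
Registry &localRegistry() {
  static Registry R;
  return R;
}

// Model 2: simplified stand-in for the ManagedStatic idea (assumption: this
// is NOT the real llvm::ManagedStatic). The object is created on first use
// and destroyed only when shutdown() is called explicitly.
template <typename T> class LazyManaged {
  T *Ptr = nullptr;

public:
  T &get() {
    if (!Ptr)
      Ptr = new T();
    return *Ptr;
  }
  bool isConstructed() const { return Ptr != nullptr; }
  void shutdown() {
    delete Ptr;
    Ptr = nullptr;
  }
};

LazyManaged<Registry> ManagedRegistry;

Registry &managedRegistry() { return ManagedRegistry.get(); }
void shutdownManaged() { ManagedRegistry.shutdown(); }
```

In this sketch, calling `managedRegistry()` again after `shutdownManaged()` creates a fresh, empty object, which is the "resurrection" situation the FIXME comment in the patch describes. With `localRegistry()`, the owner has no control over when destruction happens: it takes place during process exit, in reverse order of construction.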

When running the crashing program repeatedly, there seems to be some kind of undefined behaviour, so the crash can be either an abort:

gldriverquery: /home/jon/projects/llvm-project/llvm/include/llvm/PassInfo.h:99: llvm::Pass* llvm::PassInfo::createPass() const: Assertion `NormalCtor && "Cannot call createPass on PassInfo without default ctor!"' failed.

(gldriverquery_abort.txt) or a segfault (gldriverquery_segfault.txt).

I tested the partial reversions on top of llvmorg-15.0.7, and at least with respect to the gldriverquery program it works. I also tested the remaining reversions individually: with only the PassRegistry revert in place, it just failed a bit later with the following abort:

gldriverquery: /home/jon/projects/llvm-project/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:10452: void llvm::SelectionDAGISel::LowerArguments(const llvm::Function&): Assertion `NewRoot.getNode() && NewRoot.getValueType() == MVT::Other && "LowerFormalArguments didn't return a valid chain!"' failed.

Unfortunately I don't know where to begin in creating a reproducible test case, and for all I know Mesa may be partially responsible. However, I believe something odd is going on, as suggested by the FIXME comment that was removed by the bad commit. My hypothesis is that Mesa is not doing anything wrong, and that the cause is the ctor/dtor issue referred to by the FIXME, which hasn't been addressed properly. Perhaps someone more familiar with this code can look at the highlighted changes and get some value from the report, even without a simpler way of reproducing it. For what it's worth, the removed workaround seems to go as far back as ee3570f0ff2e6fc47eae9c417503709d9031a722.
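
To make the kind of ctor/dtor ordering problem meant here concrete, the following is a generic illustration of the C++ rule involved, not a confirmed diagnosis of this crash. Objects with static storage duration are destroyed in reverse order of construction, so anything constructed before a function-local static is destroyed after it and may still try to use it. All names (`FakeRegistry`, `FakeClient`, `exitLog`, `setUpInProblematicOrder`) are hypothetical, and the client only checks a flag instead of touching the destroyed object, so the sketch itself stays well-defined.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Leaked on purpose so it remains valid while static destructors run.
std::vector<std::string> &exitLog() {
  static auto *Log = new std::vector<std::string>();
  return *Log;
}

// Plain flag with no destructor, so it is safe to read at any point.
bool RegistryAlive = false;

// Hypothetical stand-in for a registry held in a function-local static.
struct FakeRegistry {
  FakeRegistry() { RegistryAlive = true; }
  ~FakeRegistry() {
    RegistryAlive = false;
    exitLog().push_back("registry destroyed");
  }
};

FakeRegistry &getFakeRegistry() {
  static FakeRegistry R;
  return R;
}

// Hypothetical user of the registry whose destructor runs at exit.
// A real client might call into the registry here; this one only
// records whether the registry would still have been usable.
struct FakeClient {
  ~FakeClient() {
    exitLog().push_back(RegistryAlive
                            ? "client: registry alive"
                            : "client: registry already destroyed");
  }
};

FakeClient &getFakeClient() {
  static FakeClient C;
  return C;
}

// Constructing the client first means it is destroyed last, after the
// registry it depends on has already gone away.
void setUpInProblematicOrder() {
  getFakeClient();   // constructed first, destroyed last
  getFakeRegistry(); // constructed second, destroyed first
}
```

With an object that is only torn down by an explicit shutdown call, the same late access would not depend on exit-time destruction order, which is one possible reason the reverted version behaves differently; whether that is what happens with gldriverquery has not been verified.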

EugeneZelenko commented 1 year ago

Could you please try the 16 release candidate or the main branch?

jonemil commented 1 year ago

I can reproduce abort/segfault on llvmorg-16.0.0-rc1 as well, same abort message (when it happens), and the stack traces look the same.