llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.15k stars 12.02k forks source link

[Windows] clang producing crashing binary from exception not getting copy ctor called when using PCH #53486

Open mattjgalloway opened 2 years ago

mattjgalloway commented 2 years ago

[UPDATE: I found a more minimal example and I've put that as a separate comment below. Preserving the original here as it shows the direction we went in and may help diagnosis.]

We've come across a strange issue when using clang-cl.exe on Windows. It seems to have started somewhere between 10.0.0 and 11.0.0. And I can confirm I am seeing it with the latest, LLVM13.0.0, as well.

Important to note is that the issue only happens when using precompiled headers too.

(I am using clang with the standard library from Visual Studio 2019 with Windows SDK version 10.0.18362.0 and MSVC version 14.28.29333.)

First let me show the minimal test case I've come up with.

crash.cpp:

#include <future>
#include <cstdio>
int main(int argc, char* argv[]) {
  printf("started\n");
  {
    std::future<int> future;
    {
      std::promise<int> promise;
      future = promise.get_future();
    }
  }
  printf("didn't crash\n");
  return 0;
}

pch.hpp:

#include <future>

Then we compile, link and run with clang++.exe like this:

PS1> & 'C:\Program Files\LLVM13.0.0\bin\clang++.exe' -Xclang -emit-pch -o pch.gch pch.hpp
PS1> & 'C:\Program Files\LLVM13.0.0\bin\clang++.exe' -include-pch pch.gch -O0-o crash.exe crash.cpp
PS1> .\crash.exe
started
didn't crash

However if we compile, link and run with clang-cl.exe then it crashes:

PS1> & 'C:\Program Files\LLVM13.0.0\bin\clang-cl.exe' /Fppch.gch /Yc /c pch.hpp
PS1> & 'C:\Program Files\LLVM13.0.0\bin\clang-cl.exe' -Xclang -include-pch -Xclang pch.gch /Od -o crash.exe crash.cpp
PS1> .\crash.exe
started

If you omit the use of PCH but still use clang-cl.exe then it works again:

PS1> & 'C:\Program Files\LLVM13.0.0\bin\clang-cl.exe' /Od -o crash.exe crash.cpp
PS1> .\crash.exe
started
didn't crash

When tracing through why the crash is happening, it seems to be coming from the fact that at runtime, the CatchableType for the future_error exception has no copyFunction. That means that inside _CopyExceptionObject in ucrt, we just memcpy the exception data, which is wrong because the future_error needs the proper copy constructor to run. It seems that the compiler emitting nullptr for copyFunction in the clang-cl.exe+PCH case is wrong. That's my understanding at least thus far!

mattjgalloway commented 2 years ago

I've done a bit more digging on this and seem to have come up with an even more minimal test case, this time not using std::future/std::promise at all. And this time it crashes when compiled with plain clang.exe and also clang-cl.exe. It still crashes when using a PCH and doesn't crash when not using a PCH. Hopefully this is a little easier to digest therefore.

thing.hpp:

#pragma once

#include <cstdlib>

struct Thing {
  Thing() { mem_ = malloc(1); }
  ~Thing() { free(mem_); }

  Thing& operator=(const Thing& other) {
    if (this == &other) return *this;
    mem_ = malloc(1);
    return *this;
  }
  Thing(const Thing& other) {
    mem_ = malloc(1);
  }

  void* mem_;
};

pch.hpp:

#include "thing.hpp"

void foo() {
  try {
    Thing t;
    throw t;
  } catch (const Thing& t) {}
}

crash.cpp:

#include <cstdio>
#include "thing.hpp"

int main(int argc, char* argv[]) {
  printf("started\n");
  try {
    Thing t;
    throw t;
  } catch (Thing t) {}
  printf("didn't crash\n");
  return 0;
}

Then compiling with a PCH:

PS1> & 'C:\Program Files\LLVM13.0.0\bin\clang++.exe' -Xclang -emit-pch -o pch.gch pch.hpp
PS1> & 'C:\Program Files\LLVM13.0.0\bin\clang++.exe' -include-pch pch.gch -O0 -o crash.exe crash.cpp
PS1> .\crash.exe
started

Without a PCH:

PS1> & 'C:\Program Files\LLVM13.0.0\bin\clang++.exe' -O0 -o crash.exe crash.cpp
PS1> .\crash.exe
started
didn't crash

Again it’s crashing because the copy ctor of Thing was not called when it needed to be. That results in a double free of the mem_ member of Thing.

Note that it will also not crash, even with the PCH, if in crash.cpp you catch the exception by reference. But that's because there's no exception copying going on. The catch by value is there to force the copy ctor to need to be called.

mattjgalloway commented 2 years ago

I took a look at the LLVM in the case of using a PCH and not using a PCH and I can see a difference in how the CatchableType for Thing is represented:

Not crashy, copy ctor called:

@"_CT??_R0?AUThing@@@8??0Thing@@QEAA@AEBU0@@Z8" = linkonce_odr unnamed_addr constant %eh.CatchableType { i32 0, i32 trunc (i64 sub nuw nsw (i64 ptrtoint (%rtti.TypeDescriptor11* @"??_R0?AUThing@@@8" to i64), i64 ptrtoint (i8* @__ImageBase to i64)) to i32), i32 0, i32 -1, i32 0, i32 8, i32 trunc (i64 sub nuw nsw (i64 ptrtoint (%struct.Thing* (%struct.Thing*, %struct.Thing*)* @"??0Thing@@QEAA@AEBU0@@Z" to i64), i64 ptrtoint (i8* @__ImageBase to i64)) to i32) }, section ".xdata", comdat

Crash, copy ctor not called:

@"_CT??_R0?AUThing@@@88" = linkonce_odr unnamed_addr constant %eh.CatchableType { i32 0, i32 trunc (i64 sub nuw nsw (i64 ptrtoint (%rtti.TypeDescriptor11* @"??_R0?AUThing@@@8" to i64), i64 ptrtoint (i8* @__ImageBase to i64)) to i32), i32 0, i32 -1, i32 0, i32 8, i32 0 }, section ".xdata", comdat

I cannot claim to understand LLVM IR enough at this point to understand how to debug further, but I think I can glean enough here to see that this looks wrong. In the crashy case (i.e. using a PCH) we see that the CatchableType doesn't have the final field set. Whereas in the non-crashy case it's set to, well, the copy ctor of Thing (demangled, it's public: __cdeclThing::Thing(struct Thing const & __ptr64) __ptr64).

I hope this helps someone with more understanding to get to the bottom of the problem.

smeenai commented 2 years ago

CC @rnk, in case this rings a bell.

WenleiHe commented 2 years ago

cc @tentzen who's familiar with EH for windows in LLVM.

tentzen commented 2 years ago

Looks like it's related to COMDAT Folding (/OPT:ICF). It seems a Clang's linker bug.

The xdata objects, one from precompiled header and another one from crash.obj, should not be merged: "CTA1?AUThing@@": // from pch.gch .long 1 # 0x1 .long "CT??_R0?AUThing@@@84"

"CTA1?AUThing@@": // // from crash.obj .long 1 # 0x1 .long "CT??_R0?AUThing@@@8??0Thing@@QAE@ABU0@@Z4"

Try use the option in clang++ (if there is one) corresponding to CL's -Gy- to get around the problem.

mattjgalloway commented 2 years ago

@tentzen Many thanks for the reply! I've tried with /Gy- on the clang-cl driver path, and unfortunately, that doesn't help mitigate the problem.

I can't claim to know anywhere near as much as you here, but I'm unsure it's a linker problem if I see the LLVM that I mention in a comment above? Seems to indicate to me that it's missing at that point. Could it be related to pulling over the copy ctor information from the PCH somehow?

rnk commented 2 years ago

I put this on our bug scrub list to get someone to take a look at this. (@zmodem)

glebov-andrey commented 1 year ago

We're seeing the same behavior with LLVM 16.0.6 (although for some reason only in Release builds). I also tried mattjgalloway's test and it crashes with both 16.0.6 and 17.0.2. Has there been any progress on this issue?

glebov-andrey commented 1 year ago

I've done some debugging, and I think I've figured out what's going on. It appears to be an issue with serializing and deserializing the AST, not with COMDAT folding. If this is the case then it might also affect C++ Module support as that uses some of the same machinery.

It all comes down to CodeGen's MicrosoftCXXABI::getCatchableType() not seeing T's copy constructor when processing the PCH. This call to ASTContext::getCopyConstructorForExceptionObject() returns NULL https://github.com/llvm/llvm-project/blob/970e7456e0346f078962fa88bdc1b86fd9828d6f/clang/lib/CodeGen/MicrosoftCXXABI.cpp#L4171-L4172 That function ends up just looking up the constructor in AST's MicrosoftCXXABI::RecordToCopyCtor map https://github.com/llvm/llvm-project/blob/970e7456e0346f078962fa88bdc1b86fd9828d6f/clang/lib/AST/MicrosoftCXXABI.cpp#L150-L152 RecordToCopyCtor seems to be poulated correctly while initially parsing the C++ code in the -emit-pch invocation, but not when loading the PCH in the -include-pch invocation. Interestingly using -fpch-codegen doesn't actually help because internally it gets separated into two compiler invocations - one with -emit-pch and another with -include-pch.

The linker presumably just chooses the first CatchableTypeArray symbol it sees, which hapens to be the one from the PCH (referencing the incorrectly generated CatchableType symbol _CT??_R0?AUThing@@@88).

I don't know who to tag or how best to proceed, so sorry for the spam: @smeenai @WenleiHe @tentzen @rnk @zmodem

rnk commented 1 year ago

Nice debugging!

What we're doing here looks like a sparse field pattern: The copy ctor is effectively a field of the record decl, but we store it separately in a side table instead of making it a real field of the record itself. That suggests we can fix the issue by storing this "copy constructor for exception object" in the serialized CXXRecordDecl, even if during deserialization we put the data back into this side table. That seems like a reasonably natural serialization scheme, rather than storing the entire side table as a new record. Thoughts?

glebov-andrey commented 5 months ago

Hi, is there any more info about this? I recently tested (with LLVM 18.1.7), and this issue does indeed affect C++ modules, effectively making them unusable on Windows.