Libc++ breakage with -fsized-deallocation on MinGW

mstorsjo commented 3 months ago

Since 130e93cc26ca9d3ac50ec5a92e3109577ca2e702, included in the upcoming 19.x branch, Clang now enables -fsized-deallocation by default when targeting C++14 or newer.

Running the libc++ tests with such a clang binary ends up with 27 tests failing, see e.g. https://github.com/mstorsjo/llvm-mingw/actions/runs/9414695194/job/25934131202. (This is in build configurations with libc++ built as a DLL, it's not a problem when libc++ is linked statically.)

The effect of this change, is that when invoking delete, it generates a call to operator delete(void*, size_t) instead of a call to operator delete(void*).

By default, both operator delete(void*) and operator delete(void*, size_t) are provided by libcx++/libc++abi. If the user overrides operator delete(void*), but the compiler calls operator delete(void*, size_t), it should hit the libc++/libc++abi fallback implementation which calls operator delete(void*).

When everything is linked statically, the fallback operator delete(void*, size_t) ends up calling the user provided, overridden operator delete(void*). However, if the fallback operator delete(void*, size_t) resides in libc++.dll, a user provided operator delete(void*) won't get called, but it just calls the default operator delete(void*).

Thus, this change breaks the case when users override only operator delete(void*), not operator delete(void*, size_t) (which the user may not even have heard about), as the compiler generated code no longer ends up calling the user's overridden operator.

This isn't an entirely new issue; the same issue with operator overloading already exists in a number of cases, e.g. where the user overrides operator new(size_t) but the caller calls operator new(size_t, nothrow_t). We have covered these cases in libc++ tests by omitting some individual checks in these cases using ASSERT_WITH_OPERATOR_NEW_FALLBACKS, see https://github.com/llvm/llvm-project/blob/llvmorg-19-init/libcxx/test/support/test_macros.h#L353-L371

There is also another similar situation; if the user overrides operator new/delete, any calls to these functions within libc++.dll will still call the default functions within the DLL, instead of the functions provided by the user code (in another EXE/DLL) - these are waived in tests with ASSERT_WITH_LIBRARY_INTERNAL_ALLOCATIONS in the same file above.

For the fallbacks between different kinds of operators, this situation actually works correctly with MS STL, when linked as a DLL. This is because the import library to the MS STL DLL, msvcprt.lib, isn't a regular pure import library, but it contains a handful of regular object files (which end up statically linked into the user executables, if needed) that provide the fallback operator new/delete, so that fallbacks from a more specialized operator to a less specialized one do work in user code.

CC people involved in MinGW/libc++, @lazka @mati865 @alvinhochun @cjacek @jeremyd2019 - and from the libc++ side @ldionne

With around 4 weeks to go to the 19.x branch date, what direction should we take wrt this for the MinGW target?

I see two aspects of this; we want the libc++ testsuite to still work. But user code, that previously just did override operator delete(void*) but didn't override operator delete(void*, size_t), no longer works as it did, when using libc++ as a DLL. That's possibly a surprising change (even though the amount of projects overriding operators probably is quite small).

I see the following options:

We can revert the Clang change for the MinGW target. Keep defaulting to -fno-sized-deallocation for MinGW targets, just like before. The linked change above does this when targeting zOS already, and when targeting older versions of Darwin. See https://github.com/mstorsjo/llvm-project/commit/clang-no-sized-dealloc-mingw for an example of this.
We can add -fno-sized-deallocation to the MinGW libc++ testsuite configs. This silences the issue in the libc++ tests, but the potential issue for users still exists. See https://github.com/mstorsjo/llvm-project/commit/libcxx-mingw-no-sized-dealloc
We can also add another wrapping around the checks that we know and expect to fail when using a libc++ DLL build, like ASSERT_WITH_OPERATOR_NEW_FALLBACKS, see https://github.com/mstorsjo/llvm-project/commit/libcxx-waive-sized-operator-delete-fallback. The issue remains for users here as well.
We could construct a hybrid import library, with the operator new/delete fallback operators linked statically, just like MS STL does. I have a proof of concept of this ready, see https://github.com/mstorsjo/llvm-project/commit/libcxx-static-operators-import-library. (It's not complete and tested for MSVC style targets yet, and obviously would need to go behind conditionals etc. And this currently relies on llvm-ar specifically for merging static libraries, iirc GNU ar doesn't implement those options.) Technically, this is the best solution, as it makes the operator overloading work as expected - it also would allow getting rid of the existing ASSERT_WITH_OPERATOR_NEW_FALLBACKS check waivers for MinGW/DLL configs. But implementation wise it's quite messy.

With a couple weeks to go to the branch, which way do we go?

ldionne commented 3 months ago

IIUC this is basically a problem with the platform itself and the fact that DLLs don't exhibit the right behavior w.r.t. link-time replacement of functions, is that right?

If so, I am wary of trying too hard to work around the fundamental limitations of DLLs (option 4) since that feels like just applying a band-aid within the library, but users don't actually get the right behavior if they override functions in a similar way themselves.

Instead, wouldn't we better off acknowledging that this doesn't work with DLLs, adjusting the test suite accordingly and ideally issuing a diagnostic and rejecting any user trying to do this?

mstorsjo commented 3 months ago

IIUC this is basically a problem with the platform itself and the fact that DLLs don't exhibit the right behavior w.r.t. link-time replacement of functions, is that right?

Yes, pretty much.

If so, I am wary of trying too hard to work around the fundamental limitations of DLLs (option 4) since that feels like just applying a band-aid within the library, but users don't actually get the right behavior if they override functions in a similar way themselves.

No, not quite.

Option 4 would properly fix the issue (as far as practically possible on the platform) - users can override operator delete(void*) in their code, they can call/emit calls to operator delete(void*, size_t) and have that fall back to their own overridden code - within their own executable. The fix wouldn't be specific to libc++ but would fix the same issue for all users of libc++ that try to override operators.

It would also fix the preexisting issue of fallback between e.g. throw/nothrow and aligned/non-aligned operator new.

So it fixes the surprise that -fsized-deallocation brings, and takes libc++ to the same level of functionality as MS's own STL, which is the bar for what functionality you can have on this platform.

The only case it won't fix, is that the user's overridden operators won't be visible in other DLLs in the process. (And MS STL doesn't solve that case either.)

Instead, wouldn't we better off acknowledging that this doesn't work with DLLs, adjusting the test suite accordingly and ideally issuing a diagnostic and rejecting any user trying to do this?

Not really sure how we could emit a diagnostic for it, practically. And while some things are impossible to do correctly with DLLs, and there are preexisting cases of operator overloading that doesn't work as expected, this change (defaulting to sized deallocation) makes yet another pretty large regression in what works and what doesn't.

So after having thought about this for a few hours since I posted the options above...

I think that option 4 is the best and most correct, on a technical level. - It's also hugely Windows specific in how it works, so it would need to sit behind a bunch of conditionals in the cmake file, and I'm not sure how interested the rest of libc++ is in having such a kludge in there. There are a few technical details around the implementation of it (in my proof of concept so far), that maybe could be fixed better as well.

However practically, I don't think option 4 is ready to be merged before 19.x, and I wouldn't want to rush such a change.

So pragmatically, I would go with option 1 - with the plan that option 4 would be the goal, at some point, when the implementation is acceptable, and we'd punt on enabling -fsized-deallocation for this target until then.

alvinhochun commented 3 months ago

I haven't quite read the comments thoroughly, but if option 4 is the same way as how the MSVC STL deals with the issue, I would lean towards it. I agree there isn't nearly enough time to get this solution into 19.x, so I would also vote for option 1 - disabling -fsized-deallocation on MinGW for the time being.

cjacek commented 3 months ago

I agree that option 4 is best long term and 1 seems okay short term.

For the compatibility, AFAIK, the only way to make merging work with binutils ar are MRI scripts. llvm-lib (and MSVC lib.exe) does what we need here by default.

mstorsjo commented 3 months ago

For the compatibility, AFAIK, the only way to make merging work with binutils ar are MRI scripts. llvm-lib (and MSVC lib.exe) does what we need here by default.

Hmm - here I do the merging with the llvm-ar option L - but do you mean that lib.exe and llvm-lib also do that operation, if you just add an existing static archive to another one (instead of adding the static archive as one member)?

FWIW, one thing I'd still like to improve in the PoC for option 4, is that currently, the import lib first contains the symbols for operator new and delete, as they are exported from the DLL (and we still want them to be exported - older executables link against them), but we also add those symbols as regular objects into the merged library, so we have multiple archive members providing the same symbol. This feels borderline underspecified - or I guess it might be well defined, that the linker picks whichever member comes first?

So ideally, I'd like to omit that particular symbol from the import library, but keep it exported. So either an operation on the import library archive to remove those (short import) files before merging, or to have the linking operation produce a def file, remove those symbols from the def file and regenerate an import library with e.g. llvm-dlltool (or llvm-lib). The former seems more straightforward, but I'm not entirely sure how, while the latter seems quite roundabout.

ldionne commented 3 months ago

@mstorsjo Ack, thanks for all the information here.

I think I am rallying behind the idea that 4 is the best long term, and I guess 1 is the only viable option in the short term.

Maintainability-wise, I believe that refactorinh libcxxabi into libcxx like we’ve wanted to do for a long time might be a helpful first step since it would clean up the build system a lot. I might have time to put together a RFC for that in the coming weeks.

mstorsjo commented 3 months ago

Maintainability-wise, I believe that refactorinh libcxxabi into libcxx like we’ve wanted to do for a long time might be a helpful first step since it would clean up the build system a lot. I might have time to put together a RFC for that in the coming weeks.

FWIW, that doesn't make much of a difference either way for this fix, but I don't mind such a change - for the mingw builds we essentially treat libcxxabi as an integrated part of libcxx anyway.

mati865 commented 3 months ago

I agree with the others. Since the work on option 4 is already advanced, it'll be probably best to buy it some time by going option 1 for the upcoming release. It's not clear to me how beneficial sized deallocation is on Windows, but given MS is already doing it, they probably found it worthwhile, even with DLL shenanigans.

iirc GNU ar doesn't implement those options

This is strictly about libc++ building, right? I wouldn't worry about libc++ compilation too much, at MSYS2 we have stopped trying to build libc++ with GNU tools and use LLVM toolset for that task because the issues were too severe. It's hard to tell how many users care about using libc++ with Binutils, but I wouldn't consider it very popular anyway.

mstorsjo commented 3 months ago

It's not clear to me how beneficial sized deallocation is on Windows, but given MS is already doing it, they probably found it worthwhile, even with DLL shenanigans.

I'm not sure that the sized deallocation itself is beneficial in any way; the way I understand it is just that C++14 says that when you invoke delete it is supposed to invoke this new sized delete operator; this affects things like which of the user's provided operators that gets called.

iirc GNU ar doesn't implement those options

This is strictly about libc++ building, right?

Yes, that's true.

I wouldn't worry about libc++ compilation too much, at MSYS2 we have stopped trying to build libc++ with GNU tools and use LLVM toolset for that task because the issues were too severe. It's hard to tell how many users care about using libc++ with Binutils, but I wouldn't consider it very popular anyway.

Yeah it's probably not an important case. But we probably should either skip doing that library merging in that case, or error out cleanly or something.

mstorsjo commented 3 months ago

I made a PR for option 1, reverting to the older behaviour for the MinGW target now, in https://github.com/llvm/llvm-project/pull/97232.

ldionne commented 3 weeks ago

@mstorsjo Any development for a long term fix here?

mstorsjo commented 3 weeks ago

@mstorsjo Any development for a long term fix here?

No specific progress lately. I did make a quite decent PoC earlier, but I haven’t picked it up to finish it for review yet.

llvm / llvm-project

Libc++ breakage with -fsized-deallocation on MinGW #96899