rapidsai / rmm

RAPIDS Memory Manager
https://docs.rapids.ai/api/rmm/stable/
Apache License 2.0
478 stars 195 forks source link

[BUG] Cannot catch RMM exceptions thrown across DSO boundaries #1652

Closed wence- closed 1 month ago

wence- commented 1 month ago

Describe the bug

Since #1644, the default symbol visibility for RMM is hidden.

This is a good thing, however, the stick was applied a bit too aggressively.

In particular: exception classes need to have default visibility if we want to be able to catch an error thrown from DSO-A in DSO-B (otherwise the linker won't have disambiguated the names and the typeinfos will not compare equal).

I'm also worried that we probably need to mark all public non-template classes as RMM_EXPORT as well, since again, the two classes will not be the same (though will at least be ABI compatible) when passed between DSOs.

wence- commented 1 month ago

See https://www.akkadia.org/drepper/dsohowto.pdf page 22 and https://gcc.gnu.org/wiki/Visibility for some discussion of the issue with exceptions

jameslamb commented 1 month ago

Thanks for opening this.

First, we should clarify what's meany by "for RMM". #1644 only changed the visibility of symbols for the Cython bits of the rmm Python package. It shouldn't have in any way affected the use of RMM as a header-only library.

I've seen those resources you'd linked, but I was assuming that within the Python package, there wouldn't be cross-DSO throwing of anything not already marked for export via Cython's mechanisms.

Could you share a reproducible example of a case that used to work and is now broken?

Full disclosure, I did #1644 but this is way at the edge of my C++ understanding, so hopefully @robertmaynard will be able to help as well.

wence- commented 1 month ago

libcudf #includes rmm headers and, in some circumstances will throw an exception. This will now have a different typeinfo from the exception that is in the Cython DSO, since the two are compiled with different visibility and visibility-hidden wins out. So I think you won't be able to catch it. I don't right now have a 24.10 to test with though.

robertmaynard commented 1 month ago

It looks like the converstation we had #1644 wasn't sufficient ( https://github.com/rapidsai/rmm/pull/1644#issuecomment-2286236030 ) .

Types such as rmm::cuda_error that inherit from std::runtime_error have vtables and therefore need to be marked as RMM_EXPORT so that you can throw one in CythonDSO_A and catch it in CythonDSO_B. Likewise the host_memory_resource, device_memory_resource and all derived types would also need to be marked as RMM_EXPORT if we expect them to cross DSO boundaries.