rapidsai / rmm

RAPIDS Memory Manager
https://docs.rapids.ai/api/rmm/stable/
Apache License 2.0
472 stars 192 forks source link

[BUG] thrust version of `cuda_async_memory_resource` constructor is broken #1530

Closed willkill07 closed 5 months ago

willkill07 commented 5 months ago

Describe the bug

Observed in 24.04.00 but still exists on branch-24.06.

Compilation failure with cuda_async_memory_resource. The thrust::optional variant of the constructor to cuda_async_memory_resource includes code which cannot be compiled.

Essentially, this means the thrust::optional versioned constructor, if ever invoked, would always cause compilation failure. Unfortunately, this also affects static analysis tools such as clang-tidy in a way that will always yield failure, making this bug likely more severe than originally observed.

The actual bug is the code that states .value_or(std::nullopt) in the delegating constructor.

thrust::optional<T>::value_or expects a value which will always yield T. Passing std::nullopt is incompatible.

Steps/Code to reproduce bug

Option 1

Attempt to instantiate a new instance of cuda_async_memory_resource with the thrust::optional constructor

Option 2

Invoke clang-tidy on a file which happens to include <rmm/mr/device/cuda_async_memory_resource.hpp>

You end up with a compilation error diagnostic similar to:

/home/willk/envs/theseus_dev/include/libcudf/rapids/thrust/optional.h:1726:5: error: static assertion failed due to requirement 'std::is_convertible<const std::nullopt_t &, rmm::mr::cuda_async_memory_resource::allocation_handle_type>::value': T must be copy constructible and convertible from U [clang-diagnostic-error]
    static_assert(std::is_copy_constructible<T>::value &&
    ^
/home/willk/envs/theseus_dev/include/rmm/mr/device/cuda_async_memory_resource.hpp:100:53: note: in instantiation of function template specialization 'thrust::optional<rmm::mr::cuda_async_memory_resource::allocation_handle_type>::value_or<const std::nullopt_t &>' requested here
                                 export_handle_type.value_or(std::nullopt))
                                                    ^
/home/willk/envs/theseus_dev/include/libcudf/rapids/thrust/optional.h:1729:35: error: cannot convert 'const std::nullopt_t' to 'rmm::mr::cuda_async_memory_resource::allocation_handle_type' without a conversion operator [clang-diagnostic-error]
    return has_value() ? **this : static_cast<T>(std::forward<U>(u));
                                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expected behavior

I expect the code to compile.

Potential fix

Ultimately, a conversion from thrust::optional -> std::optional must be performed. The monadic operations on thrust::optional don't easily accomplish this, so falling back to a simple ternary is sufficient.

thrust::optional<T> thrust_opt = ...;
std::optional<T> std_opt{thrust_opt.has_value() ? std::make_optional(*thrust_opt) : std::nullopt};

Environment details (please complete the following information):

Additional context None

harrism commented 5 months ago

Hi @willkill07 . Thanks for this report and apologies for the trouble. We can't fix this in 24.04 because we just completed the release for that branch. And we don't need to fix it in 24.06 because this function was deprecated in 24.04 so can just be removed in 24.06.

harrism commented 5 months ago

As a workaround if you are dependent on 24.04, use only the std::optional ctor. CC @miscco

willkill07 commented 5 months ago

If it won't be fixed in 24.04 then we will have to skip 24.04 entirely. We depend on clang-tidy and this constructor breaks it outright.

Anyone who uses 24.04 with clang or clang-tidy will be forced to skip it as well.

willkill07 commented 5 months ago

I don't think you understand the severity. We are using the optional version of the constructor but clang-tidy encounters a compilation failure as soon as it parses the problematic constructor.

willkill07 commented 5 months ago

Actually, I'd like to amend my initial report. It breaks as soon as you include the file and attempt to use clang as the compiler. "Equivalent" reproducer: https://godbolt.org/z/bqKvddaE6

harrism commented 5 months ago

@willkill07 I'm really sorry for the trouble. Bugs happen. 😞

We would like to avoid a hotfix in RAPIDS, especially at the bottom of the stack. We have documented criteria for hotfixes here: https://docs.rapids.ai/releases/hotfix/

Applying those, I came up with this:

  1. Is this bug significant? Perhaps (fails to compile in some situations)
  2. Does it affect the majority of users? No*
  3. Is there a reasonable workaround? Yes**

*I think the vast majority compile with GCC, or we would have caught this sooner. RAPIDS developers use clangd and clang-tidy, but CI does not depend on them. **As a workaround, you could patch (delete that constructor) when you vendor RMM. FWIW, RAPIDS regularly applies patches (for example) to Thrust and other libraries.

Please let us know if the criteria evaluate differently for you or if this seems unreasonable.

willkill07 commented 5 months ago

We have patched the constructor in our internal feedstock.

Talking to other folks internally at Voltron Data, we would be more than happy to do testing before releases are made to help the burndown process for bi-monthly releases (and be better users).

Also, would you be opposed to having a contribution made for RMM such that clang-tidy could live in CI?

harrism commented 4 months ago

I would definitely not be opposed to clang-tidy. I already set up configuration for it years ago, and I run it while developing.