sekelle / cornerstone-octree

Local and distributed octrees based on Morton codes with halo discovery and exchange with a 3D collision detection algorithm

Issues with `thrust::device_vector` starting from CUDA 12.4 #28

Closed sekelle closed 21 hours ago

sekelle commented 3 months ago

Starting with CUDA 12.4, `thrust::device_vector` is implemented in a version- and architecture-dependent inline namespace, e.g. `thrust::THRUST_200301_860_NS::device_vector`.

The inline namespace depends on a compiler-defined macro. This means that `#include <thrust/device_vector.h>` in a .cpp file compiled with a host compiler such as GCC or Clang instantiates `thrust::device_vector` in a different namespace than a .cu translation unit compiled with a CUDA compiler such as nvcc or nvc++. This results in linker errors.

The short-term workaround is to define a preprocessor macro in host code so that the inline namespace matches the one the CUDA compiler generates, e.g. `-DCMAKE_CXX_FLAGS=-D__CUDA_ARCH_LIST__=<sm_arch>`, where `<sm_arch>` is the CUDA compute capability with a zero appended, so

```
-DCMAKE_CXX_FLAGS=-D__CUDA_ARCH_LIST__=800 # Ampere (sm_80, e.g. A100)
-DCMAKE_CXX_FLAGS=-D__CUDA_ARCH_LIST__=860 # Ampere (sm_86, e.g. RTX 30 series)
-DCMAKE_CXX_FLAGS=-D__CUDA_ARCH_LIST__=900 # Hopper (sm_90)
```

The proper fix is to stop exposing `thrust::device_vector` in host code altogether, for example by encapsulating it with the pimpl idiom. (Moving everything into .cu translation units would be a maintenance nightmare, since the code also has to support AMD and CPU-only builds.)

The current strategy of instantiating `thrust::device_vector` in host code is subject to some non-obvious constraints that would go away with a pimpl encapsulation:

```cpp
thrust::device_vector<unsigned> d_neighborsCountGpu(localCountGpu);
```

Under the hood, this calls cudaMalloc and launches a GPU kernel to default-initialize the elements. Because of the kernel launch, this does not compile in .cpp files. That's why there is `reallocateDevice` with a plain C++ function prototype and an implementation in a .cu file.

```cpp
thrust::device_vector<unsigned> d_neighborsCountGpu = std::vector<unsigned>(localCountGpu);
```

This calls cudaMalloc to allocate the device memory, then cudaMemcpy to copy the host data to the GPU. No GPU kernels, only CUDA runtime API calls, so this works fine in .cpp files.