Starting from CUDA 12.4, thrust::device_vector is implemented in a version- and architecture-dependent inline namespace, e.g. thrust::THRUST_200301_860_NS::device_vector.
The inline namespace depends on a compiler-defined macro, meaning that #include <thrust/device_vector.h> from a .cpp file compiled with a host compiler such as GCC or Clang will result in thrust::device_vector being instantiated in a different namespace compared to the instantiation in a .cu translation unit compiled with a CUDA compiler such as nvcc or nvc++.
This results in linker errors.
The short-term workaround is to set a preprocessor macro in host code so that the inline namespace matches the one the CUDA compiler generates, e.g. -DCMAKE_CXX_FLAGS=-D__CUDA_ARCH_LIST__=<sm_arch>, where
<sm_arch> is the CUDA compute capability encoded as major * 100 + minor * 10 (e.g. 8.6 becomes 860), so
-DCMAKE_CXX_FLAGS=-D__CUDA_ARCH_LIST__=800 # Ampere (A100)
-DCMAKE_CXX_FLAGS=-D__CUDA_ARCH_LIST__=860 # Ampere (e.g. RTX 30 series)
-DCMAKE_CXX_FLAGS=-D__CUDA_ARCH_LIST__=890 # Ada Lovelace
-DCMAKE_CXX_FLAGS=-D__CUDA_ARCH_LIST__=900 # Hopper
The proper fix is to stop exposing thrust::device_vector in host code altogether, for example by encapsulating it behind the pimpl idiom. (Moving everything into .cu translation units would be a maintenance nightmare, since the code also has to support AMD and CPU-only builds.)
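A minimal sketch of such a pimpl wrapper (the names DeviceBuffer and Impl are made up for illustration; here std::vector stands in for thrust::device_vector so the sketch is self-contained, but in the real .cu implementation the member would be the Thrust container):

```cpp
// device_buffer.h -- no Thrust includes, so any .cpp TU can use it.
#include <cstddef>
#include <memory>
#include <vector>

class DeviceBuffer {
public:
    explicit DeviceBuffer(std::size_t n);
    ~DeviceBuffer();                 // defined where Impl is complete
    DeviceBuffer(DeviceBuffer&&) noexcept;
    DeviceBuffer& operator=(DeviceBuffer&&) noexcept;
    std::size_t size() const;
private:
    struct Impl;                     // defined only in the .cu file
    std::unique_ptr<Impl> impl_;
};

// device_buffer.cu (sketch) -- the only TU that would see Thrust.
// std::vector is a stand-in for thrust::device_vector<float> here.
struct DeviceBuffer::Impl {
    std::vector<float> data;         // real code: thrust::device_vector<float>
};
DeviceBuffer::DeviceBuffer(std::size_t n) : impl_(new Impl{std::vector<float>(n)}) {}
DeviceBuffer::~DeviceBuffer() = default;
DeviceBuffer::DeviceBuffer(DeviceBuffer&&) noexcept = default;
DeviceBuffer& DeviceBuffer::operator=(DeviceBuffer&&) noexcept = default;
std::size_t DeviceBuffer::size() const { return impl_->data.size(); }
```

Because the header never names the Thrust type, the inline-namespace mismatch between host and CUDA translation units can no longer surface at link time.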
The current strategy of instantiating thrust::device_vector in host code is subject to some non-obvious constraints that would go away with a pimpl encapsulation:
Under the hood, this calls cudaMalloc and launches a GPU kernel to default-initialize the elements. Because of the
GPU kernel, this will not compile in .cpp files. That is why reallocateDevice exists, with a plain C++ function prototype
and an implementation in a .cu file.
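A sketch of what such a split can look like (the exact signature of reallocateDevice is an assumption, not the project's actual prototype):

```cpp
// reallocate_device.cu -- compiled by nvcc only.
// The header seen by .cpp files declares just the plain prototype:
//   void reallocateDevice(thrust::device_vector<float>& v, std::size_t n);
#include <cstddef>
#include <thrust/device_vector.h>

void reallocateDevice(thrust::device_vector<float>& v, std::size_t n)
{
    // resize() may launch a GPU kernel to default-initialize the new
    // elements -- that kernel is why this definition must live in a
    // .cu translation unit rather than in a .cpp file.
    v.resize(n);
}
```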
Calls cudaMalloc to allocate the device memory, then calls cudaMemcpy to copy the host data to the GPU.
No GPU kernels are involved, only CUDA runtime API calls, so this works fine in .cpp files.
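For contrast, the kernel-free path can be sketched like this (copyToDevice is a hypothetical name; error handling is omitted for brevity):

```cpp
// copy_to_device.cpp -- compiles with a plain host compiler (GCC/Clang)
// because it uses only CUDA runtime API calls, never a kernel launch.
#include <cstddef>
#include <vector>
#include <cuda_runtime_api.h>

float* copyToDevice(const std::vector<float>& host)
{
    float* dev = nullptr;
    const std::size_t bytes = host.size() * sizeof(float);
    cudaMalloc(&dev, bytes);                                     // device allocation
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice); // host-to-device copy
    return dev;                                                  // caller must cudaFree
}
```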