stotko / stdgpu

stdgpu: Efficient STL-like Data Structures on the GPU
https://stotko.github.io/stdgpu/
Apache License 2.0
1.15k stars 81 forks source link

unordered_map creation crashing with cuda 12.x #338

Closed trsh closed 1 year ago

trsh commented 1 year ago

Describe the bug

unordered_map creation crashes with cuda 12.x (testing with release build)

Steps to reproduce In cu main()

stdgpu::unordered_map<int, int> map = stdgpu::unordered_map<int, int>::createDeviceObject(1);

Expected behavior Abvious

Actual behavior

Exception thrown at 0x00007FFD29D140AC in CudaHelloWorld.exe: Microsoft C++ exception: thrust::system::system_error at memory location 0x0000009373EFF210.
Unhandled exception at 0x00007FFD29D140AC in CudaHelloWorld.exe: Microsoft C++ exception: thrust::system::system_error at memory location 0x0000009373EFF210.

System (please complete the following information):

stotko commented 1 year ago

This looks like an issue related to some low-level concepts according to https://nvidia.github.io/thrust/api/classes/classthrust_1_1system_1_1system__error.html

Could you try this code

try
{
    stdgpu::unordered_map<int, int> map = stdgpu::unordered_map<int, int>::createDeviceObject(1);
}
catch(thrust::system_error e)
{
    std::cerr << "Error: " << e.what() << std::endl;
}

and report what exact error is thrown by thrust?

trsh commented 1 year ago

This looks like an issue related to some low-level concepts according to https://nvidia.github.io/thrust/api/classes/classthrust_1_1system_1_1system__error.html

Could you try this code

try
{
    stdgpu::unordered_map<int, int> map = stdgpu::unordered_map<int, int>::createDeviceObject(1);
}
catch(thrust::system_error e)
{
    std::cerr << "Error: " << e.what() << std::endl;
}

and report what exact error is thrown by thrust?

Error: after reduction step 2: cudaErrorInvalidDeviceFunction: invalid device function

Mark I do other thrust and cuda stuff with no problems. GPU is rtx3060

trsh commented 1 year ago

There are some warnings in output:

1>libs/stdgpu/include\stdgpu/bitset.cuh(85): warning #20012-D: __host__ annotation is ignored on a function("reference") that is explicitly defaulted on its first declaration
1>
1>Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
1>
1>libs/stdgpu/include\stdgpu/bitset.cuh(85): warning #20012-D: __device__ annotation is ignored on a function("reference") that is explicitly defaulted on its first declaration
1>
1>libs/stdgpu/include\stdgpu/bitset.cuh(85): warning #20012-D: __host__ annotation is ignored on a function("reference") that is explicitly defaulted on its first declaration
1>
1>Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
1>
1>libs/stdgpu/include\stdgpu/bitset.cuh(85): warning #20012-D: __device__ annotation is ignored on a function("reference") that is explicitly defaulted on its first declaration
stotko commented 1 year ago

Thanks for the pointers.

The warnings you observe are related to a different issue where NVCC reports that STDGPU_HOST_DEVICE, which resolves to a __host__ __device__ annotation in the CUDA backend, is not required for a the copy constructor of bitset as it is marked = default;. This can be safely ignored for now.

The cudaErrorInvalidDeviceFunction error indicates that the device code seems to be not correctly compiled for some reason. Could you share your CMakeLists.txt file which contains the calls to find/include stdgpu as well as on how your executable is set up? In particular, it would be interesting to see how the compute capability is set (should be something like set(CMAKE_CUDA_ARCHITECTURES 86) for the NVIDIA RTX 3060 GPU). Furthermore, could you also provide the log by CMake when the project is configured, i.e. when calling cmake .. within the build directory?

trsh commented 1 year ago

Thanks for the pointers.

The warnings you observe are related to a different issue where NVCC reports that STDGPU_HOST_DEVICE, which resolves to a __host__ __device__ annotation in the CUDA backend, is not required for a the copy constructor of bitset as it is marked = default;. This can be safely ignored for now.

The cudaErrorInvalidDeviceFunction error indicates that the device code seems to be not correctly compiled for some reason. Could you share your CMakeLists.txt file which contains the calls to find/include stdgpu as well as on how your executable is set up? In particular, it would be interesting to see how the compute capability is set (should be something like set(CMAKE_CUDA_ARCHITECTURES 86) for the NVIDIA RTX 3060 GPU). Furthermore, could you also provide the log by CMake when the project is configured, i.e. when calling cmake .. within the build directory?

I built 'stdgpu' by running cmake first to generate a Visual Studio solution (I did not use the sh scripts, since I am on windows). Used all default flags. Opened the solution file, picked release and built the solution. Then I run the cmake install command, to extract .lib and necessary includes. Then I added all of these to my cuda project, linked the lib and added additional includes. I am not good with cmake, so I won't be able to do elaborate much without precise instructions.

This is the log from cmake configuration https://pastebin.com/x78Kf67J

stotko commented 1 year ago

Thanks. The CMake configuration of stdgpu looks reasonable and I do not see an immediate issue here. Just to double-check: Since the tests are also built, could you run the respective executable teststdgpu and check whether all tests pass or the same error is thrown? At the end, you should see a summary by the involved googletest library stating how many tests failed or passed.

In case that the unit tests work as expected, I guess that the issue is more likely related to the configuration of your Visual Studio project. As I am primarly working with CMake and Ubuntu, this could be hard for me to reproduce, especially without further knowledge about your configuration on Windows. Could you check what compiler flags in Visual Studio are used to compile the .cu files in your CUDA project? Is there a flag like --generate-code arch=compute_86,code=sm_86" (or a similar/shorter version) which tells NVCC to generate code for your GPU? If not, you could try adding this flag and check if the issue disappears.

trsh commented 1 year ago

Thanks. The CMake configuration of stdgpu looks reasonable and I do not see an immediate issue here. Just to double-check: Since the tests are also built, could you run the respective executable teststdgpu and check whether all tests pass or the same error is thrown? At the end, you should see a summary by the involved googletest library stating how many tests failed or passed.

In case that the unit tests work as expected, I guess that the issue is more likely related to the configuration of your Visual Studio project. As I am primarly working with CMake and Ubuntu, this could be hard for me to reproduce, especially without further knowledge about your configuration on Windows. Could you check what compiler flags in Visual Studio are used to compile the .cu files in your CUDA project? Is there a flag like --generate-code arch=compute_86,code=sm_86" (or a similar/shorter version) which tells NVCC to generate code for your GPU? If not, you could try adding this flag and check if the issue disappears.

I will run the tests once I get back to my work PC. My compute flags are compute_72,sm_72? Is 86 a requirement?

Edit

However compute_86,code=sm_86 did not make a difference. Looking up for the tests,

trsh commented 1 year ago

All tests very successful. This is starting to look very strange :)

stotko commented 1 year ago

Thanks, this sort of confirms that most likely something is wrong with the configuration in Visual Studio. However, I wonder why applying the flags for CC 8.6 did not results in any improvements. The previous flags were for building on Tegra and Jetson devices, see https://en.wikipedia.org/wiki/CUDA#GPUs_supported for a complete list of how compute capabilities (CC) map to GPUs. So replacing or adding the flags for the RTX 3060 should have made a difference. Without them, the generated executable will only work on the Tegra and Jetson devices and not on the RTX 3060.

trsh commented 1 year ago

I built stdgpu on compute_86,code=sm_86 and changed also my project on compute_86,code=sm_86. Copied again over lib and includes, but same. I have no more idea right now.

trsh commented 1 year ago

Tests work, but \stdgpu-master\stdgpu-master\examples\cuda\unordered_map.cu suffers from same issue in stdgpu solution. How is that?

Update:

Example works in debug mode. But not in release

trsh commented 1 year ago

@stotko can u make a fresh clone & built and test examples\cuda\unordered_map.cu in Release x64. I think it should be reproducible also on your system.

stotko commented 1 year ago

I was able to reproduce the issue on Windows. Ubuntu seems to be unaffected and works normally in all cases.

The issue is that Visual Studio is a multi-config generator, i.e. that can have both Debug and Release builds next to each other, which becomes problematic here when the build type is inconsistent. You can see in the CMake log (which you provided) that the Build type entry in the summary is empty. Since this seems to "match" the Debug configuration, the debug build worked fine on your machine. However, the Release configuration comes with distinct standard libraries to link against, etc. in comparison to the Debug mode and this mismatch or inconsistency seems to somehow cause the issue.

Could you try configuring CMake (within the build directory) with

cmake -DCMAKE_BUILD_TYPE=Release ..

and then build the generated Visual Studio solution in Release x64 mode? The error should then disappear and you should see the message The duplicate-free map of numbers contains 101 elements (101 expected) and the computed sums are (5050, 338350) ((5050, 338350) expected).

trsh commented 1 year ago

Yes seems like this did it. Feel free to close this if it feels appropriate.

stotko commented 1 year ago

Thanks for checking. Making this more robust on our side is not that trivial. One potential way would be to make the library header-only, like thrust, which however would require several major changes, see #97.

Closing this issue as besides future improvements to the robustness the issue has been resolved.