stotko / stdgpu

stdgpu: Efficient STL-like Data Structures on the GPU
https://stotko.github.io/stdgpu/
Apache License 2.0

unordered map creation freezes async processes #350

Open trsh opened 1 year ago

trsh commented 1 year ago

Describe the bug
unordered map creation freezes async processes

Steps to reproduce

runBuldKernel<<<block_size_x, thread_size_x, 0, build_stream>>>(ng, object_size_ui);

// The line below would only complete when runBuldKernel is done
stdgpu::unordered_map<uint32_t, uint32_t> map = stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);

Expected behavior
The map creation and memory allocation should complete right away, without waiting for runBuldKernel to complete.

Actual behavior
The map creation and memory allocation complete only after runBuldKernel is done.


trsh commented 1 year ago

runBuldKernel<<<block_size_x, thread_size_x, 0, build_stream>>>(ng, object_size_ui);

printf("1\n");
//stdgpu::unordered_map<uint32_t, uint32_t> map = stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);
Pointer* p;
cudaMalloc(&p, 1 * sizeof(Pointer));
printf("2\n")

This, in contrast, works asynchronously: the allocation happens without waiting for runBuldKernel to complete.

stotko commented 1 year ago

This is a known limitation. Although the parallel algorithms from thrust used in stdgpu, as well as stdgpu's intermediate interface, all support arbitrary execution_policys (in which a CUDA stream can be encapsulated), most functions (including stdgpu::unordered_map::createDeviceObject) currently fall back to the default stream. Thus, the behavior you observe primarily comes from how CUDA handles the default stream, which by default follows the "legacy" behavior and forces synchronization.

I think adding explicit support for asynchronous streams would be a good enhancement. Until this feature lands in stdgpu, as a workaround you could either 1. move the creation of the map to an earlier stage if that is possible, or 2. enable "per-thread" behavior for the default stream, which can be set with the --default-stream compiler flag (see the sketch below).
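
For reference, a minimal sketch of option 2, assuming nvcc is invoked directly (app.cu is just a placeholder for your source file; with CMake or another build system the flag has to be forwarded accordingly):

nvcc --default-stream per-thread -o app app.cu

With this setting, work submitted without an explicit stream goes to a per-thread stream that does not implicitly synchronize with other streams. Note that this changes the behavior of every default-stream launch in the compiled code, so it should be tested carefully.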

trsh commented 1 year ago

Option 1 is not possible. And I am not sure what option 2 does; I need to read up on it so that it doesn't break something else.

stotko commented 1 year ago

For reference, #351 tracks all affected functions which currently do not have proper support for custom execution_policys such as thrust::device.on(stream).
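
For illustration only (this is not something stdgpu exposes yet), a stream-bound policy passed directly to a thrust algorithm would look roughly like the sketch below; build_stream is assumed to be your cudaStream_t, and thrust::cuda::par is the CUDA-specific policy that provides .on(stream):

#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/system/cuda/execution_policy.h>

void fill_on_stream(cudaStream_t build_stream, thrust::device_vector<int>& values)
{
    // Sketch only: bind the algorithm to the user-provided stream instead of
    // the default stream. stdgpu would have to forward such a policy internally.
    thrust::fill(thrust::cuda::par.on(build_stream), values.begin(), values.end(), 0);
}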

trsh commented 1 year ago

@stotko It doesn't seem like the default stream is the issue. The code below works asynchronously:

runBuldKernel<<<block_size_x, thread_size_x>>>(ng, object_size_ui);
printf("1\n");
//stdgpu::unordered_map<uint32_t, uint32_t> map = stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);
Pointer* p;
cudaMalloc(&p, 1 * sizeof(Pointer));
printf("2\n");

cudaMalloc and printf("2\n") run right away, without waiting for runBuldKernel (which I made an infinite loop) to finish. As I did not specify a stream, this all goes to the default one.

When I uncomment the map part, it blocks, no matter what comes after it.

stotko commented 1 year ago

Thanks for further testing. I still believe that the issue is related to the default stream. Just to make sure, could you try calling another kernel on the default stream (it could be anything), while runBuldKernel uses build_stream as done before, so that you have the same setup described in the legacy default stream section of the CUDA documentation.

In contrast to a pure cudaMalloc, which does not block, stdgpu::unordered_map::createDeviceObject also has to initialize its internal arrays, which is done by calling kernels on the default stream.

trsh commented 1 year ago

runBuldKernel<<<block_size_x, thread_size_x, 0, build_stream>>>(ng, object_size_ui);
printf("1\n");
k_2<<<1, 1>>>();
printf("2\n");
stdgpu::unordered_map<uint32_t, uint32_t> map = stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);
printf("3\n");

k_2 is executed without waiting. And then it blocks at the map creation, i.e. "2" is printed but not "3".

stotko commented 1 year ago

I have reproduced your observations. In fact, there are two issues:

  1. The legacy default stream forces a strict order on the execution of the kernels and implicitly (weakly) synchronizes the scheduled kernels, but it leaves the CPU asynchronous, hence the non-blocking printf statements.
  2. Since CUDA 10.1, thrust's algorithms are all synchronous unless the thrust::async versions are used. More recent versions, i.e. thrust 1.16 and newer, introduced the asynchronous policy thrust::cuda::par_nosync.on(stream), which makes the called (by default synchronous) algorithms behave like custom CUDA kernels without blocking the CPU.

In that sense, you are right that my initial explanation was not sufficient. Fortunately, adding support for custom execution_policys would still resolve the issue, as the above policy could be used to force the correct behavior on thrust's side.
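
As a rough sketch (assuming thrust 1.16 or newer and a host function with access to build_stream), this is what the non-blocking variant would look like; unlike thrust::cuda::par.on(stream), the par_nosync policy returns control to the CPU immediately:

#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/system/cuda/execution_policy.h>

void fill_nosync(cudaStream_t build_stream, thrust::device_vector<int>& values)
{
    // par_nosync enqueues the algorithm on build_stream and does not block the
    // calling CPU thread, similar to a plain kernel launch on that stream.
    thrust::fill(thrust::cuda::par_nosync.on(build_stream), values.begin(), values.end(), 0);
}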

trsh commented 1 year ago

So there is currently no way to make this happen asynchronously?

stotko commented 1 year ago

If you are only concerned about the CPU-blocking part and the stream ordering behavior is acceptable, then a workaround could be to create the unordered_map object in a different CPU thread, for instance using std::async. The creation would then block in the newly created thread while the main thread continues normally.
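
A minimal sketch of this workaround (assuming stdgpu's unordered_map header and that the map is only needed later in the same host code):

#include <cstdint>
#include <future>

#include <stdgpu/unordered_map.cuh>

// Inside the host function that previously created the map directly:
// run the (blocking) creation on a separate CPU thread so that this thread
// can keep enqueueing work on build_stream.
auto map_future = std::async(std::launch::async, [] {
    return stdgpu::unordered_map<uint32_t, uint32_t>::createDeviceObject(8);
});

// ... launch runBuldKernel and other work on build_stream here ...

// Block only when the map is actually needed.
stdgpu::unordered_map<uint32_t, uint32_t> map = map_future.get();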