stotko / stdgpu

stdgpu: Efficient STL-like Data Structures on the GPU
https://stotko.github.io/stdgpu/
Apache License 2.0
1.15k stars 81 forks

add `occupied(n)` for unordered set and map #427

Open tanzby opened 1 month ago

tanzby commented 1 month ago

Add a `bool occupied(index n)` function for `unordered_set` and `unordered_map`, so that we can write:

struct AllocateNewBlocks {
  AllocateNewBlocks(stdgpu::unordered_map<xx> block_map,
                    BlockBuffer buffer_buffer,
                    stdgpu::unordered_set<xx> not_exist_block_indices)
      : buffer_buffer(buffer_buffer),
        block_map(block_map),
        not_exist_block_indices(not_exist_block_indices) {}

  __device__ void operator()(const stdgpu::index_t index) {
    if (!not_exist_block_indices.occupied(index)) { // Proposed occupied() used here: skip empty slots.
      return;
    }
    // Slot `index` is known to hold a key, so it can be read directly.
    const BlockIndex block_index = *(not_exist_block_indices.begin() + index);
    if (const auto& [iter, is_inserted] = block_map.emplace(block_index, 0); is_inserted) {
      iter->second = buffer_buffer.AllocateBlock();
    }
  }

  BlockBuffer buffer_buffer;
  stdgpu::unordered_map<xx> block_map;
  stdgpu::unordered_set<xx> not_exist_block_indices;
};

stdgpu::for_each_index(thrust::cuda::par.on(stream()),
                       not_exist_block_indices.max_size(),
                       AllocateNewBlocks(block_map, block_buffer, not_exist_block_indices));

This avoids the need to first obtain the device_range().
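
For contrast, a rough sketch of the device_range()-based formulation this would replace (reusing the hypothetical BlockIndex/BlockBuffer/block_map names from the snippet above):

struct AllocateNewBlock {
  __device__ void operator()(const BlockIndex& block_index) {
    // device_range() already yields only stored keys, so no occupancy check is needed here.
    if (const auto& [iter, is_inserted] = block_map.emplace(block_index, 0); is_inserted) {
      iter->second = buffer_buffer.AllocateBlock();
    }
  }

  stdgpu::unordered_map<xx> block_map;
  BlockBuffer buffer_buffer;
};

// device_range() gathers the currently stored keys into a densely packed range.
auto occupied_keys = not_exist_block_indices.device_range();
thrust::for_each(thrust::cuda::par.on(stream()),
                 occupied_keys.begin(), occupied_keys.end(),
                 AllocateNewBlock{block_map, block_buffer});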

tanzby commented 1 month ago

@stotko Could you help review this PR? Thanks.

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 97.19%. Comparing base (3b7d712) to head (87fe7d7).

| Files | Patch % | Lines |
|---|---|---|
| src/stdgpu/impl/unordered_map_detail.cuh | 0.00% | 2 Missing :warning: |
| src/stdgpu/impl/unordered_set_detail.cuh | 0.00% | 2 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #427      +/-   ##
==========================================
- Coverage   97.34%   97.19%   -0.16%
==========================================
  Files          32       32
  Lines        2524     2528       +4
==========================================
  Hits         2457     2457
- Misses         67       71       +4
```

:umbrella: View full report in Codecov by Sentry.

stotko commented 1 month ago

> Add a `bool occupied(index n)` function for `unordered_set` and `unordered_map`, so that we can write: [...]
>
> This avoids the need to first obtain the device_range().

Thanks for working on this. However, I believe that exposing the occupied function is not the right way to move forward, since this function is really meant as an implementation detail of the base container. Even exposing begin(), for symmetry with end() (which is required for find()), already gives more access to the internals than is typically needed.

While the use case you mentioned is fine, it may suffer from poor performance: the load factor of unordered_map/unordered_set is typically low, so the container is only sparsely filled, which leads to low thread utilization in your kernel because many threads of a warp return immediately in the if statement. That is the reason for having device_range(): it densely packs all occupied values.
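
As a purely illustrative example: if the set was created for 100,000 keys but currently holds only 5,000, an occupied()-guarded launch over max_size() schedules at least 100,000 threads of which at most 5% do any work, whereas a device_range()-based launch touches exactly the 5,000 stored keys.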

As mentioned in #423, adding (host-only) overloads with an additional stream argument to the load() and store() functions of atomic would be better, as it addresses the actual underlying problem. More concretely, I see two options here:

  1. Also implement a stream-aware host-to-device memcpy function: clean, but not entirely straightforward to do, as the stream is a template class and the internal memory management system is intentionally strongly decoupled from the rest of the library.
  2. Simulate the memcpy with a "no-op" transform_reduce_index to let thrust do the work for us (see the rough sketch below): more of a workaround, and possibly inefficient in terms of performance.
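
A minimal sketch of the idea behind option 2, written with plain thrust::transform_reduce over a single counting index purely for illustration (an actual implementation would presumably go through stdgpu's transform_reduce_index instead); the load_on_stream helper and its stdgpu::atomic<int> counter are hypothetical:

#include <stdgpu/atomic.cuh>
#include <stdgpu/cstddef.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/transform_reduce.h>

// Reads the atomic on the device, so no implicit memcpy is issued on the
// default stream; thrust performs the launch and result transfer on `stream`.
struct ReadCounter {
  stdgpu::atomic<int> counter;

  __device__ int operator()(const stdgpu::index_t /*i*/) const {
    return counter.load();
  }
};

int load_on_stream(const stdgpu::atomic<int>& counter, cudaStream_t stream) {
  // A "no-op" reduction over a single index: the transform just loads the value.
  return thrust::transform_reduce(thrust::cuda::par.on(stream),
                                  thrust::counting_iterator<stdgpu::index_t>(0),
                                  thrust::counting_iterator<stdgpu::index_t>(1),
                                  ReadCounter{counter},
                                  0,
                                  thrust::plus<int>());
}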