Open tanzby opened 1 month ago
@stotko help to review this PR, thanks
Attention: Patch coverage is 0%
with 4 lines
in your changes missing coverage. Please review.
Project coverage is 97.19%. Comparing base (
3b7d712
) to head (87fe7d7
).
Files | Patch % | Lines |
---|---|---|
src/stdgpu/impl/unordered_map_detail.cuh | 0.00% | 2 Missing :warning: |
src/stdgpu/impl/unordered_set_detail.cuh | 0.00% | 2 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Add
bool occupied(index n)
function for unordered_set and unordered_map. So that we can:struct AllocateNewBlocks { AllocateNewBlocks(stdgpu::unorderd_map<xx> block_map, BlockBuffer buffer_buffer, stdgpu::unorderd_set<xx> not_exist_block_indices) : buffer_buffer(buffer_buffer), block_map(block_map), not_exist_block_indices(not_exist_block_indices) {} __device__ void operator()(const stdgpu::index_t index) { if (!not_exist_block_indices.occupied(index)) { // Used here. return; } const BlockIndex block_index = *(not_exist_block_indices.begin() + index); if (const auto& [iter, is_inserted] = block_map.emplace(block_index, 0); is_inserted) { iter->second = buffer_view.AllocateBlock(); } } BlockBuffer buffer_buffer; stdgpu::unorderd_map<xx> block_map; stdgpu::unorderd_set<xx> not_exist_block_indices; }; stdgpu::for_each_index(thrust::cuda::par.on(stream()), not_exist_block_indices().max_size(), AllocateNewBlocks(block_map, block_buffer, not_exist_block_indices);
It skips the need for getting
device_range
.
Thanks for working on this. However, I believe that exposing the occupied
function is not the right way to move forward since this function really meant as an implementation detail of the base container. Even exposing begin()
, for symmetry with end()
(required for find()
), already gives more access to the internals than typically needed.
While the use case you mentioned is fine, it may suffer from bad performance since the load factor of unordered_map
/unordered_set
is typically low and thus the container is only sparsely filled, which would lead to a low thread utilization in your kernel where many threads of a warp immediately return in the if statement. That is the reason for having device_range()
as it allows to densely pack all occupied values.
As mentioned in #423, adding (host-only) overloads with an additional stream argument for the load()
and store()
function of atomic
would be better as it addresses the actual underlying problem. More concretely, I see two options here:
transform_reduce_index
to let thrust do the work for us: More like a workaround and might be inefficient in terms of performance.
Add
bool occupied(index n)
function for unordered_set and unordered_map. So that we can:It skips the need for getting
device_range
.