owensgroup / BGHT

BGHT: High-performance static GPU hash tables.
https://owensgroup.github.io/BGHT/
Apache License 2.0
55 stars 8 forks source link

Why the kernel of BCHT implemented as "tiled_find_kernel" ? #4

Closed alphaRGB closed 2 years ago

alphaRGB commented 2 years ago

Hello, I have found that there are two kinds find kernels:

__global__ void tiled_find_kernel(InputIt first,
                                  InputIt last,
                                  OutputIt output_begin,
                                  HashMap map) 
__global__ void find_kernel(InputIt first,
                            InputIt last,
                            OutputIt output_begin,
                            HashMap map) 

this Kernel find_kernel is quit easy to understand, but tiled_find_kernel seams little hard to understand. The bucketed cuckoo hash use tiled_find_kernel, I am fused why BCHT needs tile, what's the meaning of these code of tiled_find_kernel ? Could you please help me explain the meaning of it? Thank you in advance


  auto work_queue = tile.ballot(do_op);
  while (work_queue) {
    auto cur_rank = __ffs(work_queue) - 1;
    auto cur_key = tile.shfl(find_key, cur_rank);

    typename HashMap::mapped_type find_result = map.find(cur_key, tile);

    if (tile.thread_rank() == cur_rank) {
      result = find_result;
      do_op = false;
    }
    work_queue = tile.ballot(do_op);
  }
maawad commented 2 years ago

Hi @alphaRGB, Bucketed cuckoo hash table (BCHT) insertion and find operations are performed in a tile-wide fashion. The two main goals are 1) achieving a coalesced memory access when reading the bucket, 2) avoiding branch divergence between threads. The code snippet that you showed builds a queue (within a tile) of find operations then performs each operation one at a time.

  auto work_queue = tile.ballot(do_op);  // build a queue bitmap of all threads where `do_op = true`
  while (work_queue) { // while there are items in the queue
    auto cur_rank = __ffs(work_queue) - 1; // find the first thread rank in the queue
    auto cur_key = tile.shfl(find_key, cur_rank);  // all tiled bucketed hash table operations expect the same key for all threads in  a tile
    typename HashMap::mapped_type find_result = map.find(cur_key, tile); // every thread in the tile calls find

    if (tile.thread_rank() == cur_rank) { 
      result = find_result; //  thread (in the tile) result
      do_op = false; // pop operation
    }
    work_queue = tile.ballot(do_op); // rebuild the bitmap queue
  }

Here is where we read a bucket from memory and where the coalesced memory access is ( the tile size is the same as bucket_size).

The same idea is used for all bucketed hash tables. find_kernel and insert_kernel are only used by 1CHT (cuckoo hash table with a bucket size of one). Coalescing memory accesses is one of the main reasons why BCHT is much faster than 1CHT.

Below is a picture that also explains the code snippet for insertion (same for find) (see slides 36-42 for explanation with pictures): image

Let me know if you have more questions!