owensgroup / BGHT

BGHT: High-performance static GPU hash tables.
https://owensgroup.github.io/BGHT/
Apache License 2.0

Discussion on the parallelism of cooperative_group #17

Closed ninixing closed 1 year ago

ninixing commented 1 year ago

Hello, I am a bit confused about the code implementation. We use the GPU to take advantage of its parallelism to speed up the find and insert operations.

Take the find operation as an example: cooperative groups are used in the implementation, so a single key is searched in parallel by 16 threads, while the 16 keys held by a tile are actually searched one after another. So I think using a tile to find one key wastes GPU parallel resources: in fact, finding a key with a single thread would not be very slow, and when we use 16 threads to find one key, effectively only 1 thread's worth of work is needed while the other 15 threads do the same thing.

Therefore, from my point of view, if each thread were designed to look up a different key, we could make full use of the GPU's parallelism and speed up the lookup operation even more. What do you think?

Looking forward to your reply, thank you in advance!

maawad commented 1 year ago

Hi @ninixing, it is better to think about the memory access patterns to understand our design decision here. To fully utilize the DRAM bandwidth, we have to read data from global memory in a coalesced manner; in the case of $b=16$ and 8-byte key-value pairs, that means we read an entire 128-byte cache line from DRAM every time we read a bucket. Optimizing memory access patterns is the primary motivation for the per-thread assignment, per-tile processing approach (each thread is assigned one query, but the whole tile cooperates on each lookup).
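
To make the access pattern concrete, here is a minimal sketch (not the exact BGHT code; the slot type and tile size below are placeholders) of a tile of 16 threads reading one bucket: thread `i` loads slot `i`, so the 16 consecutive 8-byte loads fall in the same 128-byte cache line and coalesce into a single memory transaction.

```cpp
// Minimal sketch of a coalesced bucket read by a tile of 16 threads
// (illustration only, not the exact BGHT implementation).
#include <cooperative_groups.h>
#include <cstddef>
#include <cstdint>

namespace cg = cooperative_groups;

// Placeholder 8-byte slot type: 4-byte key + 4-byte value.
struct slot_pair {
  uint32_t key;
  uint32_t value;
};

__device__ slot_pair read_my_slot(const slot_pair* table,
                                  std::size_t bucket_id,
                                  const cg::thread_block_tile<16>& tile) {
  // Thread i of the tile reads slot i of the bucket. The 16 consecutive
  // 8-byte loads cover one 128-byte cache line, i.e., a single coalesced
  // memory transaction for the whole bucket.
  return table[bucket_id * 16 + tile.thread_rank()];
}
```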

Please also see my comment and the discussion in this issue.

I hope this clarifies things. Let us know if you have more questions!

ninixing commented 1 year ago

Thanks very much for your reply. So with the tile design, each thread reads one slot from a bucket, and the 16 threads together read the whole bucket. For find/insert operations, my current understanding is:

  1. Each of the 16 threads in the tile calculates the bucket location of the key to be queried, key_x (in fact, all 16 threads do the same computation).
  2. The 16 threads read their corresponding slots (a coalesced memory access).
  3. The 16 threads compare key_x with their corresponding slots and return the result (the comparison across the 16 threads is parallel).
  4. The result is broadcast to the other 15 threads, and the tile loops to the next query. Following this process (1-4), the remaining 15 key queries are executed one by one, as in the sketch below.
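
To make sure I follow, here is roughly how I picture this loop in code (a simplified sketch of my own understanding, with a placeholder hash function; not the actual BGHT implementation):

```cpp
// Rough sketch of the per-tile find loop as I understand it (not BGHT's code).
// Each of the 16 threads owns one query; the queries are processed one by one.
#include <cooperative_groups.h>
#include <cstddef>
#include <cstdint>

namespace cg = cooperative_groups;

struct slot_pair {
  uint32_t key;
  uint32_t value;
};

// Placeholder hash, just for illustration.
__device__ std::size_t bucket_of(uint32_t key, std::size_t num_buckets) {
  return (key * 2654435769u) % num_buckets;
}

__device__ uint32_t tile_find(const slot_pair* table, std::size_t num_buckets,
                              uint32_t my_key, uint32_t not_found_value,
                              const cg::thread_block_tile<16>& tile) {
  uint32_t my_result = not_found_value;
  for (int q = 0; q < 16; ++q) {
    // Step 1: broadcast query q's key; all 16 threads compute the same bucket.
    uint32_t key_x = tile.shfl(my_key, q);
    std::size_t bucket_id = bucket_of(key_x, num_buckets);
    // Step 2: coalesced read, one slot per thread.
    slot_pair slot = table[bucket_id * 16 + tile.thread_rank()];
    // Step 3: all 16 threads compare key_x with their slot in parallel.
    unsigned int found = tile.ballot(slot.key == key_x);
    // Step 4: the matching slot's value is broadcast; the owner of query q
    // keeps it, and the loop moves on to the next query.
    if (found != 0) {
      uint32_t value = tile.shfl(slot.value, __ffs(found) - 1);
      if (tile.thread_rank() == q) my_result = value;
    }
    // (Probing to another bucket when the key is absent is omitted here.)
  }
  return my_result;
}
```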

Is my understanding correct? Please feel free to point out if there is any problem. Thank you!

maawad commented 1 year ago

Yes, your understanding is correct. I should add that for insertion, step 3 becomes:

  3. The 16 threads compare the sentinel key to the key in their slot. A sentinel key indicates that the slot is empty, and by making that comparison we find the offset of the next empty location in the bucket. Once we find the empty location, one of the 16 threads atomically sets (i.e., inserts) the new key-value pair.

If the atomic operation fails, we try again. If the bucket is full, then we follow a probing scheme to determine the next bucket we should attempt insertion in. The probing discussion is a longer one, but I can just refer you to specific sections in the paper if you're interested.
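
For reference, here is a rough sketch of one bucket-insertion attempt along those lines (simplified and not the exact BGHT implementation; it assumes 4-byte keys and values packed into a 64-bit word so a pair can be set with a single 64-bit atomicCAS):

```cpp
// Simplified sketch of one bucket-insertion attempt (not the exact BGHT code).
// Assumes 4-byte keys/values packed into a 64-bit slot: value in the high
// 32 bits, key in the low 32 bits.
#include <cooperative_groups.h>
#include <cstdint>

namespace cg = cooperative_groups;

constexpr uint32_t sentinel_key = 0xFFFFFFFFu;  // marks an empty slot

__device__ bool try_insert_in_bucket(unsigned long long* bucket,  // 16 packed slots
                                     uint32_t key, uint32_t value,
                                     const cg::thread_block_tile<16>& tile) {
  unsigned long long desired =
      (static_cast<unsigned long long>(value) << 32) | key;
  while (true) {
    // Each thread reads one slot (coalesced) and checks it for the sentinel key.
    unsigned long long slot = bucket[tile.thread_rank()];
    unsigned int empty = tile.ballot(static_cast<uint32_t>(slot) == sentinel_key);
    if (empty == 0) {
      return false;  // bucket is full: the caller probes the next bucket
    }
    // The owner of the first empty slot attempts the atomic insertion.
    int leader = __ffs(empty) - 1;
    int success = 0;
    if (tile.thread_rank() == leader) {
      // If another tile changed this slot since we read it, the CAS fails
      // and we retry with a fresh read of the bucket.
      success = (atomicCAS(&bucket[leader], slot, desired) == slot);
    }
    if (tile.shfl(success, leader)) {
      return true;
    }
  }
}
```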