openhackathons-org / gpubootcamp

This repository consists of GPU bootcamp material for HPC and AI.
Apache License 2.0

Improve solution for CUDA C #36

Closed: jeng1220 closed this issue 3 years ago

jeng1220 commented 3 years ago
1. The current solutions use a 1D thread layout to calculate the distance matrix and avoid redundant work, but the indexing is somewhat hard-coded. This makes the code considerably more complex than the sequential version, which can confuse beginners. The computation is inherently multidimensional: mapping the nested for-loops to the X and Y axes of the CUDA thread index solves it simply. I suggest keeping the solutions simple, because the readers are beginners.
2. The proposal is also faster than the current solutions. It launches only one kernel, whereas the current solutions launch a kernel multiple times.
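A minimal CUDA sketch of the proposed 2D mapping follows. It assumes a hypothetical layout of `n` points stored as interleaved x, y, z floats; the bootcamp's actual data structures are not shown in this issue, and `pointDistance`, `distanceMatrixCPU`, `distanceMatrix2D` and `computeDistanceMatrix` are illustrative names. The outer loop index of the sequential version becomes the Y thread coordinate and the inner loop index becomes the X coordinate, so a single kernel launch covers the whole matrix.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Euclidean distance between points a and b; callable on host and device.
// Hypothetical layout: pts holds n points as interleaved x, y, z floats.
__host__ __device__ inline float pointDistance(const float* pts, int a, int b)
{
    float dx = pts[3 * a]     - pts[3 * b];
    float dy = pts[3 * a + 1] - pts[3 * b + 1];
    float dz = pts[3 * a + 2] - pts[3 * b + 2];
    return sqrtf(dx * dx + dy * dy + dz * dz);
}

// Sequential reference: the nested loop that the kernel mirrors.
void distanceMatrixCPU(const float* pts, float* dist, int n)
{
    for (int row = 0; row < n; ++row)          // outer loop -> Y axis on the GPU
        for (int col = 0; col < n; ++col)      // inner loop -> X axis on the GPU
            dist[row * n + col] = pointDistance(pts, row, col);
}

// 2D kernel: each thread handles one (row, col) iteration of the nested loop.
__global__ void distanceMatrix2D(const float* pts, float* dist, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n)                    // skip threads beyond the matrix edge
        dist[row * n + col] = pointDistance(pts, row, col);
}

// cudaMalloc-style host wrapper: copy in, launch once, copy out.
// Returns the last CUDA runtime error observed, or cudaSuccess.
cudaError_t computeDistanceMatrix(const float* h_pts, float* h_dist, int n)
{
    float *d_pts = NULL, *d_dist = NULL;
    size_t ptsBytes  = 3 * (size_t)n * sizeof(float);
    size_t distBytes = (size_t)n * n * sizeof(float);

    cudaMalloc(&d_pts, ptsBytes);
    cudaMalloc(&d_dist, distBytes);
    cudaMemcpy(d_pts, h_pts, ptsBytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                        // 256 threads per block
    dim3 grid((n + block.x - 1) / block.x,     // enough blocks to cover all columns
              (n + block.y - 1) / block.y);    // and all rows
    distanceMatrix2D<<<grid, block>>>(d_pts, d_dist, n);  // one launch for the whole matrix

    // A blocking copy on the default stream also waits for the kernel to finish.
    cudaMemcpy(h_dist, d_dist, distBytes, cudaMemcpyDeviceToHost);
    cudaError_t err = cudaGetLastError();

    cudaFree(d_pts);
    cudaFree(d_dist);
    return err;
}
```

Calling `computeDistanceMatrix(pts, dist, n)` fills all n×n entries. Unlike the 1D solutions, this computes each symmetric pair twice, (i, j) and (j, i); the trade-off is indexing that reads like the sequential loop, plus a single launch.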

| Version | Current solution | Proposal |
|---|---|---|
| cudaMalloc | 19.6 ms | 17.8 ms |
| CUDA Unified Memory | 20.9 ms | 19.9 ms |

Test device: Tesla V100
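The issue does not say how these times were measured or what its Unified Memory code looks like. As a hedged sketch, the variant below allocates with `cudaMallocManaged`, so no explicit `cudaMemcpy` is needed, and times only the kernel launch with CUDA events. The kernel and function names, the 16×16 block size, and the interleaved x, y, z point layout are assumptions for illustration.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical 2D kernel: one thread per (row, col) entry of the n x n matrix;
// pts holds n points as interleaved x, y, z floats.
__global__ void distanceMatrix2D(const float* pts, float* dist, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n) {
        float dx = pts[3 * row]     - pts[3 * col];
        float dy = pts[3 * row + 1] - pts[3 * col + 1];
        float dz = pts[3 * row + 2] - pts[3 * col + 2];
        dist[row * n + col] = sqrtf(dx * dx + dy * dy + dz * dz);
    }
}

// Unified Memory variant: host and device share the same pointers.
// Writes the matrix into h_dist and the kernel time in milliseconds into *elapsedMs.
cudaError_t distanceMatrixManaged(const float* h_pts, float* h_dist, int n,
                                  float* elapsedMs)
{
    float *pts = NULL, *dist = NULL;
    size_t ptsCount  = 3 * (size_t)n;
    size_t distCount = (size_t)n * n;

    cudaError_t err = cudaMallocManaged(&pts, ptsCount * sizeof(float));
    if (err != cudaSuccess) return err;
    err = cudaMallocManaged(&dist, distCount * sizeof(float));
    if (err != cudaSuccess) { cudaFree(pts); return err; }
    for (size_t i = 0; i < ptsCount; ++i) pts[i] = h_pts[i];  // host writes managed memory directly

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    cudaEventRecord(start);                    // recorded on the default stream
    distanceMatrix2D<<<grid, block>>>(pts, dist, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                // wait until the stop event completes
    cudaEventElapsedTime(elapsedMs, start, stop);

    cudaDeviceSynchronize();                   // GPU idle before the host reads managed memory
    for (size_t i = 0; i < distCount; ++i) h_dist[i] = dist[i];
    err = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(pts);
    cudaFree(dist);
    return err;
}
```

Times from this sketch will not necessarily match the table: with Unified Memory on Pascal and later GPUs, on-demand page migration on first GPU access may be counted inside the timed kernel.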

3. Unify indentation to 4 spaces. In the original version, some lines use tabs and others use spaces, which makes the code difficult to read.