The current solutions use 1D thread indexing to compute the distance matrix. They avoid redundant work, but the index handling is somewhat hard-coded, which makes the code more complex than the sequential version and may confuse beginners. Computing a distance matrix is inherently a two-dimensional problem, so mapping the nested for-loops to the X and Y axes of the CUDA thread index solves it directly. I suggest keeping the solutions simple because the readers are beginners.
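As a rough illustration of the 2D mapping described above, here is a minimal sketch, not the exact code in this PR. It assumes the points are 2D and stored in separate `x` and `y` arrays, that the distance is Euclidean, and that the matrix is stored row-major. The names `distance_matrix_kernel` and `distance_matrix_gpu` and the 16x16 block size are hypothetical. Note that this sketch computes every entry, including the symmetric ones, and omits CUDA error checking for brevity.

```cuda
#include <assert.h>
#include <math.h>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per (i, j) entry of the n x n matrix.
// The Y axis of the thread index replaces the outer loop (row i),
// the X axis replaces the inner loop (column j).
__global__ void distance_matrix_kernel(const float *x, const float *y,
                                       float *dist, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // outer loop index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // inner loop index

    // Threads that fall outside the matrix do nothing.
    if (i < n && j < n) {
        float dx = x[i] - x[j];
        float dy = y[i] - y[j];
        dist[i * n + j] = sqrtf(dx * dx + dy * dy);
    }
}

// Hypothetical host wrapper (cudaMalloc style). Error checks are omitted.
void distance_matrix_gpu(const float *h_x, const float *h_y,
                         float *h_dist, int n)
{
    assert(n > 0);
    size_t vec_bytes = (size_t)n * sizeof(float);
    size_t mat_bytes = (size_t)n * n * sizeof(float);

    float *d_x, *d_y, *d_dist;
    cudaMalloc((void **)&d_x, vec_bytes);
    cudaMalloc((void **)&d_y, vec_bytes);
    cudaMalloc((void **)&d_dist, mat_bytes);
    cudaMemcpy(d_x, h_x, vec_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, vec_bytes, cudaMemcpyHostToDevice);

    // A single launch covers the whole matrix; round the grid up so
    // that sizes not divisible by the block size are still covered.
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    distance_matrix_kernel<<<grid, block>>>(d_x, d_y, d_dist, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_dist, d_dist, mat_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_dist);
}
```

The kernel body is essentially the body of the sequential nested loop, with `i` and `j` taken from the thread index instead of loop counters, which is why I think it is easier for beginners to follow.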
The proposal is also faster than the current solutions: it launches a single kernel, whereas the current solutions launch a kernel multiple times.
| Version | Current solution | Proposal |
| --- | --- | --- |
| cudaMalloc | 19.6 ms | 17.8 ms |
| CUDA Unified Memory | 20.9 ms | 19.9 ms |
Test device: Tesla V100
Unify indentation to 4 spaces. In the original version, some lines are indented with tabs and others with spaces, which makes the code difficult to read.
cudaMalloc version
CUDA Unified Memory version