Improve solution for CUDA C

The current solutions use 1D thread indexing to compute the distance matrix and avoid redundant work, but they rely on hard-coded logic. This makes the code more complex than the sequential version and may confuse beginners. Computing the distance matrix is a multidimensional problem, so mapping the nested for-loop onto the X and Y axes of the CUDA thread index solves it simply. I suggest keeping the solutions simple because the readers are beginners.
The proposal is also faster than the current solutions: it invokes only one kernel, whereas the current solutions invoke a kernel multiple times.
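
To illustrate the mapping described above, here is a minimal sketch, not the exact code in this PR. It assumes 2D points stored in two float arrays and a row-major n x n output; the names distance_matrix_kernel and compute_distance_matrix, the 16x16 block size, and the data layout are all hypothetical. For simplicity the sketch computes every entry, including the symmetric half.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <stddef.h>

// Hypothetical kernel: one thread per (row, col) entry of the n x n matrix.
// The two thread-index axes replace the two loops of the sequential version.
__global__ void distance_matrix_kernel(const float *x, const float *y,
                                       float *dist, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // inner loop index j
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // outer loop index i

    // Threads that fall outside the matrix do nothing.
    if (row < n && col < n) {
        float dx = x[row] - x[col];
        float dy = y[row] - y[col];
        dist[(size_t)row * n + col] = sqrtf(dx * dx + dy * dy);
    }
}

// Hypothetical host wrapper: copies inputs, launches the kernel once,
// and copies the result back. Returns the first CUDA error encountered.
cudaError_t compute_distance_matrix(const float *h_x, const float *h_y,
                                    float *h_dist, int n)
{
    float *d_x = NULL, *d_y = NULL, *d_dist = NULL;
    size_t vec_bytes = (size_t)n * sizeof(float);
    size_t mat_bytes = (size_t)n * n * sizeof(float);

    cudaError_t err = cudaMalloc((void **)&d_x, vec_bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_y, vec_bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_dist, mat_bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_x, h_x, vec_bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_y, h_y, vec_bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        // Assumed block size; round the grid up so it covers all n x n entries.
        dim3 block(16, 16);
        dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
        distance_matrix_kernel<<<grid, block>>>(d_x, d_y, d_dist, n);
        err = cudaGetLastError();  // reports launch failures
    }

    // This blocking copy also waits for the kernel to finish.
    if (err == cudaSuccess)
        err = cudaMemcpy(h_dist, d_dist, mat_bytes, cudaMemcpyDeviceToHost);

    // cudaFree(NULL) is a no-op, so unconditional cleanup is safe.
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_dist);
    return err;
}
```

The bounds check in the kernel is what allows the grid to be rounded up to a multiple of the block size, and the whole matrix is produced by a single kernel launch.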

current solution	proposal
19.6 ms	17.8 ms

current solution	proposal
20.9 ms	19.9 ms

Test device: Tesla V100

Unify indentation to 4 spaces. In the original version, some lines use tabs and others use spaces, which makes the code difficult to read.
