rapidsai / distributed-join

Compile and run script updates #11

Closed nsakharnykh closed 4 years ago

nsakharnykh commented 4 years ago

- Updated the Makefile to remove the env var; it must now be set outside the Makefile.
- Updated run_sample.sh to print the rank/GPU/CPU/HCA assignments, select the OMPI rank ID by default, and use the distributed join test by default.
- Posted the command line and the expected results for DGX-1V in the README.

gaohao95 commented 4 years ago

@nsakharnykh The changes look good to me. You can merge it as you see fit.

In our implementation, MPI is only used for exchanging UCX addresses. The communication in the join is handled directly by UCX APIs. Why would changing the MPI MCA parameters have a performance impact?

nsakharnykh commented 4 years ago

It's just that lrank in the script is captured incorrectly from SLURM_LOCALID when we run the script with mpirun instead (which is what I was primarily doing, and what the instructions I posted in the README use). I made the MPI rank ID the default and added the print, so we can see the assignment selected by the script right away. If we run with SLURM srun, we should use SLURM_LOCALID; we probably need to clarify that in the README.
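
A minimal sketch of the local-rank selection described above. This is not the actual run_sample.sh. The helper name `select_local_rank` is hypothetical, and the sketch assumes that "OMPI rank ID" means the `OMPI_COMM_WORLD_LOCAL_RANK` variable that Open MPI's mpirun exports to each process. The script may use a different variable.

```shell
#!/bin/bash
# Hypothetical helper (not from run_sample.sh): pick the node-local rank.
# Assumption: Open MPI's mpirun exports OMPI_COMM_WORLD_LOCAL_RANK,
# and SLURM's srun exports SLURM_LOCALID.
select_local_rank() {
    if [ -n "${OMPI_COMM_WORLD_LOCAL_RANK:-}" ]; then
        # Launched with mpirun: prefer the Open MPI value, even if a
        # SLURM_LOCALID is also present in the environment.
        echo "${OMPI_COMM_WORLD_LOCAL_RANK}"
    elif [ -n "${SLURM_LOCALID:-}" ]; then
        # Launched with srun: use the SLURM node-local task ID.
        echo "${SLURM_LOCALID}"
    else
        # Neither launcher detected: treat as a single local process.
        echo 0
    fi
}

lrank=$(select_local_rank)
# Print the selection so the assignment is visible right away.
echo "lrank=${lrank}"
```

Checking the Open MPI variable first means a `SLURM_LOCALID` present in the environment is ignored when the script runs under mpirun, while srun launches still fall through to `SLURM_LOCALID`.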