I wanted to get a better understanding of the argument --num_ranks_in_servers in the image classification runtime referenced here. I had the following questions:
Should it be equal to the number of GPUs per node?
In the runtime README, I do not see any mention of this argument in the instructions for running the framework. From my understanding, assigning the correct value to this argument is important for enabling peer-to-peer communication among GPUs using gloo. Otherwise, PipeDream falls back to a suboptimal communication path that copies the data to the CPU and then sends it via gloo. Please let me know whether I should be using this argument to run the framework in its optimal configuration. If so, I'd suggest updating the README.