Closed aovladi closed 2 years ago
Preliminary results for a 1-node, 8-GPU environment, T5 trained for 20 epochs. For GossipGraD, every GPU is treated as a separate node, and gossiping happens between 8 workers according to the algorithm's strategy. We can see that GossipGraD has a higher loss, which is understandable given the gossiping nature of the algorithm. The Dissemination topology seems to outperform the Cube topology.
Sorry if you already explained this, but do we have a heuristic for when to use the `CUBE` or `DISSEMINATION` topology (assuming an even number of nodes)? Otherwise, it might be confusing to the user which one to choose.
@awgu Thanks for the comment. I will add a note based on what the authors provide, but I would like to experiment with both of them first on a bigger scale and understand the advantages and drawbacks of each topology myself. Then I can properly explain the differences in the documentation. At this point, the `DISSEMINATION` topology seems to be preferred due to the variety of communication peers, i.e. a node sends gradients to and receives gradients from different peers, whereas with the `CUBE` topology every node sends to and receives from the same partner. Experiments in the paper are shown for the `DISSEMINATION` topology only, if I remember correctly, so I would like to see whether for some settings `CUBE` performs better. If this is the case, I will update the docs accordingly.
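To make the peer structure concrete, here is a minimal pure-Python sketch of per-step peer selection for a power-of-two world size. This is an illustration based on the paper's description, not the PR's actual code; the function names are made up for the example.

```python
import math

def cube_peer(rank: int, step: int, world_size: int) -> int:
    # CUBE: hypercube-style exchange; at step t partners differ in bit t
    # of their rank, so a rank sends to and receives from the SAME peer.
    dim = int(math.log2(world_size))
    return rank ^ (1 << (step % dim))

def dissemination_peers(rank: int, step: int, world_size: int):
    # DISSEMINATION: at step t a rank sends to (rank + 2**t) % N and
    # receives from (rank - 2**t) % N, i.e. two DIFFERENT peers.
    offset = 1 << (step % int(math.log2(world_size)))
    return (rank + offset) % world_size, (rank - offset) % world_size
```

For 8 workers, rank 0's `CUBE` partner cycles through 1, 2, 4 (the same peer for both send and receive each step), while in `DISSEMINATION` rank 0 sends to 1 but receives from 7 at step 0, which is the "variety of communication peers" mentioned above.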
@aovladi In parallel, could we also work on a multi-node test setup where we can see the performance gain of GossipGraD over `all_reduce`?
This PR introduces an implementation of GossipGraD - a gossip communication protocol for large-scale training.

The API consists of 3 things:

- `Topology` specifies which topology will be used as the base for gradient communication: either `CUBE` or `DISSEMINATION`. Current limitations: `CUBE` doesn't support an uneven number of nodes. This comes from the `torch.distributed.distributed_c10d.batch_isend_irecv` requirement that all processes participate in the first communication. With `CUBE` + an uneven number of nodes, one of the nodes won't have a partner for the first communication, so communication is impossible.
- `GossipGraDState` stores the state needed to perform the GossipGraD algorithm within a communication hook.
- `gossip_grad_hook` - the communication hook itself.

Tests:
- `test_gossip_grad_state_init` makes sure the state initializes properly and all assertions are raised, if necessary.
- `test_gossip_grad_communication_dissimination` tests the `DISSEMINATION` topology in the following setting: the `virtual topology` is fixed to `[0, 2, 4, ...]` for ease of gradient estimation (`allreduce` works); ranks `0` and `1` have estimated gradients (this step also checks that broadcasting works).
- `test_gossip_grad_communication_cube` tests the `CUBE` topology in the following setting: the `virtual topology` is fixed to `[0, 1, 2, ...]` for ease of gradient estimation (`allreduce` is not reducing any outside grads); rank `0` and its `computation peer` have the same estimated gradients (this step also checks that broadcasting works). This is because in the `CUBE` scenario peers receive gradients from and send gradients to the same peer.

Check list:
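The invariant the `CUBE` test relies on can be sketched in plain Python (an illustration assuming simple pairwise gradient averaging, not the PR's actual hook code): after one gossip step, a rank and its computation peer hold identical averaged gradients, because they exchanged with each other.

```python
def gossip_step_cube(grads, step):
    # One CUBE gossip step: each rank averages its gradient with the
    # partner that differs in bit `step` of its rank (pairwise exchange).
    world_size = len(grads)
    return [(grads[r] + grads[r ^ (1 << step)]) / 2 for r in range(world_size)]

grads = [float(r) for r in range(8)]   # one scalar "gradient" per rank
after = gossip_step_cube(grads, 0)
# Rank 0 and its step-0 computation peer (rank 1) now agree:
assert after[0] == after[1]
```

Under `DISSEMINATION`, by contrast, a rank's send and receive partners differ, so no such pairwise equality holds after a single step, which is why the two tests check different post-conditions.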