pytorch / torchdistx

Torch Distributed Experimental
BSD 3-Clause "New" or "Revised" License

GossipGraD implementation #48

Closed aovladi closed 2 years ago

aovladi commented 2 years ago

This PR introduces an implementation of GossipGraD, a gossip communication protocol for large-scale training.
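For readers unfamiliar with the scheme: at each iteration every worker exchanges (locally reduced) gradients with a small number of peers instead of participating in a global allreduce. Below is a minimal sketch of one such pairwise exchange written against plain `torch.distributed` primitives; it only illustrates the general idea and is not the code added by this PR (the function name `gossip_step` and the flat gradient buffer are assumptions).

```python
# Illustrative sketch of a single gossip exchange, not the PR's implementation.
# Assumes an initialized process group and a flattened gradient buffer.
import torch
import torch.distributed as dist


def gossip_step(grads: torch.Tensor, send_to: int, recv_from: int) -> torch.Tensor:
    """Send `grads` to `send_to`, receive the peer's buffer from `recv_from`,
    and return the pairwise average (for CUBE, send_to == recv_from)."""
    recv_buf = torch.empty_like(grads)
    ops = [
        dist.P2POp(dist.isend, grads, send_to),
        dist.P2POp(dist.irecv, recv_buf, recv_from),
    ]
    # Batch the point-to-point ops so matching sends/receives cannot deadlock.
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return (grads + recv_buf) / 2
```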

The API consists of 3 things:

Tests

  1. test_gossip_grad_state_init makes sure the state initializes properly and all assertions are raised, if necessary.

  2. test_gossip_grad_communication_dissimination tests the DISSEMINATION topology in the following setting (see the partner-selection sketch after this list).

    • It requires at least 6 workers.
    • Existing workers are split into groups of 2; each such group is a local subgroup, which uses intra-node communication to reduce gradients before the gossiping stage.
    • The virtual topology is fixed to [0, 2, 4, ...] for ease of gradient estimation.
    • Expected gradients are computed for rank 0 (assuming intra-node communication, thus checking that allreduce works).
    • I make sure that ranks 0 and 1 have the estimated gradients (this step also checks that broadcasting works).
    • Make sure that the last node has different gradients.
  3. test_gossip_grad_communication_cube tests the CUBE topology in the following setting (see the partner-selection sketch after this list).

    • It requires at least 6 workers.
    • Existing workers are split into groups of 1; each such group is a local subgroup. In this setting no intra-node communication should happen.
    • The virtual topology is fixed to [0, 1, 2, ...] for ease of gradient estimation.
    • Expected gradients are computed for rank 0 (assuming no intra-node communication, thus checking that allreduce does not reduce any outside grads).
    • I make sure that rank 0 and its peer have the same estimated gradients (this step also checks that broadcasting works). This is because in the CUBE scenario peers receive gradients from and send gradients to the same peer.
    • Make sure that the last node has different gradients.
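For reference, the way I read the two virtual topologies exercised by the tests above can be written as plain partner-selection functions. This is an illustrative sketch based on the dissemination and cube schedules described in the GossipGraD paper, not necessarily the exact indexing used in this PR:

```python
# Illustrative partner schedules for n gossip nodes over log2(n) steps; my
# reading of the dissemination/cube designs, not necessarily this PR's code.
import math


def dissemination_peers(rank: int, n: int, step: int) -> tuple[int, int]:
    """Dissemination: at each step a node sends to and receives from *different* peers."""
    offset = 2 ** step
    return (rank + offset) % n, (rank - offset) % n  # (send_to, recv_from)


def cube_peer(rank: int, n: int, step: int) -> int:
    """Cube: at each step a node exchanges with the *same* partner (hypercube edge).
    Assumes n is a power of two."""
    return rank ^ (2 ** step)


if __name__ == "__main__":
    n = 8
    for step in range(int(math.log2(n))):
        for rank in range(n):
            send_to, recv_from = dissemination_peers(rank, n, step)
            print(f"step {step} rank {rank}: dissemination send->{send_to} "
                  f"recv<-{recv_from}, cube partner {cube_peer(rank, n, step)}")
```

Printing the schedules for 8 ranks makes the behaviour the two tests check visible: the cube schedule is symmetric (each rank's partner sends back to it), while the dissemination schedule rotates peers every step.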

Checklist:

aovladi commented 2 years ago

Preliminary results for a 1-node, 8-GPU environment, T5 trained for 20 epochs. For GossipGraD, every GPU is treated as a separate node, and gossiping happens between 8 workers according to the algorithm's strategy. We can see that GossipGraD has a higher loss, which is understandable given the gossiping nature of the algorithm. The Dissemination topology seems to outperform the Cube topology.

| epoch | T5 baseline train loss | T5 baseline val loss | T5 cube train loss | T5 cube val loss | T5 dissemination train loss | T5 dissemination val loss |
| -- | -- | -- | -- | -- | -- | -- |
| 0 | 0.5319749713 | 0.07172757387 | 0.6147390008 | 0.07662624121 | 0.6097737551 | 0.07621511072 |
| 1 | 0.1037653387 | 0.03629698977 | 0.1144521311 | 0.04943695292 | 0.1163704321 | 0.05514102802 |
| 2 | 0.06340692937 | 0.02167312987 | 0.08212178946 | 0.02221572399 | 0.08911637962 | 0.02756368183 |
| 3 | 0.04630573094 | 0.01836876199 | 0.0566239953 | 0.01650846936 | 0.06193618476 | 0.02008434385 |
| 4 | 0.03806036711 | 0.01525175478 | 0.04499671236 | 0.01453202963 | 0.04938421026 | 0.01822413877 |
| 5 | 0.03251246363 | 0.01388591155 | 0.03816849738 | 0.01304518431 | 0.04330118373 | 0.01699820161 |
| 6 | 0.02878393233 | 0.01205849927 | 0.03363600001 | 0.01241523772 | 0.03741420433 | 0.01216574945 |
| 7 | 0.02600315027 | 0.01108295657 | 0.03085623123 | 0.01179173682 | 0.0310794618 | 0.01164832618 |
| 8 | 0.02370029502 | 0.01042199228 | 0.02827030048 | 0.01127741206 | 0.02809124999 | 0.01102655008 |
| 9 | 0.02200809307 | 0.01015946642 | 0.02633653581 | 0.01098989747 | 0.02600308321 | 0.01071218494 |
| 10 | 0.02062393539 | 0.009774439579 | 0.02475599572 | 0.01051200833 | 0.02438269742 | 0.01029212866 |
| 11 | 0.01939031854 | 0.009254269302 | 0.02343227156 | 0.01026331633 | 0.02303122357 | 0.009988318197 |
| 12 | 0.01839116402 | 0.009192653 | 0.02236232162 | 0.009947494604 | 0.02220279723 | 0.009867290966 |
| 13 | 0.01733924262 | 0.008881538175 | 0.0214888379 | 0.009735754691 | 0.02118554525 | 0.00957825128 |
| 14 | 0.01642912254 | 0.008818015456 | 0.0206311401 | 0.009749974124 | 0.0203171894 | 0.009568738751 |
| 15 | 0.01560532115 | 0.008875381202 | 0.01979763433 | 0.009525840171 | 0.01953843422 | 0.009297306649 |
| 16 | 0.01491208933 | 0.008848587982 | 0.01913302578 | 0.009336958639 | 0.01887300052 | 0.009192465805 |
| 17 | 0.01418704353 | 0.008672780357 | 0.01849162206 | 0.009204210714 | 0.01824515872 | 0.009089938365 |
| 18 | 0.01356204133 | 0.008460407145 | 0.01797264628 | 0.009171646088 | 0.01768364757 | 0.008925585076 |
| 19 | 0.01294515282 | 0.008718481287 | 0.01748350076 | 0.009162009694 | 0.01724962145 | 0.008809076622 |
| best | 0.01294515282 | ***0.008460407145*** | 0.01748350076 | 0.009162009694 | 0.01724962145 | 0.008809076622 |
awgu commented 2 years ago

Sorry if you already explained this, but do we have a heuristic for when to use the CUBE or DISSEMINATION topology (assuming an even number of nodes)? Otherwise, it might be confusing to the user which one to choose.

aovladi commented 2 years ago

@awgu Thanks for the comment. I will add a note based on what the authors provide, but I would like to experiment with both of them first at a bigger scale and understand the advantages and drawbacks of each topology myself. Then I can properly explain the differences in the documentation. At this point, the DISSEMINATION topology seems to be preferred due to the variety of communication peers, i.e. a node sends gradients to and receives gradients from different peers, whereas in the CUBE topology every node sends to and receives from the same partner. Experiments in the paper are shown for the DISSEMINATION topology only, if I remember correctly, so I would like to see whether CUBE performs better in some settings. If that is the case, I will update the docs accordingly.

rohan-varma commented 2 years ago

@aovladi In parallel, could we also work on a multi-node test setup where we can see the performance gain of GossipGraD over all_reduce?
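As a starting point, a rough single-file microbenchmark along these lines could compare a full all_reduce against one pairwise gossip-style exchange of the same buffer. The script below is only a sketch using plain `torch.distributed` calls; the tensor size, file name, and launch command are assumptions, and a real comparison should measure end-to-end training throughput rather than isolated collectives.

```python
# Rough microbenchmark sketch (not part of this PR): time a global all_reduce
# versus one pairwise gossip-style exchange of the same buffer.
# Example launch (assumed file name): torchrun --nproc_per_node=8 bench_gossip_vs_allreduce.py
import time

import torch
import torch.distributed as dist


def timed(fn, iters: int = 20) -> float:
    """Average wall-clock seconds per call, synchronized across ranks."""
    dist.barrier()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    dist.barrier()
    return (time.perf_counter() - start) / iters


def main() -> None:
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    assert world % 2 == 0, "this sketch pairs ranks, so it assumes an even world size"
    torch.cuda.set_device(rank % torch.cuda.device_count())

    grads = torch.randn(50_000_000, device="cuda")  # ~200 MB of fp32 "gradients"
    recv_buf = torch.empty_like(grads)
    partner = rank ^ 1  # fixed pairwise partner, a stand-in for a gossip peer

    def do_allreduce() -> None:
        dist.all_reduce(grads.clone())

    def do_gossip() -> None:
        ops = [
            dist.P2POp(dist.isend, grads, partner),
            dist.P2POp(dist.irecv, recv_buf, partner),
        ]
        for req in dist.batch_isend_irecv(ops):
            req.wait()

    t_allreduce = timed(do_allreduce)
    t_gossip = timed(do_gossip)
    if rank == 0:
        print(f"all_reduce:      {t_allreduce * 1e3:.2f} ms/iter")
        print(f"gossip exchange: {t_gossip * 1e3:.2f} ms/iter")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```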