pytorch / torchdistx

Torch Distributed Experimental
BSD 3-Clause "New" or "Revised" License

GossipGraD implementation #48

Closed aovladi closed 2 years ago

aovladi commented 2 years ago

This PR introduces an implementation of GossipGraD, a gossip communication protocol for large-scale training.
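For readers unfamiliar with the scheme: at each iteration every worker exchanges (locally reduced) gradients with a small number of peers instead of participating in a global allreduce. Below is a minimal sketch of one such pairwise exchange written against plain `torch.distributed` primitives; it only illustrates the general idea and is not the code added by this PR (the function name `gossip_step` and the flat gradient buffer are assumptions).

```python
# Illustrative sketch of a single gossip exchange, not the PR's implementation.
# Assumes an initialized process group and a flattened gradient buffer.
import torch
import torch.distributed as dist


def gossip_step(grads: torch.Tensor, send_to: int, recv_from: int) -> torch.Tensor:
    """Send `grads` to `send_to`, receive the peer's buffer from `recv_from`,
    and return the pairwise average (for CUBE, send_to == recv_from)."""
    recv_buf = torch.empty_like(grads)
    ops = [
        dist.P2POp(dist.isend, grads, send_to),
        dist.P2POp(dist.irecv, recv_buf, recv_from),
    ]
    # Batch the point-to-point ops so matching sends/receives cannot deadlock.
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return (grads + recv_buf) / 2
```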

The API consists of 3 things:

Tests

  1. test_gossip_grad_state_init makes sure the state initializes properly and all assertions are raised, if necessary.

  2. test_gossip_grad_communication_dissimination tests the DISSEMINATION topology in the following setting (see the partner-selection sketch after this list).

    • It requires at least 6 workers.
    • Existing workers are split into groups of 2; each such group is a local subgroup, which uses intra-node communication to reduce gradients before the gossiping stage.
    • The virtual topology is fixed to [0, 2, 4, ...] for ease of gradient estimation.
    • Expected gradients are computed for rank 0 (assuming intra-node communication, thus checking that allreduce works).
    • I make sure that ranks 0 and 1 have the estimated gradients (this step also checks that broadcasting works).
    • Make sure that the last node has different gradients.
  3. test_gossip_grad_communication_cube tests the CUBE topology in the following setting (see the partner-selection sketch after this list).

    • It requires at least 6 workers.
    • Existing workers are split into groups of 1; each such group is a local subgroup. In this setting no intra-node communication should happen.
    • The virtual topology is fixed to [0, 1, 2, ...] for ease of gradient estimation.
    • Expected gradients are computed for rank 0 (assuming no intra-node communication, thus checking that allreduce does not reduce any outside grads).
    • I make sure that rank 0 and its peer have the same estimated gradients (this step also checks that broadcasting works). This is because in the CUBE scenario peers receive gradients from and send gradients to the same peer.
    • Make sure that the last node has different gradients.
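For reference, the way I read the two virtual topologies exercised by the tests above can be written as plain partner-selection functions. This is an illustrative sketch based on the dissemination and cube schedules described in the GossipGraD paper, not necessarily the exact indexing used in this PR:

```python
# Illustrative partner schedules for n gossip nodes over log2(n) steps; my
# reading of the dissemination/cube designs, not necessarily this PR's code.
import math


def dissemination_peers(rank: int, n: int, step: int) -> tuple[int, int]:
    """Dissemination: at each step a node sends to and receives from *different* peers."""
    offset = 2 ** step
    return (rank + offset) % n, (rank - offset) % n  # (send_to, recv_from)


def cube_peer(rank: int, n: int, step: int) -> int:
    """Cube: at each step a node exchanges with the *same* partner (hypercube edge).
    Assumes n is a power of two."""
    return rank ^ (2 ** step)


if __name__ == "__main__":
    n = 8
    for step in range(int(math.log2(n))):
        for rank in range(n):
            send_to, recv_from = dissemination_peers(rank, n, step)
            print(f"step {step} rank {rank}: dissemination send->{send_to} "
                  f"recv<-{recv_from}, cube partner {cube_peer(rank, n, step)}")
```

Printing the schedules for 8 ranks makes the behaviour the two tests check visible: the cube schedule is symmetric (each rank's partner sends back to it), while the dissemination schedule rotates peers every step.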

Checklist:

aovladi commented 2 years ago

Preliminary results for a 1-node, 8-GPU environment, T5 trained for 20 epochs. For GossipGraD, every GPU is treated as a separate node, and gossiping happens between 8 workers according to the algorithm's strategy. We can see that GossipGraD has a higher loss, which is understandable given the gossiping nature of the algorithm. The Dissemination topology seems to outperform the Cube topology.

| epoch | T5 baseline train loss | T5 baseline val loss | T5 cube train loss | T5 cube val loss | T5 dissemination train loss | T5 dissemination val loss |
| -- | -- | -- | -- | -- | -- | -- |
| 0 | 0.5319749713 | 0.07172757387 | 0.6147390008 | 0.07662624121 | 0.6097737551 | 0.07621511072 |
| 1 | 0.1037653387 | 0.03629698977 | 0.1144521311 | 0.04943695292 | 0.1163704321 | 0.05514102802 |
| 2 | 0.06340692937 | 0.02167312987 | 0.08212178946 | 0.02221572399 | 0.08911637962 | 0.02756368183 |
| 3 | 0.04630573094 | 0.01836876199 | 0.0566239953 | 0.01650846936 | 0.06193618476 | 0.02008434385 |
| 4 | 0.03806036711 | 0.01525175478 | 0.04499671236 | 0.01453202963 | 0.04938421026 | 0.01822413877 |
| 5 | 0.03251246363 | 0.01388591155 | 0.03816849738 | 0.01304518431 | 0.04330118373 | 0.01699820161 |
| 6 | 0.02878393233 | 0.01205849927 | 0.03363600001 | 0.01241523772 | 0.03741420433 | 0.01216574945 |
| 7 | 0.02600315027 | 0.01108295657 | 0.03085623123 | 0.01179173682 | 0.0310794618 | 0.01164832618 |
| 8 | 0.02370029502 | 0.01042199228 | 0.02827030048 | 0.01127741206 | 0.02809124999 | 0.01102655008 |
| 9 | 0.02200809307 | 0.01015946642 | 0.02633653581 | 0.01098989747 | 0.02600308321 | 0.01071218494 |
| 10 | 0.02062393539 | 0.009774439579 | 0.02475599572 | 0.01051200833 | 0.02438269742 | 0.01029212866 |
| 11 | 0.01939031854 | 0.009254269302 | 0.02343227156 | 0.01026331633 | 0.02303122357 | 0.009988318197 |
| 12 | 0.01839116402 | 0.009192653 | 0.02236232162 | 0.009947494604 | 0.02220279723 | 0.009867290966 |
| 13 | 0.01733924262 | 0.008881538175 | 0.0214888379 | 0.009735754691 | 0.02118554525 | 0.00957825128 |
| 14 | 0.01642912254 | 0.008818015456 | 0.0206311401 | 0.009749974124 | 0.0203171894 | 0.009568738751 |
| 15 | 0.01560532115 | 0.008875381202 | 0.01979763433 | 0.009525840171 | 0.01953843422 | 0.009297306649 |
| 16 | 0.01491208933 | 0.008848587982 | 0.01913302578 | 0.009336958639 | 0.01887300052 | 0.009192465805 |
| 17 | 0.01418704353 | 0.008672780357 | 0.01849162206 | 0.009204210714 | 0.01824515872 | 0.009089938365 |
| 18 | 0.01356204133 | 0.008460407145 | 0.01797264628 | 0.009171646088 | 0.01768364757 | 0.008925585076 |
| 19 | 0.01294515282 | 0.008718481287 | 0.01748350076 | 0.009162009694 | 0.01724962145 | 0.008809076622 |
| best | 0.01294515282 | ***0.008460407145*** | 0.01748350076 | 0.009162009694 | 0.01724962145 | 0.008809076622 |
awgu commented 2 years ago

Sorry if you already explained this, but do we have a heuristic for when to use the CUBE or DISSEMINATION topology (assuming an even number of nodes)? Otherwise, it might be confusing to the user which one to choose.

aovladi commented 2 years ago

@awgu Thanks for the comment. I will add a note based on what the authors provide, but I would like to experiment with both of them first at a bigger scale and understand the advantages and drawbacks of each topology myself. Then I can properly explain the differences in the documentation. At this point, the DISSEMINATION topology seems to be preferred due to the variety of communication peers, i.e. a node sends gradients to and receives gradients from different peers, whereas in the CUBE topology every node sends to and receives from the same partner. Experiments in the paper are shown for the DISSEMINATION topology only, if I remember correctly, so I would like to see whether CUBE performs better in some settings. If that is the case, I will update the docs accordingly.

rohan-varma commented 2 years ago

@aovladi In parallel, could we also work on a multi-node test setup where we can see the performance gain of GossipGraD over all_reduce?
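As a starting point, a rough single-file microbenchmark along these lines could compare a full all_reduce against one pairwise gossip-style exchange of the same buffer. The script below is only a sketch using plain `torch.distributed` calls; the tensor size, file name, and launch command are assumptions, and a real comparison should measure end-to-end training throughput rather than isolated collectives.

```python
# Rough microbenchmark sketch (not part of this PR): time a global all_reduce
# versus one pairwise gossip-style exchange of the same buffer.
# Example launch (assumed file name): torchrun --nproc_per_node=8 bench_gossip_vs_allreduce.py
import time

import torch
import torch.distributed as dist


def timed(fn, iters: int = 20) -> float:
    """Average wall-clock seconds per call, synchronized across ranks."""
    dist.barrier()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    dist.barrier()
    return (time.perf_counter() - start) / iters


def main() -> None:
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    assert world % 2 == 0, "this sketch pairs ranks, so it assumes an even world size"
    torch.cuda.set_device(rank % torch.cuda.device_count())

    grads = torch.randn(50_000_000, device="cuda")  # ~200 MB of fp32 "gradients"
    recv_buf = torch.empty_like(grads)
    partner = rank ^ 1  # fixed pairwise partner, a stand-in for a gossip peer

    def do_allreduce() -> None:
        dist.all_reduce(grads.clone())

    def do_gossip() -> None:
        ops = [
            dist.P2POp(dist.isend, grads, partner),
            dist.P2POp(dist.irecv, recv_buf, partner),
        ]
        for req in dist.batch_isend_irecv(ops):
            req.wait()

    t_allreduce = timed(do_allreduce)
    t_gossip = timed(do_gossip)
    if rank == 0:
        print(f"all_reduce:      {t_allreduce * 1e3:.2f} ms/iter")
        print(f"gossip exchange: {t_gossip * 1e3:.2f} ms/iter")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```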