learning-at-home / hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

How well does it scale? #575

Open lonnietc opened 1 year ago

lonnietc commented 1 year ago

Hello,

I am researching P2P solutions and am wondering: how well does Hivemind scale?

Thanks

borzunov commented 1 year ago

Hi @lonnietc,

Please take a look at our publications that contain experiments for different algorithms implemented in hivemind: https://github.com/learning-at-home/hivemind#citation (take a look at the newest papers in "Additional publications" too). Hope it helps!

@justheuristic @mryab cc-ing you in case you have anything else to say.

justheuristic commented 1 year ago

Hivemind has several components that have different scaling properties.

For instance, hivemind.dht.DHT scales to 8192 nodes more or less seamlessly, and it could probably go larger if we had the RAM (and patience) to test it.
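For context, here is a minimal sketch of bootstrapping DHT nodes, following the quickstart-style API (the peer address is a placeholder):

```python
import hivemind

# Start the first DHT node; it announces multiaddresses that other peers can use to join.
dht = hivemind.DHT(start=True)
print("Initial peers:", [str(addr) for addr in dht.get_visible_maddrs()])

# Any other volunteer joins the same swarm by pointing at one or more known peers
# (placeholder multiaddress below):
# peer_dht = hivemind.DHT(initial_peers=["/ip4/1.2.3.4/tcp/31337/p2p/Qm..."], start=True)
```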

In turn, hivemind.Optimizer requires some tweaking to go beyond 256 nodes: different averaging timeouts and/or groups. The only time (to my knowledge) that we tested it with more than 1k nodes, it required multiple averaging groups, as in this paper.
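As a rough sketch of where those knobs live, assuming the quickstart-style hivemind.Optimizer API (the values below are placeholders, not tuned recommendations for large swarms):

```python
import torch
import hivemind

model = torch.nn.Linear(512, 10)
base_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
dht = hivemind.DHT(start=True)  # in practice, join an existing swarm via initial_peers

opt = hivemind.Optimizer(
    dht=dht,
    run_id="demo_run",          # common identifier shared by all peers in this run
    optimizer=base_opt,
    batch_size_per_step=32,     # samples each peer processes per local step
    target_batch_size=10_000,   # global batch size that triggers an averaging round
    matchmaking_time=15.0,      # how long peers spend forming an averaging group
    averaging_timeout=60.0,     # allow slower peers more time before they are dropped
    verbose=True,
)
```

Increasing matchmaking_time and averaging_timeout is the kind of tweaking meant above; going past a few hundred peers also meant splitting peers into multiple averaging groups.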

As for hivemind.moe, its scaling properties depend on the network design. A model with multiple smaller MoE layers scales to more nodes than one with a single big MoE layer, and a 2D expert grid scales better than a 1D grid. I'd hazard a guess that a single MoE layer can scale to thousands of nodes with some tinkering (grid size, beam search params), but I haven't ever tried that.
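For illustration, a rough sketch of what the grid and beam-search knobs look like on the client side, assuming the hivemind.RemoteMixtureOfExperts API (exact parameter names may differ between versions):

```python
import torch
import hivemind

dht = hivemind.DHT(start=True)  # in practice, join an existing swarm via initial_peers

# A 2D expert grid (32 x 32 = up to 1024 experts) tends to scale better than a flat 1D grid.
moe_layer = hivemind.RemoteMixtureOfExperts(
    in_features=1024,
    grid_size=(32, 32),    # shape of the expert grid announced in the DHT
    dht=dht,
    uid_prefix="expert.",  # experts are expected to be registered under this prefix
    k_best=4,              # beam search over the grid picks the top-4 experts per input
)

outputs = moe_layer(torch.randn(2, 1024))
```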