Roadmap - GitHub Issues

justheuristic commented 4 years ago

This is a global project roadmap that states our priorities for the near future. These priorities can and should be disputed here or elsewhere, after which we will update the roadmap.

v0.7 "It runs something" (released)

[x] convert internal hivemind code to an open-source library
[x] switch from dmueller/kademlia to an internal DHT implementation
[x] run proof-of-concept experiments

v0.8 "It runs at scale" (released)

[x] Ensure DHT scales to 1000+ nodes
- [x] Switch from rpcudp to gRPC (due to scalability issues)
[x] Optimize hivemind.Server for a large number of small tensors
[x] Implement parallel backward in RemoteMixtureOfExperts
[x] Add benchmarks for MoE and DHT performance
[x] Publish to PyPI

v0.9 "It trains something" (released)

[x] Averaging gating function over peers #95
[x] Speed up beam search #92
[x] Optional tensor compression / quantization for RemoteExpert #88
[x] Refactoring concerns #98
[x] Open-source experiments outside our test infrastructure ( e.g. this )

v0.10 "You can train with us" (released)

Extended tutorials:
- [x] step by step tutorial on model training with DecentralizedAverager (#219 )
- [ ] ~~tutorial on defining and training custom experts~~ (postponed)
Make it easier to contribute compute to hivemind:
- [x] NAT traversal for household PCs ( #165 )
- [x] Support running hivemind in google colab / from behind firewalls (#146 #147 )
Further optimizations for DecentralizedAverager
- [ ] ~~Elastic scaling of moshpit averaging with the number of active trainers~~ (found workaround for now)
- [x] Advanced compression strategies to reduce communication throughput ( #170 #195 )
[x] Distributed training security
- [x] audit security issues #93
- [x] make it difficult for a malicious peer to jeopardize training ( #219 , TBC)
[x] Refactoring concerns #98

v1.0 "most of the code makes sense without reading the source" (nov-dec)

[x] overhaul optimizers
- must work decently with default parameters for both examples
- see optimizer roadmap in #398
[x] update quickstart.md
- use the new optimizer instead of DecentralizedSGD
[x] overhaul DHT benchmark
[x] add Optimizer benchmark

v1.1 "You can set up collaborative training easily"

Target scenario: 100 volunteers training 2xl-like over the internet

[ ] add more examples
- at least one should include a setup guide
[ ] additional tutorial with computer vision (dino, imagenet, dalle?)
Do something about the number of open files
- [x] investigate what contributes to # open files
- [x] is there a (cheap) way to reduce that to at most 4096 without compromising performance?
[x] Support training with only client and aux peers
- (A) ensure that aux peers can download state from clients or
- (B) add an option for an aux peer to pretend to be a normal peer with batch size = 0
[x] more extreme compression: powerSGD variant(s)
[x] investigate QUIC at scale
- [x] test hole punching
- [ ] make sure our config fully supports relays
[x] Remove duplicate CI runs
[ ] Add warnings to typical failure modes
[x] Deprecate CollaborativeOptimizer & co
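
For the open-files item above, a small probe like the following could help see how close a process is to its descriptor limit. This is a minimal sketch, not hivemind code: the helper names `count_open_fds` and `fd_headroom` are hypothetical, and it assumes a Unix system where `/proc/self/fd` (Linux) or `/dev/fd` lists the descriptors of the current process.

```python
import os
import resource  # standard library, Unix only
from typing import Tuple


def count_open_fds() -> int:
    """Count file descriptors currently open in this process (hypothetical helper)."""
    fd_dir = "/proc/self/fd"  # Linux
    if not os.path.isdir(fd_dir):
        fd_dir = "/dev/fd"  # assumed fallback for other Unix systems
    # Listing the directory briefly opens one extra descriptor, which is included in the count.
    return len(os.listdir(fd_dir))


def fd_headroom() -> Tuple[int, int, int]:
    """Return (open descriptors, soft limit, hard limit) for RLIMIT_NOFILE."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return count_open_fds(), soft, hard


if __name__ == "__main__":
    used, soft, hard = fd_headroom()
    print(f"open descriptors: {used}, soft limit: {soft}, hard limit: {hard}")
```

Calling `count_open_fds()` before and after starting a component (for example, a DHT node or an averager) gives a rough estimate of how many descriptors that component contributes.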

v1.2 "Decentralized model-parallelism"

Target scenario: 500 peers training 1B+ over the internet

[x] libp2p in hivemind.server
[ ] proper LoadBalancedExpert
[ ] hivemind.Optimizer in hivemind.server
[ ] FP16 in hivemind.server

Important, but not urgent

[x] more extreme compression: some way to integrate BNB directly
[ ] Security: option to use CenteredClip
[ ] Some means of saving expert snapshot in a fault-tolerant way #94
[ ] Some means for storing the training data (a-la scientific torrents)
[ ] enhanced API of hivemind.Optimizer (extract all necessary methods of StateAverager/ProgressTracker)
[ ] moshpit + elasticity
[ ] alternative linear programming variants
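
For context on the CenteredClip item above: the idea is to aggregate peer updates by repeatedly moving an estimate toward the updates while clipping each deviation to a radius, so that a single outlier can pull the result only a bounded distance. The following is a minimal pure-Python sketch under that description, not hivemind's implementation; the function name `centered_clip`, the zero starting point, and the fixed `tau` and `iters` parameters are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]


def centered_clip(updates: Sequence[Vector], tau: float, iters: int = 20) -> Vector:
    """Aggregate vectors by iteratively averaging deviations clipped to norm tau."""
    dim = len(updates[0])
    center = [0.0] * dim  # assumed starting point; a previous aggregate would also work
    for _ in range(iters):
        step = [0.0] * dim
        for update in updates:
            diff = [u - c for u, c in zip(update, center)]
            norm = math.sqrt(sum(d * d for d in diff))
            # Deviations longer than tau are scaled down to length tau.
            scale = 1.0 if norm <= tau else tau / norm
            for j in range(dim):
                step[j] += scale * diff[j]
        center = [c + s / len(updates) for c, s in zip(center, step)]
    return center


if __name__ == "__main__":
    honest = [[1.0, 1.0]] * 4
    malicious = [[1000.0, 1000.0]]
    print(centered_clip(honest + malicious, tau=1.0))
```

With four honest updates and one large outlier, the plain mean is dominated by the outlier, while the clipped aggregate stays near the honest updates.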

louis030195 commented 2 years ago

This sounds like an interesting project. I like the idea of decentralized computing, a bit like what cryptocurrencies do, but for general computing rather than currency, because computing can be quite expensive. In my mind, it would look a bit like a decentralized Kubernetes, without any security issues regarding access to others' hardware, and probably based on trading computing for a kind of currency (yes, it would still be cheaper than current centralized computing clouds).

About the roadmap, I see exciting technical details, but I don't see how people will see and find themselves sharing their resources for a common goal? Is there any plan to develop a UI or something like that?

Example: Bob and Alice each want to train a 200B-parameter GPT-3-style model, but each can afford only half the training cost. With this awesome UI, they could see that they share a common goal.

borzunov commented 2 years ago

Hi @louis030195!

> probably based on trading computing for a kind of currency

Yeah, there are a couple of projects related to this idea: vast.ai provides a service for users to lease/rent each other's GPUs, and BitTensor (cc @unconst) is built around a cryptocurrency serving as an incentive for people who help train models with their GPUs.

Currently, hivemind doesn't involve any financial incentives: we assume that volunteers are motivated by having access to the training outcome and recognition in the leaderboard. However, if time shows that the financial motivation is crucial, hivemind may serve as a backend for BitTensor nodes :)

> I don't see how people will see and find themselves sharing their resources for a common goal? Is there any plan to develop a UI or something like that?

For now, we assume it happens like this:

1. Initial collaborators find each other using our Discord or social media.
2. They discuss their model/dataset choices and write the code responsible for the model and dataset streaming.
3. They create a page explaining to other people how to join and advertise it (e.g., on social media).
4. People follow the instructions on the page, joining with their own GPUs or free cloud services like Google Colab.

An example of such a page is our demo where we train a DALL-E-like model.

However, I definitely agree that our project will benefit from a centralized UI where a new user can see all planned/ongoing training runs and join the ones they consider interesting :)

learning-at-home / hivemind

Roadmap #77