learning-at-home / hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

[BUG][MINOR] Downloading state during averaging (and vice versa) #445

Open justheuristic opened 2 years ago

justheuristic commented 2 years ago

(reported by CALM volunteers)

Describe the bug: This happens to a new peer that joins training while the others are averaging parameters. Since every existing peer is busy averaging, the newbie peer gets stuck in the following loop:

This repeats for `floor(averaging_time / next_chunk_timeout)` attempts until state averaging is done; then it proceeds normally. In the worst case, if the newcomer tries once for every other peer, it will skip the initial `load_state_from_peers`. However, it will still detect that it is out of sync and retry.
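As a back-of-the-envelope check, the number of failed download attempts described above can be sketched like this. This is not hivemind API code; the helper and its parameter names are hypothetical and simply mirror the quantities named in the issue (`averaging_time`, `next_chunk_timeout`):

```python
# Minimal sketch (hypothetical helper, not part of hivemind):
# estimate how many times a joining peer times out while the
# rest of the group is still busy averaging.
import math


def expected_retries(averaging_time: float, next_chunk_timeout: float) -> int:
    """Each download attempt times out after next_chunk_timeout seconds
    while all peers are averaging, so the newcomer retries roughly
    floor(averaging_time / next_chunk_timeout) times before any peer
    can serve its state."""
    return math.floor(averaging_time / next_chunk_timeout)


# e.g. a 60-second averaging round with a 5-second per-chunk timeout
print(expected_retries(60.0, 5.0))  # → 12
```

In other words, the longer an averaging round takes relative to the chunk timeout, the more wasted retries the newcomer makes before it can download state.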

To Reproduce

This is how it looks from the user's perspective: [screenshot]

This is how it looks on an auxiliary peer: [screenshot]

Environment

This behavior is an algorithmic side effect of how the averager is implemented in hivemind. It should not depend on the Python/PyTorch versions.

Possible solutions (non-exhaustive)

blurry-mood commented 1 year ago

I'm encountering this issue; are there any workarounds?