Describe the bug
This happens to a new peer that joins training while others are averaging parameters. Since all peers are averaging parameters, the new peer gets stuck in the following loop:

1. The newcomer requests state from a random peer.
2. That peer is busy averaging parameters and gets stuck at this line: averager.py:658 (the lock for `get_tensors` is held by `state_averager.step`).
3. The newcomer gets a `TimeoutError` because the target did not respond within `next_chunk_timeout`.
4. The newcomer prints an error message and tries again with a new peer, which is also busy.

This repeats for `floor(averaging_time / next_chunk_timeout)` attempts until state averaging is done; after that, the newcomer proceeds normally. In the worst case, if the newcomer tries every other peer once, it will skip the initial `load_state_from_peers` entirely. However, it will still detect being out of sync and retry.
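A minimal self-contained sketch of that loop (the timings, peer names, and `request_state` stub are hypothetical stand-ins for the real RPC; only the retry arithmetic matters):

```python
import math
import random
import time

# Hypothetical timings for illustration; real values come from the run's config.
AVERAGING_TIME = 6.0       # how long peers hold the tensor lock while averaging
NEXT_CHUNK_TIMEOUT = 0.5   # newcomer's per-request timeout

def request_state(peer: str, timeout: float):
    """Stand-in for the real RPC: every peer is busy, so it always times out."""
    time.sleep(timeout)  # the attempt itself burns the full timeout
    raise TimeoutError(f"{peer} did not respond within {timeout}s")

peers = [f"peer-{i}" for i in range(16)]
start, attempts = time.monotonic(), 0

while time.monotonic() - start < AVERAGING_TIME:  # i.e. until averaging finishes
    target = random.choice(peers)
    try:
        request_state(target, timeout=NEXT_CHUNK_TIMEOUT)
        break
    except TimeoutError:
        attempts += 1  # print an error and retry with another busy peer

# Each failed attempt costs next_chunk_timeout seconds, hence the bound:
print(f"{attempts} failed attempts, "
      f"~floor(averaging_time / next_chunk_timeout) = "
      f"{math.floor(AVERAGING_TIME / NEXT_CHUNK_TIMEOUT)}")
```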
To Reproduce
The bug reproduces whenever `state_averager.step` takes more time than `next_chunk_timeout`.
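The condition can be mimicked without hivemind at all. A sketch with plain threading, where the lock stands in for the one guarding `get_tensors` and all timings are made up:

```python
import threading
import time

NEXT_CHUNK_TIMEOUT = 1.0         # hypothetical value, shorter than the step below

tensor_lock = threading.Lock()   # stands in for the lock guarding get_tensors

def state_averager_step():
    # Holds the tensor lock for longer than next_chunk_timeout;
    # this is exactly the condition that triggers the bug.
    with tensor_lock:
        time.sleep(3.0)

def serve_state_request():
    # The serving peer cannot read its tensors until step() releases the lock,
    # so the newcomer's request times out first.
    if not tensor_lock.acquire(timeout=NEXT_CHUNK_TIMEOUT):
        raise TimeoutError("peer busy: state averaging holds the tensor lock")
    try:
        return "state tensors"
    finally:
        tensor_lock.release()

threading.Thread(target=state_averager_step).start()
time.sleep(0.1)                  # let step() grab the lock first
try:
    serve_state_request()
except TimeoutError as e:
    print("newcomer sees:", e)   # TimeoutError after next_chunk_timeout
```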
This is how it looks from the user's perspective:
This is how it looks on an auxiliary peer:
Environment
This behavior is an algorithmic side effect of how the averager is implemented in hivemind; it should not depend on Python or PyTorch versions.
- Python version: 3.7 (or any other)
- hivemind version: master (1.1.0.dev0)
- PyTorch version: 1.10 (numpy version is irrelevant)
Possible solutions (non-exhaustive)
- newcomer: somehow detect when state averaging is in progress and wait for up to `averaging_timeout` seconds? (a sketch follows below)
- add an option to not acquire the lock during `load_state_from_peers` (this works fine now, but may be unsafe for some optimizers / averagers)
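A sketch of the first option, assuming a hypothetical `request_state` callable in place of the actual state-loading RPC: keep one shared deadline of `averaging_timeout` across retries instead of treating each per-peer `TimeoutError` as a failure.

```python
import random
import time

def load_state_with_deadline(peers, request_state,
                             averaging_timeout, next_chunk_timeout):
    """Retry until one shared deadline instead of failing on each TimeoutError.
    `request_state` is a hypothetical stand-in for the real state-loading RPC."""
    deadline = time.monotonic() + averaging_timeout
    while time.monotonic() < deadline:
        target = random.choice(peers)
        try:
            return request_state(target, timeout=next_chunk_timeout)
        except TimeoutError:
            # The peer is likely holding the tensor lock during averaging;
            # back off briefly and try another peer within the same deadline.
            time.sleep(min(1.0, max(0.0, deadline - time.monotonic())))
    raise TimeoutError("no peer served the state within averaging_timeout")
```

The second option would presumably add a flag (e.g. a hypothetical `require_lock=False`) to `load_state_from_peers`, trading lock safety for availability.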
(reported by CALM volunteers)