Describe the bug
This happens to a new peer that joins training while others are averaging parameters. Since all peers are averaging parameters, the new peer gets stuck in the following loop:

1. The newcomer requests state from a random peer.
2. That peer is busy averaging parameters and gets stuck at this line: averager.py:658 (the lock for `get_tensors` is held by `state_averager.step`).
3. The newcomer gets a `TimeoutError` because the target did not respond within `next_chunk_timeout`.
4. The newcomer prints an error message and tries again with a new peer, which is also busy.

This repeats for `floor(averaging_time / next_chunk_timeout)` attempts until state averaging is done; after that, the newcomer proceeds normally. In the worst case, if the newcomer tries every other peer once, it will skip the initial `load_state_from_peers` entirely. However, it will still detect being out of sync and retry.
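A minimal self-contained sketch of that loop (the timings, peer names, and `request_state` stub are hypothetical stand-ins for the real RPC; only the retry arithmetic matters):

```python
import math
import random
import time

# Hypothetical timings for illustration; real values come from the run's config.
AVERAGING_TIME = 6.0       # how long peers hold the tensor lock while averaging
NEXT_CHUNK_TIMEOUT = 0.5   # newcomer's per-request timeout

def request_state(peer: str, timeout: float):
    """Stand-in for the real RPC: every peer is busy, so it always times out."""
    time.sleep(timeout)  # the attempt itself burns the full timeout
    raise TimeoutError(f"{peer} did not respond within {timeout}s")

peers = [f"peer-{i}" for i in range(16)]
start, attempts = time.monotonic(), 0

while time.monotonic() - start < AVERAGING_TIME:  # i.e. until averaging finishes
    target = random.choice(peers)
    try:
        request_state(target, timeout=NEXT_CHUNK_TIMEOUT)
        break
    except TimeoutError:
        attempts += 1  # print an error and retry with another busy peer

# Each failed attempt costs next_chunk_timeout seconds, hence the bound:
print(f"{attempts} failed attempts, "
      f"~floor(averaging_time / next_chunk_timeout) = "
      f"{math.floor(AVERAGING_TIME / NEXT_CHUNK_TIMEOUT)}")
```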
To Reproduce
The bug reproduces whenever `state_averager.step` takes more time than `next_chunk_timeout`.
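The condition can be mimicked without hivemind at all. A sketch with plain threading, where the lock stands in for the one guarding `get_tensors` and all timings are made up:

```python
import threading
import time

NEXT_CHUNK_TIMEOUT = 1.0         # hypothetical value, shorter than the step below

tensor_lock = threading.Lock()   # stands in for the lock guarding get_tensors

def state_averager_step():
    # Holds the tensor lock for longer than next_chunk_timeout;
    # this is exactly the condition that triggers the bug.
    with tensor_lock:
        time.sleep(3.0)

def serve_state_request():
    # The serving peer cannot read its tensors until step() releases the lock,
    # so the newcomer's request times out first.
    if not tensor_lock.acquire(timeout=NEXT_CHUNK_TIMEOUT):
        raise TimeoutError("peer busy: state averaging holds the tensor lock")
    try:
        return "state tensors"
    finally:
        tensor_lock.release()

threading.Thread(target=state_averager_step).start()
time.sleep(0.1)                  # let step() grab the lock first
try:
    serve_state_request()
except TimeoutError as e:
    print("newcomer sees:", e)   # TimeoutError after next_chunk_timeout
```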
This is how it looks from the user's perspective:
This is how it looks on an auxiliary peer:
Environment
This behavior is an algorithmic side effect of how the averager is implemented in hivemind; it should not depend on Python or PyTorch versions.
- Python version: 3.7 (or any other)
- hivemind version: master (1.1.0.dev0)
- PyTorch version: 1.10 (numpy version is irrelevant)
Possible solutions (non-exhaustive)
- newcomer: somehow detect when state averaging is in progress and wait for up to `averaging_timeout` seconds? (a sketch follows below)
- add an option to not acquire the lock during `load_state_from_peers` (this works fine now, but may be unsafe for some optimizers / averagers)
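A sketch of the first option, assuming a hypothetical `request_state` callable in place of the actual state-loading RPC: keep one shared deadline of `averaging_timeout` across retries instead of treating each per-peer `TimeoutError` as a failure.

```python
import random
import time

def load_state_with_deadline(peers, request_state,
                             averaging_timeout, next_chunk_timeout):
    """Retry until one shared deadline instead of failing on each TimeoutError.
    `request_state` is a hypothetical stand-in for the real state-loading RPC."""
    deadline = time.monotonic() + averaging_timeout
    while time.monotonic() < deadline:
        target = random.choice(peers)
        try:
            return request_state(target, timeout=next_chunk_timeout)
        except TimeoutError:
            # The peer is likely holding the tensor lock during averaging;
            # back off briefly and try another peer within the same deadline.
            time.sleep(min(1.0, max(0.0, deadline - time.monotonic())))
    raise TimeoutError("no peer served the state within averaging_timeout")
```

The second option would presumably add a flag (e.g. a hypothetical `require_lock=False`) to `load_state_from_peers`, trading lock safety for availability.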
(reported by CALM volunteers)