learning-at-home / hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

[BUG] Failed to load_state_from_peers at the first time because of "list index out of range" error #504

Open alex-snd opened 2 years ago

alex-snd commented 2 years ago

Describe the bug

A new peer cannot synchronize its state with other peers on the first attempt because of a list index out of range error. At best, the new peer succeeds on the second attempt; at worst, it cannot synchronize its state at all.

Failed to load state from peers: list index out of range, retrying ...
Traceback (most recent call last):
  File "/home/TRecover/venv/lib/python3.8/site-packages/hivemind/optim/optimizer.py", line 694, in load_state_from_peers
    self.state_averager.load_state_from_peers(timeout=self.load_state_timeout, **kwargs)
  File "/home/TRecover/venv/lib/python3.8/site-packages/hivemind/optim/state_averager.py", line 667, in load_state_from_peers
    load_optimizer_state(self.optimizer, metadata["optimizer_metadata"], loaded_opt_tensors)
  File "/home/TRecover/venv/lib/python3.8/site-packages/hivemind/optim/state_averager.py", line 720, in load_optimizer_state
    flat_optimizer_state.append(flat_tensors[elem["index"]])
IndexError: list index out of range

I conducted an experiment to see how the new peer synchronizes its state with another peer (referred to below as the first peer). An important clarification: the first peer and the new one have the same structure, i.e. the same number of tensors and the same metadata (which contains all non-tensor values).

from typing import Tuple

def structure_shape(self) -> Tuple[int, int]:
    # (number of optimizer metadata entries, number of state tensors)
    metadata, all_tensors, _ = self.hivemind_optimizer.state_averager.get_current_state()
    return len(metadata['optimizer_metadata']), len(all_tensors)

So new_peer.structure_shape() == first_peer.structure_shape() == (790, 637).

After the new peer requests the state from the first peer, the first peer dumps its state and yields the metadata and 637 tensors in 5083 parts in this function. But the new peer receives the metadata and only 583 tensors in 5029 parts in this loop, and then calls the load_optimizer_state function here with the downloaded state. Since the metadata assumes a structure with 637 tensors, a list index out of range error occurs because only 583 of the 637 tensors were received. After the error, the new peer asks the first peer for its state again; this repeats until the new peer manages to receive all the parts of the tensors.
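To make the failure mode concrete, here is a minimal sketch with stand-in data (the dictionaries and placeholder objects below are illustrative, not hivemind's actual state format): the metadata references 637 tensor slots, but only 583 tensors arrived, so indexing past the end of the received list raises the IndexError from the traceback.

# Minimal sketch of the failure mode with stand-in data (not hivemind's real state).
metadata = [{"type": "tensor", "index": i} for i in range(637)]  # expects 637 tensors
received_tensors = [object()] * 583  # only 583 tensors actually arrived

flat_optimizer_state = []
for elem in metadata:
    if elem.get("type") == "tensor" and isinstance(elem.get("index"), int):
        # raises IndexError: list index out of range once elem["index"] >= 583
        flat_optimizer_state.append(received_tensors[elem["index"]])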

Thus, I realized that, for some unknown reason, the new peer does not receive all the tensor parts from the first peer: it is this async loop that does not always return all the parts. I then found out that this error starts to occur after the "Update p2pd to v0.3.8 (and libp2p to v0.17.0)" commit; before that commit, everything works well.
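As a hedged illustration (a sketch, not upstream hivemind code), a pre-check along these lines before load_optimizer_state would turn the confusing IndexError into an explicit "incomplete download" error. check_state_complete is a hypothetical helper; metadata and loaded_opt_tensors mirror the names from the traceback above.

from typing import Any, Dict, Sequence

def check_state_complete(metadata: Dict[str, Any], loaded_opt_tensors: Sequence) -> None:
    # Hypothetical helper: count how many tensors the metadata expects and
    # fail fast if fewer were downloaded, instead of crashing mid-restore.
    expected = sum(1 for elem in metadata["optimizer_metadata"] if elem.get("type") == "tensor")
    if len(loaded_opt_tensors) < expected:
        raise ValueError(
            f"received {len(loaded_opt_tensors)} of {expected} optimizer tensors; "
            f"the state download was incomplete, retry load_state_from_peers"
        )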

To Reproduce

Prepare environment:

git clone -b hivemind_bag https://github.com/alex-snd/TRecover.git
cd TRecover
python3 -m venv venv
source venv/bin/activate
pip install git+https://github.com/learning-at-home/hivemind.git@de6b4f5ae835a633ca7876209f2929d069e988f0
pip install -e .[collab]
trecover init
trecover download data

Run the first peer:

trecover collab train --experiment-prefix bag --batch-size 1 --bandwidth 80

After a few seconds run the second (new) peer:

trecover collab train --initial-peers /COPY/ADDRESS/FROM/FIRST/PEER/CONSOLE/OUTPUT --experiment-prefix bag --batch-size 1 --bandwidth 80

Additionally, you can reinstall the library from the earlier commit 35851c8ce96f74b0221c4a732cc22be070f3185f and verify that everything works well with it:

pip uninstall hivemind -y
pip install git+https://github.com/learning-at-home/hivemind.git@35851c8ce96f74b0221c4a732cc22be070f3185f
# and repeat the experiment above.

Environment

justheuristic commented 2 years ago

Thanks for the detailed report! We're going to check whether the error goes away with older/newer versions of libp2p and report back what we find.

alex-snd commented 2 years ago

I suspect that this error is caused by the QUIC transport, which is always enabled, as stated here.

I set quic=True, and this error started to occur even in the commit that previously worked fine.
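For reference, a minimal sketch of reproducing with QUIC on: hivemind can be asked to listen on a QUIC multiaddr via host_maddrs (whether the older commit honors the /quic multiaddr exactly this way is an assumption on my part, not a verified API detail).

# Hedged sketch: start a DHT node that also listens on a QUIC multiaddr
# (assumes this hivemind version forwards host_maddrs to the p2p daemon).
import hivemind

dht = hivemind.DHT(
    host_maddrs=["/ip4/0.0.0.0/tcp/0", "/ip4/0.0.0.0/udp/0/quic"],
    start=True,
)
print("Visible maddrs:", [str(maddr) for maddr in dht.get_visible_maddrs()])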