A new peer cannot synchronize its state with other peers on the first attempt because of a "list index out of range" error. At best, the new peer succeeds only on the second attempt; at worst, it cannot synchronize its state at all.
Failed to load state from peers: list index out of range, retrying ...
Traceback (most recent call last):
File "/home/TRecover/venv/lib/python3.8/site-packages/hivemind/optim/optimizer.py", line 694, in load_state_from_peers
self.state_averager.load_state_from_peers(timeout=self.load_state_timeout, **kwargs)
File "/home/TRecover/venv/lib/python3.8/site-packages/hivemind/optim/state_averager.py", line 667, in load_state_from_peers
load_optimizer_state(self.optimizer, metadata["optimizer_metadata"], loaded_opt_tensors)
File "/home/TRecover/venv/lib/python3.8/site-packages/hivemind/optim/state_averager.py", line 720, in load_optimizer_state
flat_optimizer_state.append(flat_tensors[elem["index"]])
IndexError: list index out of range
I conducted an experiment to see how a new peer synchronizes its state with another peer (referred to below as the first peer).
An important clarification: the first peer and the new one have the same structure, i.e. the same metadata (which contains all non-tensor values) and the same number of tensors.
So new_peer.structure_shape() == first_peer.structure_shape() == (790, 637)
After the new peer requests the state from the first peer, the first peer dumps its state and yields the metadata and 637 tensors in 5083 parts in this function. However, the new peer receives the metadata but only 583 tensors in 5029 parts in this loop, and then calls the load_optimizer_state function here with the downloaded state. Since the metadata assumes a structure with 637 tensors, a "list index out of range" error occurs because only 583 tensors were received instead of 637. After the error, the new peer asks the first peer for its state again, and this repeats until the new peer manages to receive all parts of the tensors.
Thus, for some unknown reason, the new peer does not receive all parts of the tensors from the first peer: this async loop does not always return all the parts. I then found that the error first appears after the "Update p2pd to v0.3.8 (and libp2p to v0.17.0)" commit; before that commit, everything works well.
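The failure mode described above can be sketched in a few lines. The helper name rebuild_state and the metadata layout below are hypothetical stand-ins for hivemind's load_optimizer_state, but the indexing logic has the same shape: the metadata describes 637 tensor slots, and a truncated download of only 583 tensors makes the lookup run past the end of the received list.

```python
# Minimal sketch (hypothetical names): metadata-driven state reassembly
# fails with IndexError when fewer tensors arrive than the metadata expects.

def rebuild_state(metadata, received_tensors):
    """Reassemble a flat optimizer state by looking up tensors by index."""
    flat_state = []
    for entry in metadata:
        if entry["type"] == "tensor":
            # entry["index"] goes up to 636, but received_tensors may be shorter
            flat_state.append(received_tensors[entry["index"]])
        else:
            flat_state.append(entry["value"])
    return flat_state

metadata = [{"type": "tensor", "index": i} for i in range(637)]

# A complete download (637 of 637 tensors) reassembles fine ...
assert len(rebuild_state(metadata, list(range(637)))) == 637

# ... but a truncated download (583 of 637 tensors) hits the reported error.
try:
    rebuild_state(metadata, list(range(583)))
except IndexError as exc:
    print(f"IndexError: {exc}")  # -> IndexError: list index out of range
```

This is why the peer only recovers by re-requesting the full state: the metadata and the tensor stream must agree exactly, and any dropped parts invalidate the whole load.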
Output from the PyTorch environment collection script:
PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-124-generic-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.0
[pip3] pytorch-lightning==1.6.4
[pip3] torch==1.11.0
[pip3] torchmetrics==0.9.2
[conda] Could not collect
To Reproduce
Prepare environment:
Run the first peer:
After a few seconds run the second (new) peer:
Additionally, you can reinstall the library from the previous commit 35851c8ce96f74b0221c4a732cc22be070f3185f and verify that everything works well with it:
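For example, pinning the install to that commit might look like the following (the upstream repository URL is an assumption; adjust it to the fork you are actually using):

```shell
# Hypothetical reinstall commands; the repository URL is assumed.
pip uninstall -y hivemind
pip install "git+https://github.com/learning-at-home/hivemind.git@35851c8ce96f74b0221c4a732cc22be070f3185f"
```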