learning-at-home / hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

[BUG] Unable to train a bfloat16-compressed model #545

Open the-beee opened 1 year ago

the-beee commented 1 year ago

Describe the bug

Jan 04 22:30:14.302 [INFO] test-run-1112b accumulated 10 samples for epoch #0 from 2 peers. ETA 0.00 sec (refresh in 0.50 sec)
Jan 04 22:30:14.476 [INFO] Beginning optimizer step #0
Jan 04 22:31:26.924 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hivemind/optim/power_sgd_averager.py", line 159, in _aggregate_with_group
    torch.matmul(m.reshape(-1, q.size(0)), q, out=p)
RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead
Jan 04 22:31:26.925 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'\xc4\xaf\xafv&\xcd\xd6q\xa3\xea\x9d-\x13\x0f\xa4hNQ\xf6>PHASE_P' did not finish.
Jan 04 22:31:26.925 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'\xc4\xaf\xafv&\xcd\xd6q\xa3\xea\x9d-\x13\x0f\xa4hNQ\xf6>PHASE_Q' did not finish.
Jan 04 22:31:26.925 [WARN] [hivemind.averaging.averager._step:482] PowerSGDGradientAverager caught MatchmakingException('Unable to run All-Reduce: Expected out tensor to have dtype c10::BFloat16, but got float instead'), retrying
Jan 04 22:35:47.094 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hivemind/optim/power_sgd_averager.py", line 159, in _aggregate_with_group
    torch.matmul(m.reshape(-1, q.size(0)), q, out=p)
RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead
Jan 04 22:35:47.094 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'Q\x84\x8d\xa9\xf3\x90\xd4\xdf\xcc]\x153\x0c+\x9e\x90|\xed|\x8ePHASE_P' did not finish.
Jan 04 22:35:47.094 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'Q\x84\x8d\xa9\xf3\x90\xd4\xdf\xcc]\x153\x0c+\x9e\x90|\xed|\x8ePHASE_Q' did not finish.
Jan 04 22:35:47.095 [WARN] [hivemind.averaging.averager._step:482] PowerSGDGradientAverager caught MatchmakingException('Unable to run All-Reduce: Expected out tensor to have dtype c10::BFloat16, but got float instead'), retrying
Jan 04 22:40:07.221 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
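
For context, the error itself is a plain PyTorch check: torch.matmul refuses an out= buffer whose dtype does not match its inputs. A standalone sketch (arbitrary shapes, unrelated to the real PowerSGD buffers) reproduces the same message:

import torch

# bfloat16 inputs multiplied into a float32 out buffer, as in the traceback above
m = torch.randn(8, 4, dtype=torch.bfloat16)
q = torch.randn(4, 2, dtype=torch.bfloat16)
p = torch.empty(8, 2, dtype=torch.float32)

# Raises: RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead
torch.matmul(m, q, out=p)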

To Reproduce

git clone https://github.com/the-beee/naifu-diffusion
cd naifu-diffusion
pip install -r requirements.txt
python trainer.py

Before starting the second peer, please update config/distributed.yaml so that the hivemind section includes the first peer's address (see the sketch below).
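
What that config entry amounts to under the hood is hivemind's initial_peers mechanism; the exact YAML field names in naifu-diffusion are not shown here, only a Python sketch of how the first peer's address is obtained and handed to the second peer:

import hivemind

# first peer: starts its own DHT and prints the multiaddresses other peers can dial
dht_first = hivemind.DHT(start=True)
print([str(addr) for addr in dht_first.get_visible_maddrs()])

# second peer: bootstraps from the first peer's address
# (this is the value config/distributed.yaml needs to carry in its hivemind section)
dht_second = hivemind.DHT(
    initial_peers=[str(dht_first.get_visible_maddrs()[0])],
    start=True,
)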

Environment

Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] pytorch-lightning==1.8.6
[pip3] torch==1.13.1
[pip3] torch-ema==0.3
[pip3] torchmetrics==0.11.0
[pip3] torchvision==0.14.1
[pip3] hivemind==1.1.4
[conda] Could not collect
justheuristic commented 1 year ago

Hi! Thanks for the detailed report! It is indeed a bug, and we'll fix it in the next release. In the meantime, I'm afraid the only workaround is to keep float32 parameters in hivemind.Optimizer while the on-device model stays in bfloat16.
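
A minimal sketch of that workaround, assuming quickstart-style hivemind.Optimizer arguments (the tiny model, run_id, and batch sizes below are placeholders, not the naifu-diffusion setup):

import torch
import torch.nn as nn
import hivemind

# on-device model stays in bfloat16
model = nn.Linear(16, 16).to(torch.bfloat16)

# float32 master copies: these are the params hivemind.Optimizer averages and updates
master_params = [p.detach().clone().float().requires_grad_(True) for p in model.parameters()]

dht = hivemind.DHT(start=True)
opt = hivemind.Optimizer(
    dht=dht,
    run_id="test-run",                                    # placeholder run id
    optimizer=torch.optim.AdamW(master_params, lr=1e-4),
    batch_size_per_step=1,
    target_batch_size=10,
    verbose=True,
)

for _ in range(100):
    x = torch.randn(4, 16, dtype=torch.bfloat16)
    loss = model(x).float().pow(2).mean()                 # bfloat16 forward/backward
    loss.backward()
    # hand the gradients to the float32 master copies before stepping
    for p, mp in zip(model.parameters(), master_params):
        mp.grad = p.grad.detach().float()
    opt.step()
    model.zero_grad(set_to_none=True)
    # write the updated float32 weights back into the bfloat16 model
    with torch.no_grad():
        for p, mp in zip(model.parameters(), master_params):
            p.copy_(mp.to(torch.bfloat16))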