learning-at-home / hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

Unable to decrease loss OR unable to synchronize #515

Closed chavinlo closed 1 year ago

chavinlo commented 1 year ago

[Pasted from the Discord] Hello, I'm currently trying to train a Stable Diffusion finetune using Hivemind's strategy in PyTorch Lightning. So far I have gotten two peers to connect to each other and synchronize.

Description

The problem I currently have is that, in order for the peers to have enough time to synchronize, I have to set target_batch_size to a high value. I tried 8192, as that was what was specified in the PyTorch Lightning docs; from my understanding, it is basically the number of batches to accumulate before taking another optimizer step. However, when it is set to a high value (e.g. 8192), training seems to go too fast, to the point where it's unable to decrease the loss: [screenshot: loss curve]

And when setting it to a low value, such as 32, the first node is able to decrease the loss, but the other peer is unable to synchronize in time:

Peer 1, main node, normal pace: [screenshot]
Peer 2, trying to synchronize, fails: [screenshot]
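
A rough way to see the tradeoff (editor's note, not from the original report): hivemind's optimizer performs one synchronized step after roughly target_batch_size samples have been accumulated across all peers, so the setting controls how often optimizer steps happen and, therefore, how much time peers get to synchronize between them. A back-of-envelope sketch, assuming 2 peers and a per-peer batch size of 1 (both assumptions, not reported values):

# Back-of-envelope sketch; the peer count and per-peer batch size are assumptions.
num_peers = 2          # assumption: the two nodes described above
per_peer_batch = 1     # assumption: per-GPU batch size used by the trainer

for target_batch_size in (32, 8192):
    batches_per_step = target_batch_size / (num_peers * per_peer_batch)
    print(f"target_batch_size={target_batch_size}: "
          f"~{batches_per_step:.0f} batches per peer between optimizer steps")

# target_batch_size=32   -> ~16 batches between steps (little time to sync state)
# target_batch_size=8192 -> ~4096 batches between steps (parameters only change at
#                           those rare steps, so the loss curve barely moves in between)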

The code base I am using is the Waifu Diffusion trainer (https://github.com/harubaru/waifu-diffusion), which seems to work well when using other strategies such as DeepSpeed. The model being finetuned is WD 1.3, float32: https://huggingface.co/hakurei/waifu-diffusion-v1-3/blob/main/wd-v1-3-full-opt.ckpt

These are the pieces of code I have modified: ./main.py

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import HivemindStrategy
from hivemind import Float16Compression

trainer = Trainer(
    gpus=1,
    precision=32,
    amp_backend="native",
    strategy=HivemindStrategy(
        initial_peers=['tcp','udp'],  # actual multiaddrs redacted
        target_batch_size=32,
        grad_compression=Float16Compression(),
        state_averaging_compression=Float16Compression(),
        verbose=True),
    benchmark=True,
    limit_val_batches=0,
    num_sanity_val_steps=0,
    accumulate_grad_batches=1)

My final question would be: is there a way to increase target_batch_size without accelerating the learning rate too much? Or is there at least a way to extend the steps-per-batch count appropriately? The docs only mention said parameter, and there aren't any public implementations of this strategy on GitHub other than a Rocket League bot.
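
Editor's note, not from the original thread: a common heuristic here is the linear-scaling rule, i.e. retune the learning rate explicitly relative to the batch size it was originally validated at, instead of letting a larger target_batch_size implicitly change the update dynamics. A minimal sketch, assuming the base LR was tuned for an effective batch of 32; all numbers are placeholders:

reference_batch = 32      # assumption: effective batch the base LR was tuned for
reference_lr = 1e-5       # assumption: base LR from the original finetuning config
target_batch_size = 8192  # desired global (accumulated) batch

# Linear-scaling heuristic: grow the LR proportionally with the effective batch.
scaled_lr = reference_lr * target_batch_size / reference_batch
print(f"suggested lr for target_batch_size={target_batch_size}: {scaled_lr:g}")

# If the scaled value is too aggressive for a finetune, the usual alternative is to
# keep reference_lr and add warmup; either way the LR is chosen explicitly rather
# than being a side effect of the accumulation setting.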

Attempts at fixing it

Discord user alex-snd recommended using the average_state_every flag via optimizer_kwargs; however, when I place it in the configuration as follows:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import HivemindStrategy
from hivemind import Float16Compression

trainer = Trainer(
    gpus=1,
    precision=32,
    amp_backend="native",
    strategy=HivemindStrategy(
        initial_peers=['tcp','udp'],  # actual multiaddrs redacted
        target_batch_size=32,
        grad_compression=Float16Compression(),
        state_averaging_compression=Float16Compression(),
        verbose=True,
        optimizer_kwargs={'average_state_every': 5}),  # triggers the error below
    benchmark=True,
    limit_val_batches=0,
    num_sanity_val_steps=0,
    accumulate_grad_batches=1)

It fails to execute:

...
  File "/home/user/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/pytorch_lightning/strategies/hivemind.py", line 276, in on_train_batch_start
    self._initialize_hivemind()
  File "/home/user/.local/lib/python3.10/site-packages/pytorch_lightning/strategies/hivemind.py", line 214, in _initialize_hivemind
    opt = hivemind.Optimizer(
TypeError: Optimizer.__init__() got an unexpected keyword argument 'optimizer_kwargs'

I have also tried editing the parameter directly in hivemind's optimizer.py, but that ends with the same result as setting target_batch_size to 32.
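
Editor's note: the traceback above suggests that HivemindStrategy forwards any keyword arguments it does not recognize straight through to hivemind.Optimizer, which is why the literal key optimizer_kwargs shows up as the unexpected argument. If that reading is right, average_state_every would be passed at the top level rather than wrapped in a dict. An untested sketch, reusing the same redacted peers as above:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import HivemindStrategy
from hivemind import Float16Compression

trainer = Trainer(
    gpus=1,
    precision=32,
    amp_backend="native",
    strategy=HivemindStrategy(
        initial_peers=['tcp','udp'],  # redacted multiaddrs, as in the report
        target_batch_size=32,
        grad_compression=Float16Compression(),
        state_averaging_compression=Float16Compression(),
        verbose=True,
        average_state_every=5),  # forwarded by the strategy to hivemind.Optimizer
    benchmark=True,
    limit_val_batches=0,
    num_sanity_val_steps=0,
    accumulate_grad_batches=1)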

Environment

Hivemind version: hivemind==1.1.1
PyTorch version: 1.12.1
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-128-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 520.61.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.3
[pip3] pytorch-lightning==1.7.7
[pip3] torch==1.12.1
[pip3] torchaudio==0.12.1
[pip3] torchmetrics==0.10.0
[pip3] torchvision==0.13.1
[conda] blas 2.116 mkl conda-forge
[conda] blas-devel 3.9.0 16_linux64_mkl conda-forge
[conda] cudatoolkit 11.6.0 hecad31d_10 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libblas 3.9.0 16_linux64_mkl conda-forge
[conda] libcblas 3.9.0 16_linux64_mkl conda-forge
[conda] liblapack 3.9.0 16_linux64_mkl conda-forge
[conda] liblapacke 3.9.0 16_linux64_mkl conda-forge
[conda] mkl 2022.1.0 h84fe81f_915 conda-forge
[conda] mkl-devel 2022.1.0 ha770c72_916 conda-forge
[conda] mkl-include 2022.1.0 h84fe81f_915 conda-forge
[conda] numpy 1.23.3 py310h53a5b5f_0 conda-forge
[conda] pytorch 1.12.1 py3.10_cuda11.6_cudnn8.3.2_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchaudio 0.12.1 py310_cu116 pytorch
[conda] torchvision 0.13.1 py310_cu116 pytorch

justheuristic commented 1 year ago

JFYI: we're taking a while to answer right now. If you closed this issue because you found the answer yourself, I'm glad you did. If not, please leave it open - we'll still answer as soon as we catch a breather.

chavinlo commented 1 year ago

> JFYI: we're taking a while to answer right now. If you closed this issue because you found the answer yourself, I'm glad you did. If not, please leave it open - we'll still answer as soon as we catch a breather.

Yeah, don't worry, I got help from the Discord and got it running. I still have some problems with synchronization, but I'm still tinkering with them. Thanks for the reply.