learning-at-home / hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.
MIT License

hivemind.averaging.partition.AllreduceException: Averaging step failed: could not find a group #519

Closed: chavinlo closed this issue 1 year ago

chavinlo commented 1 year ago

So I wrapped the optimizer with hivemind as stated in the docs. Everything goes well until it starts synchronizing. After 5 minutes it throws the following error:

Nov 12 20:18:11.061 [ERROR] [hivemind.averaging.averager._step:478] Averaging step failed: could not find a group
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/hivemind/averaging/averager.py", line 448, in _step
    raise AllreduceException("Averaging step failed: could not find a group")
hivemind.averaging.partition.AllreduceException: Averaging step failed: could not find a group
Nov 12 20:18:11.063 [INFO] Averaging failed with <class 'hivemind.averaging.partition.AllreduceException'>

Is there anything that can be done? What could be the cause of this? I can attach the whole code if needed.
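Roughly, the wrapping looks like this (a minimal sketch following the hivemind quickstart; the model, run_id and batch sizes are placeholders, not the exact values from my script):

```python
import torch
import hivemind

model = torch.nn.Linear(16, 16)  # stand-in for the real diffusion model
dht = hivemind.DHT(start=True)   # first node; later nodes pass initial_peers=[...]

opt = hivemind.Optimizer(
    dht=dht,
    run_id="sd_finetune",                     # placeholder experiment name
    optimizer=torch.optim.AdamW(model.parameters(), lr=1e-5),
    batch_size_per_step=1,                    # samples processed per local step
    target_batch_size=256,                    # global batch size that triggers averaging
    verbose=True,
)
```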

borzunov commented 1 year ago

Hi @chavinlo!

If possible, please share the code and the commands you use to start it on different machines, so we can look into the issue.

chavinlo commented 1 year ago

@borzunov Here's a gist of the training code I'm using (with hivemind included, of course). I run it on a 3090 24GB.

finetuner: https://gist.github.com/chavinlo/335266a3a6825ffafbec191e7d0e35bd
requirements: https://gist.github.com/chavinlo/b7ea0a79cfeb0c59c57105c82c8b2f3e
command: torchrun --nproc_per_node=1 finetune.py --model wdiffuser/ --run_name testrun --datasetserver="152.70.212.127:8080" --wantedimages=300 --resize="True" --gradient_checkpointing="True"

nproc_per_node: number of GPUs to run on
model: path of the model folder
run_name: name of the run (not related to hivemind)
datasetserver: address of the dataset server (the one provided is active)
wantedimages: number of images to fetch on every run
resize: whether to resize the images (otherwise they fill up VRAM quickly)
gradient_checkpointing: saves VRAM

So pretty much all you need is the model. The one I use is Waifu Diffusion, direct link: https://huggingface.co/hakurei/waifu-diffusion-v1-3/resolve/main/wd-v1-3-full-opt.ckpt. To run it you also need the model in diffusers format; here's the Hugging Face script to convert it: https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py. The output folder is what you pass as "model" to the finetuner.
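As a quick sanity check that the conversion worked, the output folder should load with diffusers before training starts (a minimal sketch; "wdiffuser" is just the folder name from the command above):

```python
from diffusers import StableDiffusionPipeline

# Load the converted diffusers folder, i.e. the path passed as --model to the finetuner.
pipe = StableDiffusionPipeline.from_pretrained("wdiffuser")
print(type(pipe.unet).__name__)  # prints the UNet class name if the conversion worked
```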

So the steps are just: download the script, install the requirements, download the model, convert it, and train. The first node will print its IP and DHT addresses. The second node should then be started with the same command as above, plus '--peers="one dht address here"'; that value is simply passed through as init_peers (see the sketch below).
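For reference, this is roughly how the addresses flow into hivemind (a minimal sketch assuming the standard DHT API; on a real second machine initial_peers would come from --peers rather than from the same process):

```python
import hivemind

# First node: start a fresh DHT and print the addresses other peers can dial.
dht = hivemind.DHT(start=True)
initial_peers = [str(addr) for addr in dht.get_visible_maddrs()]
print("Pass one of these to --peers on the other nodes:", initial_peers)

# Any later node: join the existing swarm by passing those addresses in.
# (reusing the list above keeps this example self-contained; a real second
# machine would use the multiaddr string received via --peers)
peer_dht = hivemind.DHT(initial_peers=initial_peers, start=True)
```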

Another issue I have with it is a memory leak (VRAM climbing from 15GB to 22GB), which I assume comes from the hivemind optimizer because it does not happen with the non-hivemind trainer. I usually clean it up by running killall python and killall python3.
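For what it's worth, a cleaner teardown than killall would be something like this (a sketch assuming the standard hivemind shutdown methods; I haven't verified that it fixes the leak):

```python
import torch

def shutdown_hivemind(opt, dht):
    """Release hivemind background workers and cached GPU memory at the end of a run.

    `opt` is the hivemind.Optimizer and `dht` the hivemind.DHT created at startup.
    """
    opt.shutdown()            # stops gradient averagers and background threads
    dht.shutdown()            # stops the DHT daemon
    torch.cuda.empty_cache()  # return cached blocks to the GPU driver
```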

This script does require an online dataset server. I have one active at the moment at 152.70.212.127:8080. It basically downloads a small chunk of the dataset to train on. Nothing explicit, just hololive artwork from Danbooru; you can download the dataset here if you want to check: https://buck-ani.s3.filebase.com/hololive_general.zip

If you want to run a server on your own dataset, here's the script: https://gist.github.com/chavinlo/0ee8e4556e9dc45934add12942eb9f53. Run it with python3 server_code.py --dataset="hololive_general", where "hololive_general" is the path to the dataset folder: images and text files paired by filename (e.g. data/101.jpg ; data/101.txt) and so on (see the sketch below for the expected layout).
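For reference, that folder layout can be walked like this (an illustrative sketch, not the actual server code):

```python
from pathlib import Path

def load_pairs(dataset_dir: str):
    """Yield (image_path, caption) pairs from a folder of NNN.jpg / NNN.txt files."""
    for image_path in sorted(Path(dataset_dir).glob("*.jpg")):
        caption_path = image_path.with_suffix(".txt")
        if caption_path.exists():
            yield image_path, caption_path.read_text().strip()

# Example: iterate over the dataset folder passed to --dataset
for image_path, caption in load_pairs("hololive_general"):
    print(image_path.name, caption[:60])
```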

If you need more info, just let me know. I am dep#2171; I think you are also on the hivemind Discord, and there is already a thread for this (stable diffusion finetuning) there.

chavinlo commented 1 year ago

After multiple tests with the older script, it seems the failure could be related to the timeout. I might try again later.
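If it is indeed the timeout, the knobs I would try are hivemind.Optimizer's matchmaking_time and averaging_timeout (a minimal sketch with placeholder values, not tested yet):

```python
import torch
import hivemind

model = torch.nn.Linear(16, 16)  # stand-in for the real diffusion model
dht = hivemind.DHT(start=True)

opt = hivemind.Optimizer(
    dht=dht,
    run_id="sd_finetune",                     # placeholder experiment name
    optimizer=torch.optim.AdamW(model.parameters(), lr=1e-5),
    batch_size_per_step=1,
    target_batch_size=256,
    matchmaking_time=15.0,    # wait longer for peers to form an averaging group
    averaging_timeout=120.0,  # allow slower all-reduce rounds before giving up
    verbose=True,
)
```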