Closed chavinlo closed 1 year ago
Hi @chavinlo!
If possible, please share the code and the commands you use to start it on different machines, so we can look into the issue.
@borzunov
Here's a gist of the training code I'm using, with hivemind included of course. I run it on a 3090 (24 GB).
finetuner: https://gist.github.com/chavinlo/335266a3a6825ffafbec191e7d0e35bd
requirements: https://gist.github.com/chavinlo/b7ea0a79cfeb0c59c57105c82c8b2f3e
command: torchrun --nproc_per_node=1 finetune.py --model wdiffuser/ --run_name testrun --datasetserver="152.70.212.127:8080" --wantedimages=300 --resize="True" --gradient_checkpointing="True"
- nproc_per_node: number of GPUs to run on
- model: path to the model folder
- run_name: name of the run (not related to hivemind)
- datasetserver: address of the dataset server (the one provided is active)
- wantedimages: number of images to fetch each run
- resize: whether to resize the images (otherwise VRAM usage balloons)
- gradient_checkpointing: save VRAM
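For reference, here is a minimal argparse sketch of how the finetuner might read these flags. The flag names come from the command above; the defaults and the str2bool helper are assumptions, not necessarily what the gist does:

```python
import argparse

def build_parser():
    # Flags mirror the torchrun command above; defaults are assumptions.
    parser = argparse.ArgumentParser(description="finetuner flags (sketch)")
    parser.add_argument("--model", required=True, help="path to the diffusers-format model folder")
    parser.add_argument("--run_name", required=True, help="name of the run (not related to hivemind)")
    parser.add_argument("--datasetserver", help="host:port of the dataset server")
    parser.add_argument("--wantedimages", type=int, default=300, help="images to fetch each run")
    # The command passes booleans as quoted strings ("True"), so parse them explicitly.
    str2bool = lambda s: s.lower() in ("true", "1", "yes")
    parser.add_argument("--resize", type=str2bool, default=False)
    parser.add_argument("--gradient_checkpointing", type=str2bool, default=False)
    return parser

# Parse the same arguments as the torchrun invocation above:
args = build_parser().parse_args([
    "--model", "wdiffuser/", "--run_name", "testrun",
    "--datasetserver", "152.70.212.127:8080",
    "--wantedimages", "300", "--resize", "True",
    "--gradient_checkpointing", "True",
])
print(args.wantedimages, args.resize)
```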
So pretty much all you need is the model. The one I use is Waifu Diffusion, direct link: https://huggingface.co/hakurei/waifu-diffusion-v1-3/resolve/main/wd-v1-3-full-opt.ckpt To run it you also need the model in diffusers format; here's Hugging Face's script to convert it: https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py The conversion output folder is the one you pass as "model" to the finetuner.
So just: download the script, install the requirements, download the model, convert, train. The first node will print its IP and DHT addresses. The second node should then be started with the same command as above, plus '--peers="one dht address here"'. This just gets passed to init_peers.
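A minimal sketch of how the --peers value could be turned into the initial_peers list that hivemind expects. The helper name, the example multiaddr, and the commented-out DHT call are illustrative assumptions, not the gist's actual code:

```python
def parse_peers(peers_flag: str) -> list:
    """Split a comma-separated --peers value into a list of DHT multiaddrs."""
    return [p.strip() for p in peers_flag.split(",") if p.strip()]

# Example multiaddr shape printed by the first node (illustrative only):
initial_peers = parse_peers("/ip4/1.2.3.4/tcp/31337/p2p/QmExamplePeerID")

# These would then be passed to the DHT on the second node, e.g.
# (requires hivemind, not run here):
# dht = hivemind.DHT(initial_peers=initial_peers, start=True)
print(initial_peers)
```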
Another issue I have is a memory leak (VRAM climbs from 15 GB to 22 GB), which I assume comes from the hivemind optimizer, because it does not happen with the non-hivemind trainer. I usually clean it up by running killall python and killall python3.
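To pin down where the growth happens, it may help to log VRAM at each training step. A small sketch; the sampler argument is whatever returns current usage (in a real run you would pass torch.cuda.memory_allocated), here replaced by a fake counter:

```python
class MemoryTracker:
    """Record memory samples over training steps and report the growth."""

    def __init__(self, sampler):
        self.sampler = sampler  # callable returning current memory in bytes
        self.samples = []

    def snapshot(self):
        self.samples.append(self.sampler())

    def growth(self):
        """Bytes gained between the first and last snapshot."""
        return self.samples[-1] - self.samples[0] if len(self.samples) > 1 else 0

# Fake sampler standing in for torch.cuda.memory_allocated,
# mimicking the 15 GB -> 22 GB climb described above:
usage = iter([15 * 2**30, 18 * 2**30, 22 * 2**30])
tracker = MemoryTracker(lambda: next(usage))
for _ in range(3):
    tracker.snapshot()  # in a real loop: one snapshot per optimizer step
print(tracker.growth() / 2**30, "GiB gained")
```

Logging a snapshot per step (and per hivemind averaging round) would show whether the jump lines up with synchronization.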
This script requires an online dataset server; I have one active at this moment at this IP: 152.70.212.127:8080
It basically downloads a small chunk of the dataset to train on. Nothing explicit, just Hololive artwork from Danbooru; you can download the dataset here if you want to check: https://buck-ani.s3.filebase.com/hololive_general.zip
If you want to run a server on your own dataset, here's the script:
https://gist.github.com/chavinlo/0ee8e4556e9dc45934add12942eb9f53
python3 server_code.py --dataset="hololive_general"
"hololive_general" being the path to the dataset folder, images and text files with pairs in filenames (ex.: data/101.jpg ; data/101.txt) and so on.
If you need more info, just let me know. I am dep#2171; I think you are also on the hivemind Discord, and there's already a thread for this (stable diffusion finetuning) there.
After multiple tests with the older script it seems it could be related to the timeout. I might try again later.
So I wrapped the optimizer with hivemind as stated in the docs. Everything goes well until it starts synchronizing. After 5 minutes it throws the following error:
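For context, the wrapping follows the pattern from the hivemind docs, roughly like this (a sketch: the run_id, batch sizes, and timeout values are placeholders, not my exact configuration, and `model` is defined elsewhere):

```python
import torch
import hivemind

# First node starts its own DHT; later nodes pass initial_peers=[...] instead.
dht = hivemind.DHT(start=True)

base_opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
opt = hivemind.Optimizer(
    dht=dht,
    run_id="testrun",          # all peers in a run must share this id
    optimizer=base_opt,
    batch_size_per_step=1,     # samples each peer processes per step
    target_batch_size=128,     # global batch size before an averaging round
    matchmaking_time=3.0,      # how long to look for peers to average with
    averaging_timeout=10.0,    # give up on a slow averaging round after this
    verbose=True,
)
```

The two timeout-related knobs (matchmaking_time, averaging_timeout) are the ones I'd suspect, given that the failure happens during synchronization.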
Is there anything that can be done? What could be the cause of this? I can attach the whole code if needed.