gojasper / flash-diffusion

Official implementation of ⚡ Flash Diffusion ⚡: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation
https://gojasper.github.io/flash-diffusion-project/

Training fails with multiple GPUs #18

Closed stone002 closed 1 month ago

stone002 commented 2 months ago

Hi, I ran into this problem when trying to train on 2 A100s.

I used train_flash_sdxl.py; my Trainer params are: [image]

and the flash_sdxl.yaml is: [image]

My code gets stuck here every time: [image]

If I use only one A100 it works fine, but it is very slow.

LeonNerd commented 1 month ago

Hi, I have a question: when training SD3 Flash, how should I create the data? Is the image entry in the JSON the name of the image file? It is also strange that no model was saved during my training.

stone002 commented 1 month ago

> Hi, I have a question: when training SD3 Flash, how should I create the data? Is the image entry in the JSON the name of the image file? It is also strange that no model was saved during my training.

I only tested SDXL. Flash Diffusion uses webdataset for training. I created my test dataset in that format: one image plus a JSON file with the same name, where the JSON looks like `{ "jpg": "img_name.jpg", "json": { "caption": "caption", "aesthetic_score": 8.0 } }`, then packaged all the images and JSONs into a .tar. It works fine. About the model saving, check MAX_EPOCHS and CKPT_EVERY_N_STEPS in your .yaml file, or add save_last=True to the Trainer callbacks. Hope it helps!
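The pairing-and-packaging step described above can be sketched with the standard library alone. This is only an illustration of the format from this thread; the helper name, file names, and captions are made up:

```python
import io
import json
import tarfile

def pack_webdataset_shard(samples, shard_path):
    """Pack (name, jpg_bytes, caption) triples into one webdataset-style .tar.

    Each sample becomes two tar members sharing a basename: <name>.jpg
    (the image bytes) and <name>.json (the metadata), matching the JSON
    layout described in this thread.
    """
    with tarfile.open(shard_path, "w") as tar:
        for name, jpg_bytes, caption in samples:
            meta = {
                "jpg": name + ".jpg",
                "json": {"caption": caption, "aesthetic_score": 8.0},
            }
            for member_name, payload in (
                (name + ".jpg", jpg_bytes),
                (name + ".json", json.dumps(meta).encode("utf-8")),
            ):
                info = tarfile.TarInfo(member_name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

if __name__ == "__main__":
    # Dummy bytes stand in for a real JPEG here.
    pack_webdataset_shard([("img_000", b"\xff\xd8fake", "a test caption")],
                          "shard-000000.tar")
    with tarfile.open("shard-000000.tar") as tar:
        print(sorted(tar.getnames()))  # → ['img_000.jpg', 'img_000.json']
```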

stone002 commented 1 month ago

Finally I found the cause: my test dataset was too small. I had created only one .tar for testing. After gathering more data and making more .tar files, it works fine on multiple GPUs.
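This fix is consistent with how webdataset distributes work: shards are split across dataloader workers and ranks, so a multi-GPU run typically needs at least as many .tar files as there are workers in total. A minimal sketch of round-robin sharding (the helper name and shard prefix are made up):

```python
import io
import json
import tarfile

def write_shards(samples, num_shards, prefix="flash-train"):
    """Split (name, jpg_bytes, meta_dict) samples round-robin over shards.

    Writes <prefix>-000000.tar .. so a brace pattern like
    '<prefix>-{000000..00000N}.tar' can address them; with several shards,
    each GPU/worker has something to read.
    """
    shards = [tarfile.open(f"{prefix}-{i:06d}.tar", "w")
              for i in range(num_shards)]
    try:
        for idx, (name, jpg_bytes, meta) in enumerate(samples):
            tar = shards[idx % num_shards]
            for member, payload in (
                (name + ".jpg", jpg_bytes),
                (name + ".json", json.dumps(meta).encode("utf-8")),
            ):
                info = tarfile.TarInfo(member)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    finally:
        for tar in shards:
            tar.close()
    return [f"{prefix}-{i:06d}.tar" for i in range(num_shards)]
```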

LeonNerd commented 1 month ago

Thanks for your reply. I tried again, but there is still no change. My dataset is one image plus a JSON file with the same name.

stone002 commented 1 month ago

> Thanks for your reply. I tried again, but there is still no change. My dataset is one image plus a JSON file with the same name.

I guess more data may be needed. My test .tar contains 100 images and 100 JSON files, and it works fine on a single GPU.

LeonNerd commented 1 month ago

Hi, it's very strange to me and I can't figure out why. My JSON file is as follows: { "jpg": "people_flux_00209.png", "json": { "caption": "A person within an Africa-shaped border with a tattooed arm and sleeveless top.", "aesthetic_score": 8 } } and I packaged the images with the same names into a .tar file as a training set of 3000 images.

Dataset

SHARDS_PATH_OR_URLS: [image] [image]
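One thing worth checking in a setup like this: webdataset builds each sample dict from the file extensions inside the tar, so if the files are .png (as the "jpg": "people_flux_00209.png" entry suggests) while the training code reads the "jpg" key, samples can load without ever yielding images. A small diagnostic sketch — the helper and the expected extension are assumptions, not part of the repo:

```python
import collections
import io
import tarfile

IMAGE_EXTS = {"jpg", "jpeg", "png", "webp"}

def check_shard(shard_path, expected_image_ext="jpg"):
    """Group tar members by basename and report pairing problems."""
    groups = collections.defaultdict(set)
    with tarfile.open(shard_path) as tar:
        for member in tar.getmembers():
            base, _, ext = member.name.rpartition(".")
            groups[base].add(ext.lower())
    problems = []
    for base, exts in groups.items():
        if "json" not in exts:
            problems.append(f"{base}: missing .json")
        image_exts = exts & IMAGE_EXTS
        if not image_exts:
            problems.append(f"{base}: no image file")
        elif expected_image_ext not in image_exts:
            problems.append(f"{base}: image is .{image_exts.pop()}, "
                            f"but the loader expects .{expected_image_ext}")
    return problems

if __name__ == "__main__":
    # Build a tiny shard with a .png/.json pair and check it.
    with tarfile.open("check-demo.tar", "w") as tar:
        for name in ("people_0001.png", "people_0001.json"):
            payload = b"{}"
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    for problem in check_shard("check-demo.tar"):
        print(problem)
```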

stone002 commented 1 month ago

> Hi, it's very strange to me and I can't figure out why. My JSON file is as follows: { "jpg": "people_flux_00209.png", "json": { "caption": "A person within an Africa-shaped border with a tattooed arm and sleeveless top.", "aesthetic_score": 8 } } and I packaged the images with the same names into a .tar file as a training set of 3000 images.
>
> Dataset
>
> SHARDS_PATH_OR_URLS:
>
> • pipe:cat /data/sd3/flash_sd3-{000000..000000}.tar
>
> But no loss is shown during training, and no model is saved. My code is as follows: [image] [image]

I used SDXL for testing. I modified these in my code: [image] [image]

My test dataset is 100 [image, json] pairs. These values work for me, but the inference results are not very good; I guess both my data and my training epochs are insufficient. You can try setting CKPT_EVERY_N_STEPS smaller to check whether a .ckpt is saved during training. Hope it helps!
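For the checkpointing side, a .yaml fragment as a starting point — the key names MAX_EPOCHS and CKPT_EVERY_N_STEPS are the ones mentioned in this thread; the values are placeholders to experiment with, not recommended settings:

```yaml
# Illustrative values only; key names follow this thread's flash_sdxl.yaml.
MAX_EPOCHS: 10
CKPT_EVERY_N_STEPS: 50  # lower this so a .ckpt shows up early in training
```

If no checkpoint appears even with a small CKPT_EVERY_N_STEPS, passing save_last=True to the Lightning ModelCheckpoint callback guarantees that a final checkpoint is written when training ends.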