tgxs002 / align_sd

Better Aligning Text-to-Image Models with Human Preference. ICCV 2023
https://tgxs002.github.io/align_sd_web/
Apache License 2.0

About the training data and time cost of lora fine-tuning #12

Open cjfcsjt opened 1 year ago

cjfcsjt commented 1 year ago

Thanks for your great work! I wonder how much time it costs to train the LoRA on the dataset (DiffusionDB 1M + a subset of LAION-5B) using, for example, 4 GPUs?

tgxs002 commented 1 year ago

It takes a few hours to train for 10k iterations with a batch size of 40. The exact time depends on the type of GPU.
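The figures above imply a fixed total of 400k training samples, so wall-clock time scales inversely with aggregate throughput. A quick back-of-envelope sketch (the images/sec value is purely a hypothetical assumption, not a measured number):

```python
# Back-of-envelope training-time estimate from iterations and batch size.
# Throughput (images/sec) is an assumed placeholder; it varies by GPU type.
def estimate_hours(iterations: int, batch_size: int, images_per_sec: float) -> float:
    total_images = iterations * batch_size  # 10k iters * 40 = 400k samples
    return total_images / images_per_sec / 3600

# e.g. assuming ~20 images/sec aggregate across 4 GPUs (hypothetical):
print(round(estimate_hours(10_000, 40, 20), 1))  # ~5.6 hours
```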

cjfcsjt commented 1 year ago

Thanks for your reply!! I have another question about the training data. According to the paper, you use {20,205 prompts, 79,167 images} to train the classifier, while the LoRA is fine-tuned on 37,572 preferred generated text-image pairs + 21,108 non-preferred text-image pairs (from DiffusionDB 1M) + 200,231 regularization text-image pairs (from a 625k subset of LAION-5B). Is my understanding correct? I also wonder whether it is possible to fine-tune the LoRA directly on the {20,205 prompts, 79,167 images} human preference dataset?

tgxs002 commented 1 year ago

Yes, the statistics are correct. You can try tuning the LoRA on HPD, but I feel the effect may be limited by the dataset size. Dataset noise may also be an issue.
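For reference, the fine-tuning data confirmed above can be tallied as follows (the split names are illustrative; the counts are taken verbatim from the thread):

```python
# Tally of the LoRA fine-tuning data discussed in this thread.
# Split names are illustrative labels, not identifiers from the codebase.
finetune_data = {
    "preferred_pairs": 37_572,        # preferred generations (DiffusionDB 1M)
    "non_preferred_pairs": 21_108,    # non-preferred generations (DiffusionDB 1M)
    "regularization_pairs": 200_231,  # regularization pairs (LAION-5B subset)
}
print(sum(finetune_data.values()))  # 258911 total text-image pairs
```

Compared with this ~259k-pair mixture, the 79,167-image HPD set is roughly a third of the size, which is the dataset-size concern raised above.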