lyc0929 / OOTDiffusion-train


Any estimate when training code will be ready? #1

Open i-amgeek opened 6 months ago

i-amgeek commented 6 months ago

Hi @lyc0929, exciting initiative from you. Just wanted to know if you have a rough estimate of when this code will be ready for training?

BTW, I can also help you with AWS GPUs if you need them to fast-track your work.

lyc0929 commented 6 months ago

Thank you for your generous offer; if needed in the future, I will contact you. At present, training 144,000 iterations on the VITON-HD dataset takes about 200 hours with a single A100 and a batch size of 16. The author used a single A100 with a batch size of 64, which I estimate would take only about 30 hours. On the one hand, I am optimizing my code to save GPU memory and increase the batch size; on the other hand, I am debugging distributed training to use more A100s.
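For reference, the arithmetic behind these estimates (using only the numbers quoted above) works out as follows:

```python
# Scaling arithmetic for the training-time estimates above.
iters_bs16 = 144_000
hours_bs16 = 200
sec_per_iter_bs16 = hours_bs16 * 3600 / iters_bs16  # 5.0 s per iteration

# Covering the same number of samples at batch size 64 needs 1/4 the iterations.
total_samples = iters_bs16 * 16
iters_bs64 = total_samples // 64  # 36,000 iterations

# The quoted ~30 h estimate at batch size 64 would mean ~3 s per
# (4x larger) iteration, i.e. much better GPU utilization per sample
# at the larger batch size.
sec_per_iter_bs64 = 30 * 3600 / iters_bs64
print(sec_per_iter_bs16, iters_bs64, sec_per_iter_bs64)  # -> 5.0 36000 3.0
```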

DevelMayCry-MrChen commented 6 months ago

`F.mse_loss` is NaN, why?

lyc0929 commented 6 months ago

> `F.mse_loss` is NaN, why?

Maybe you should check your model inputs.
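One way to find where the NaN first enters is a forward-hook sweep over the model. This is a minimal sketch, assuming PyTorch; the toy model and the helper name are illustrative, not part of OOTDiffusion:

```python
import torch
import torch.nn as nn

def find_first_nan(model: nn.Module, *inputs):
    """Run one forward pass and return the name of the first
    submodule whose output contains NaN/Inf, or None if clean."""
    offender = []

    def make_hook(name):
        def hook(mod, inp, out):
            outs = out if isinstance(out, tuple) else (out,)
            for o in outs:
                if torch.is_tensor(o) and not torch.isfinite(o).all():
                    if not offender:      # record only the first hit
                        offender.append(name)
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if n]
    try:
        model(*inputs)
    finally:
        for h in handles:
            h.remove()
    return offender[0] if offender else None

# Toy example: poison the second layer's weights with NaN.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
with torch.no_grad():
    model[1].weight.fill_(float("nan"))
print(find_first_nan(model, torch.randn(2, 4)))  # -> 1
```

If the inputs themselves are clean, `torch.autograd.set_detect_anomaly(True)` can additionally pinpoint the backward op that produced the NaN.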


DevelMayCry-MrChen commented 6 months ago

> `F.mse_loss` is NaN, why?
>
> Maybe you should check your model inputs.

The NaN appears in `emb = self.time_embedding(t_emb, timestep_cond)`. Why?

i-amgeek commented 6 months ago

OK @lyc0929. Once you improve the memory utilization and distributed training, we can train it on a cluster of 8 A100s.

I also have some ideas for preparing a large dataset of ~1M images.
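For a single-node 8-GPU run like this, PyTorch's built-in `torchrun` launcher is the usual route. A hedged sketch; the script name `train.py` and its flags are placeholders, not from this repo:

```shell
# Launch one process per GPU on a single node with 8 A100s.
torchrun --nnodes=1 --nproc_per_node=8 train.py \
    --dataset viton-hd --batch_size 8   # per-GPU batch -> global batch 64
```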