Open i-amgeek opened 6 months ago
Thank you for your generous help. If needed in the future, I will contact you. At present, training 144,000 iterations on the VITON-HD dataset takes about 200 hours with a single A100 and a batch size of 16. The author used a single A100 with a batch size of 64, which I estimate would take only about 30 hours. On one hand, I am optimizing my code to save GPU memory and increase the batch size; on the other hand, I am debugging distributed training so I can use more A100s.
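One common way to close the gap between a batch size of 16 and the author's 64 without more GPU memory is gradient accumulation. The snippet below is a minimal, self-contained sketch of the idea; the `nn.Linear` model, batch sizes, and learning rate are placeholders, not the actual training code from this repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical tiny model standing in for the real try-on network.
model = nn.Linear(8, 8)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

micro_batch = 16   # what fits in memory on one GPU in this sketch
accum_steps = 4    # 16 * 4 = 64, the author's effective batch size

opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(micro_batch, 8)
    target = torch.randn(micro_batch, 8)
    # Divide by accum_steps so the accumulated gradient equals
    # the mean gradient over the full effective batch of 64.
    loss = F.mse_loss(model(x), target) / accum_steps
    loss.backward()  # gradients accumulate across micro-batches
opt.step()  # one optimizer update per effective batch
```

This trades extra forward/backward passes for memory, so it will not match the wall-clock speedup of a true batch of 64, but it reproduces the same effective batch statistics while distributed training is being debugged.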
`F.mse_loss` is NaN. Why?

Maybe you should check your model inputs.

The output of `emb = self.time_embedding(t_emb, timestep_cond)` already contains NaN. Why?
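Since the model code itself is not shown in this thread, the sketch below only illustrates a generic way to localize the first NaN: a small checker applied to intermediate tensors. The `emb` tensor here is a stand-in for the output of `self.time_embedding(t_emb, timestep_cond)`, with a NaN injected by hand to simulate the reported symptom.

```python
import torch
import torch.nn.functional as F

def has_nan(name, t):
    """Return True (and print a short report) if the tensor contains NaN or Inf."""
    bad = bool(torch.isnan(t).any() or torch.isinf(t).any())
    if bad:
        print(f"{name}: contains NaN/Inf")
    return bad

# Simulate the reported symptom: one NaN in the timestep embedding.
emb = torch.randn(4, 32)
emb[0, 0] = float("nan")

has_nan("emb", emb)            # True: the embedding is already corrupted
pred = emb * 2.0               # NaN propagates through any later layer
loss = F.mse_loss(pred, torch.zeros_like(pred))
has_nan("loss", loss)          # True: hence F.mse_loss reports NaN
```

Sprinkling `has_nan(...)` calls (or running the backward pass under `torch.autograd.set_detect_anomaly(True)`) narrows down whether the NaN originates in `t_emb` itself, in the embedding layer's weights, or earlier. If mixed precision or a high learning rate is in use, those are common suspects for NaNs appearing in early layers, though that is only a guess without seeing the training setup.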
Ok @lyc0929. Once you improve the memory utilization and distributed training, we can train it on a cluster of 8 A100s.
I also have some ideas for preparing a large dataset of ~1M images.
Hi @lyc0929, exciting initiative from you. I just wanted to ask whether you have a rough estimate of when this code will be ready for training.
BTW, I can also help you with AWS GPUs if you need them to fast-track your work.