nv-tlabs / LION

Latent Point Diffusion Models for 3D Shape Generation

VAE Training Time #42

Closed · yufeng9819 closed 1 year ago

yufeng9819 commented 1 year ago

Hey, thanks for your great work @ZENGXH. I would like to ask how long it takes to train the VAE on all categories. I have been training the VAE on all categories on 8 V100 16GB GPUs for 15 days with batch size 12, but only about 4000 epochs have completed.

2023-04-20 18:52:20.615 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E4112 iter[371/372] | [Loss] 8847.80 | [exp] ../exp/0405/all/1c389bh_hvae_lion_B12 | [step] 1530035 | [url] none | [time] 5.0m (~325h) |[best] 199 0.001x1e-2

Is there any way to accelerate the training process? (For example, by increasing the batch size?)

Another problem is that it is hard for me to judge whether the VAE is well trained (I think visual inspection alone is not a comprehensive way to assess the VAE training). Since the training takes so much time, it is important to be able to confirm that the model is training properly; the kind of quantitative check I have in mind is sketched below.
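For concreteness, one such check is tracking the average reconstruction Chamfer distance on a held-out set over training. A minimal PyTorch sketch (`vae.reconstruct` is a hypothetical stand-in for the model's encode/decode call, not the actual LION API, and the loader is assumed to yield `(B, N, 3)` point clouds):

```python
import torch

def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric squared Chamfer distance between point clouds a (N, 3) and b (M, 3)."""
    d = torch.cdist(a, b) ** 2                # pairwise squared distances, (N, M)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

@torch.no_grad()
def mean_recon_chamfer(vae, val_loader, device="cuda"):
    """Average reconstruction Chamfer distance over a held-out set."""
    vae.eval()
    scores = []
    for pts in val_loader:                    # pts: (B, N, 3) point clouds (assumed)
        pts = pts.to(device)
        recon = vae.reconstruct(pts)          # hypothetical reconstruction call
        for x, y in zip(pts, recon):
            scores.append(chamfer_distance(x, y).item())
    return sum(scores) / len(scores)
```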

ZENGXH commented 1 year ago

Hi, I think 15 days is probably enough (I only trained for 7 days with 4 A100s and stopped early due to the paper deadline). For the 55-class setting, you don't need to run the same number of epochs as for the single-class data, since there is much more data.

In terms of acceleration, yes, increasing the batch size should help, especially for the diffusion model training.
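If the per-GPU batch size cannot be raised further on the 16GB cards, gradient accumulation is one generic way to get a larger effective batch per optimizer step without extra memory (it does not increase throughput, only the effective batch). A minimal PyTorch sketch, not the LION trainer itself; all names below are placeholders:

```python
import torch
from torch import nn

# Generic gradient-accumulation sketch: simulates a larger effective batch
# size (accum_steps * micro_batch) without needing more GPU memory.
model = nn.Linear(3, 3)                                  # stand-in for the VAE
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_loader = [torch.randn(12, 3) for _ in range(8)]    # dummy micro-batches of size 12
accum_steps = 4                                          # effective batch = 12 * 4 = 48

optimizer.zero_grad()
for i, batch in enumerate(train_loader):
    loss = model(batch).pow(2).mean()                    # placeholder loss
    (loss / accum_steps).backward()                      # scale so gradients average over the effective batch
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```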

One thing you can try is to inspect the loss curves and see whether they have reached the flat region (converged stage). For the diffusion model training, you can evaluate the 1-NNA metric to see whether it is fully converged or not.
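For reference, 1-NNA (1-nearest-neighbour accuracy) measures how distinguishable the generated samples are from reference samples; values close to 0.5 indicate the two sets are indistinguishable, while values near 1.0 mean the model is still far from converged. A minimal sketch of the metric using Chamfer distance (an illustration only, not the repo's evaluation code):

```python
import torch

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric squared Chamfer distance between point clouds a (N, 3) and b (M, 3)."""
    d = torch.cdist(a, b) ** 2
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

@torch.no_grad()
def one_nna(gen, ref):
    """1-NNA between two lists of point clouds (each a tensor of shape (N, 3))."""
    clouds = gen + ref
    labels = [0] * len(gen) + [1] * len(ref)     # 0 = generated, 1 = reference
    n = len(clouds)
    dist = torch.full((n, n), float("inf"))      # inf on the diagonal excludes self-matches
    for i in range(n):
        for j in range(i + 1, n):
            d = chamfer(clouds[i], clouds[j])
            dist[i, j] = dist[j, i] = d
    nn_idx = dist.argmin(dim=1)                  # leave-one-out nearest neighbour
    correct = sum(labels[i] == labels[nn_idx[i].item()] for i in range(n))
    return correct / n
```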

For reference, these are my reconstruction results when I stopped my VAE training (valrecont_step_889600), together with my training curves: [reconstruction screenshot and loss-curve images attached]

yufeng9819 commented 1 year ago

Great! Thanks for your response.