nv-tlabs / LION

Latent Point Diffusion Models for 3D Shape Generation

Determine vae model convergence #18

Open Anonnoname opened 1 year ago

Anonnoname commented 1 year ago

Hello! I'd like to ask how I can determine if my VAE model has converged. Which metrics or loss should I look at? When I'm training on the car dataset, as the KL weights increase, the latent points become more noisy, leading to a decrease in reconstruction quality. Is it possible that if I keep training the model, the reconstruction quality will continue to get worse? If so, how can I know when to stop training?

I used the default config, with trainer.epochs set to 800. [image: step 25480]

Anonnoname commented 1 year ago

Additionally, what is the method for determining if the diffusion model has converged or not? I noticed that the loss ceased to decrease in the early epochs, but the overall quality of the samples has continued to improve over time.

fradino commented 1 year ago

Hello, have you ever encountered a situation where the loss becomes NaN when training the VAE?

ZENGXH commented 1 year ago

@Anonnoname For VAE training, the model usually converges once the KL annealing stops. The criterion for a good VAE is that it achieves reasonably good reconstruction performance while the latent points look (slightly) smoother than the input points. In my experiments, the latent points look like this at iteration 144400: [image]

I feel like your reconstruction is worse than expected, and the latent points are over-smoothed. This is usually caused by a KL loss weight that is too high. Are you using the default config?
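
As a rough illustration of the criterion above, the sketch below models a KL warm-up and only starts judging reconstruction error once the KL weight has stopped changing. The function names, the linear schedule, and all numeric defaults are assumptions made for illustration; they are not taken from the LION code or its default config. The reconstruction error could be any validation metric where lower is better.

```python
def kl_weight(step, anneal_steps, w_min=1e-7, w_max=0.5):
    """Hypothetical linear KL warm-up.

    Returns w_min at step 0 and w_max once step >= anneal_steps.
    """
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return w_min + frac * (w_max - w_min)


def has_plateaued(recon_errors, window=5, rel_tol=0.01):
    """True if the mean of the last `window` errors improved by less than
    `rel_tol` (relative) over the mean of the `window` errors before them.
    Lower error is assumed to be better."""
    if len(recon_errors) < 2 * window:
        return False
    prev = sum(recon_errors[-2 * window:-window]) / window
    last = sum(recon_errors[-window:]) / window
    return (prev - last) <= rel_tol * abs(prev)


def vae_converged(step, anneal_steps, recon_errors_after_anneal):
    """Only judge convergence once the KL weight is constant, because the
    reconstruction error keeps shifting while the weight is still rising."""
    if step < anneal_steps:
        return False
    return has_plateaued(recon_errors_after_anneal)
```

A plateau check like this only says that the reconstruction error has stopped improving; whether the latent points look reasonable still has to be checked visually, as described above.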

ZENGXH commented 1 year ago

For the diffusion model, the loss tends to have high variance, so it's hard to judge convergence from the loss alone. I usually 1) evaluate the checkpoint every 1000 epochs and decide based on the evaluation metric, and 2) visualize the results. In my experience, LION usually converges at around 10k iterations.
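
The checkpoint-based approach above can be sketched as follows. This is a minimal example under stated assumptions: `eval_results` is a hypothetical dict that maps a checkpoint's epoch or iteration to an evaluation metric where lower is better, and the `patience` rule is an illustrative stopping heuristic, not something the LION repository provides.

```python
def select_best_checkpoint(eval_results):
    """Return (checkpoint_id, metric) for the lowest metric value.

    eval_results: dict mapping checkpoint id -> metric (lower is better).
    """
    best = min(eval_results, key=eval_results.get)
    return best, eval_results[best]


def should_stop(eval_results, patience=3):
    """True if the best checkpoint is at least `patience` evaluations old,
    i.e. the metric has not improved in the last `patience` evaluations."""
    ids = sorted(eval_results)
    best, _ = select_best_checkpoint(eval_results)
    evals_since_best = len(ids) - 1 - ids.index(best)
    return evals_since_best >= patience


# Made-up metric values, purely for illustration.
results = {1000: 0.90, 2000: 0.70, 3000: 0.60, 4000: 0.65, 5000: 0.62, 6000: 0.61}
best_id, best_metric = select_best_checkpoint(results)
```

Because the decision is based on the evaluation metric and not on the training loss, a noisy loss curve does not affect it. Visual inspection of samples remains a useful second check.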

ZENGXH commented 1 year ago

@fradino For the NaN issue, could you open a separate issue and post your log and config so that I can help with that?