nv-tlabs / LION

Latent Point Diffusion Models for 3D Shape Generation

Increasing loss #17

Open fradino opened 1 year ago

fradino commented 1 year ago

Hello, I tried to train the VAE following the documented steps [screenshot], but the loss keeps increasing [loss curve screenshot].

ZENGXH commented 1 year ago

Hi, this is expected, since we 1) increase the KL loss weight from 1e-7 to 0.5 throughout the training, i.e., the magnitude of the KL loss term keeps growing, and 2) initialize the VAE as an identity mapping, i.e., it has near-perfect reconstruction in the early iterations. As the KL weight increases, you will see the reconstruction loss get higher as well. As a result, the loss curve will keep increasing until the KL weight reaches 0.5. This is the loss curve for my experiment trained on car with the default hyper-parameters: [loss curve]
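To make the effect concrete, here is a minimal sketch of a KL-weight anneal from 1e-7 to 0.5. The function name, schedule shape (log-linear), and step counts are illustrative assumptions, not the exact schedule used in LION:

```python
import math

def kl_weight(step, total_steps, w_min=1e-7, w_max=0.5):
    """Log-linear anneal of the KL weight from w_min to w_max.

    Hypothetical schedule for illustration; the point is that while this
    weight is still ramping up, the total (weighted) loss can climb even
    though the model is not getting worse.
    """
    t = min(step / total_steps, 1.0)
    return math.exp(math.log(w_min) + t * (math.log(w_max) - math.log(w_min)))

print(kl_weight(0, 10000))      # starts near 1e-7: KL term is almost free
print(kl_weight(10000, 10000))  # ends near 0.5: KL term carries full weight
```

With a schedule like this, the weighted KL term grows by more than six orders of magnitude over training, which is why the total loss curve rises until the weight plateaus.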

fradino commented 1 year ago

Thank you! That's helpful. Could you also show me the loss curve of training the diffusion prior?

fradino commented 1 year ago

Also, the loss becomes NaN after step 3332 [screenshot].

ZENGXH commented 1 year ago

This is my epoch loss: [loss curve]

Are you using the default config? Could you share some visualization (target, reconstruction, latent points) of the VAE training and some samples of the prior training?

fradino commented 1 year ago

I'm training the VAE with the default config, and I find [screenshot] that x_0_pred becomes inf during training.
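As a side note, one quick way to localize this kind of blow-up is to log when values first stop being finite. This is a hypothetical debugging helper, not part of the LION code; in a real PyTorch run you would check `torch.isfinite(x_0_pred).all()` after the forward pass instead:

```python
import math

def first_nonfinite(values):
    """Return the index of the first NaN/inf in a sequence of floats, else None.

    Illustrative helper for pinpointing the step at which a logged
    quantity (loss, prediction norm, ...) first diverges.
    """
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return None

per_step_loss = [2.1, 1.8, 1.9, float('inf'), float('nan')]
print(first_nonfinite(per_step_loss))  # 3
```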

ZENGXH commented 1 year ago

Thanks for sharing. This looks weird; I haven't seen it before. Could you try reducing the learning rate by half and see whether that fixes the issue?

fradino commented 1 year ago

> Thanks for sharing. This looks weird; I haven't seen it before. Could you try reducing the learning rate by half and see whether that fixes the issue?

The only change I made was reducing the batch size from 32 to 16. I will try halving the learning rate. @ZENGXH

yuanzhen2020 commented 1 year ago

As you mentioned, the weight of the KL loss increases as training progresses, and the reconstruction loss increases along with it. I have a question: how do you evaluate the performance of the trained VAE model, or is there a metric to track throughout training? Another question: do you have any advice on how to tune the training parameters? @ZENGXH

ZENGXH commented 1 year ago

@yuanzhen2020 I usually look at the reconstructed point cloud and the latent points. A well-trained VAE needs to 1) have smooth latent points, i.e., the points should be close to a Gaussian distribution, and 2) maintain good reconstruction (checked both by visualization and by the reconstruction EMD and CD metrics); we need to achieve a good trade-off between 1) and 2).
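For reference, the CD metric mentioned above can be sketched as a symmetric Chamfer distance between two point sets. This is a minimal O(n·m) version for illustration; real evaluation code would use a batched, accelerated implementation:

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (lists of 3-tuples).

    For each point in one set, take the squared distance to its nearest
    neighbor in the other set; average within each direction and sum.
    """
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    a_to_b = sum(min(sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a

pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(chamfer_distance(pts, pts))  # 0.0 for a perfect reconstruction
```

Tracking this between input and reconstructed point clouds over training gives a number to pair with the visual checks.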

In general VAE training, another thing that may be helpful is to track the un-weighted KL + reconstruction loss, i.e., the ELBO value. This value should decrease throughout training. I didn't track it, since in LION the KL value is much larger than the reconstruction loss: it would dominate the ELBO too much.
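The quantity described above (un-weighted KL plus reconstruction loss, i.e., the negative ELBO) can be written out as follows. This is a generic VAE sketch under a diagonal-Gaussian posterior assumption, not LION's actual loss code:

```python
import math

def diag_gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def neg_elbo(recon_loss, mu, logvar):
    """Un-weighted negative ELBO: reconstruction loss + KL with weight 1.

    Illustrative only; as noted above, in LION the KL term is far larger
    than the reconstruction term, so this sum would be KL-dominated.
    """
    return recon_loss + diag_gaussian_kl(mu, logvar)

# A standard-normal posterior (mu = 0, logvar = 0) contributes zero KL:
print(neg_elbo(0.3, [0.0, 0.0], [0.0, 0.0]))  # 0.3
```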

Ultimately, we care about sample quality. So the definitive way to verify whether a VAE is good enough is to train the prior on it and compare the sample metrics (but this is expensive).

In terms of training parameters, tuning the dropout ratio and the model size seems to make some difference in performance.