Open Anonnoname opened 1 year ago
Additionally, how can I determine whether the diffusion model has converged? I noticed that the loss stopped decreasing in the early epochs, but the overall quality of the samples has continued to improve over time.
Hello, have you ever encountered a situation where the loss becomes NaN when training the VAE?
@Anonnoname For the VAE training, it usually converges after the KL annealing stops. The criterion for a good VAE is that it achieves reasonably good reconstruction while the latent points look (slightly) smoother than the input points. In my experiment, the latent points look like this at iter 144400:
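For reference, KL annealing usually means ramping the KL weight from 0 up to its target value over some number of steps, after which the weight stays fixed; once the ramp ends, the loss scale stops shifting and convergence becomes meaningful to judge. A minimal sketch (the step count and target here are illustrative, not LION's actual config):

```python
def kl_weight(step, anneal_steps=50000, target=0.5):
    """Linearly anneal the KL weight from 0 to `target` over `anneal_steps`.

    After the ramp the weight stays constant; this is roughly the point
    where the VAE loss stabilizes and convergence can be assessed.
    """
    return target * min(step / anneal_steps, 1.0)
```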
I feel your reconstruction is worse than expected, and the latent points are over-smoothed. This is usually caused by a high KL loss weight. Are you using the default config?
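To make the effect of the KL weight concrete, here is a generic beta-weighted VAE objective for a diagonal-Gaussian posterior (a sketch only, not LION's implementation; a large `beta` pushes the posterior toward the prior, giving smoother latents at the cost of reconstruction quality):

```python
import numpy as np

def vae_loss(recon_err, mu, log_var, beta):
    """Generic beta-weighted VAE objective.

    recon_err: scalar reconstruction error (e.g. a Chamfer-style distance)
    mu, log_var: per-dimension posterior parameters
    beta: KL weight -- increasing it drives mu -> 0 and var -> 1,
          i.e. over-smoothed latents and worse reconstruction.
    """
    # KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dims
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon_err + beta * kl

# KL term vanishes when the posterior equals the standard-normal prior:
baseline = vae_loss(0.0, np.zeros(4), np.zeros(4), beta=1.0)  # -> 0.0
```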
For the diffusion model, the loss tends to have high variance, so it's hard to judge convergence from the loss alone. I usually 1) evaluate the checkpoint every 1000 epochs and judge from the evaluation metrics, and 2) visualize the results. In my experience, LION usually converges at around 10k iterations.
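Since the raw diffusion loss is noisy, one common way to see the trend between checkpoint evaluations is to smooth the per-iteration loss with an exponential moving average (a generic sketch, not part of the LION codebase):

```python
def ema_smooth(losses, alpha=0.01):
    """Exponentially smooth a noisy loss curve so the trend is visible.

    Smaller alpha = heavier smoothing. The first value seeds the average.
    """
    smoothed, running = [], None
    for x in losses:
        running = x if running is None else (1 - alpha) * running + alpha * x
        smoothed.append(running)
    return smoothed
```

Plotting the smoothed curve (e.g. with matplotlib) makes it much easier to tell a genuine plateau from per-batch variance.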
@fradino for the NaN issue, could you start another issue and post your log & config so that I can help with that?
Hello! I'd like to ask how I can determine if my VAE model has converged. Which metrics or loss should I look at? When I'm training on the car dataset, as the KL weights increase, the latent points become more noisy, leading to a decrease in reconstruction quality. Is it possible that if I keep training the model, the reconstruction quality will continue to get worse? If so, how can I know when to stop training?
I used the default config, with trainer.epochs set to 800, at step 25480.