nv-tlabs / LION

Latent Point Diffusion Models for 3D Shape Generation

VAE visualization problem #43

Closed yufeng9819 closed 1 year ago

yufeng9819 commented 1 year ago

Hey,

I visualized my VAE training results and found that they do not perform as well as the results you show in TensorBoard.

My visualization results are as follows:

image

The results above were obtained after training for 4000 epochs with train_prior_clip, batch_size 12, and lr 1e-4 (I did not change any other parameters) on 8 V100s.

To check whether the VAE model had been trained enough, I also visualized the training results at around 5000 epochs.

image

But the results did not get better.

I would like to know why I cannot get reasonable results after 20 days of training. (Do I need to train further? In general, that is already an unacceptably long training time.)

Looking forward to your reply. @ZENGXH

yufeng9819 commented 1 year ago

I also visualized best_eval.pth, and the results are as follows:

image

best_eval.pth seems to save the wrong information, since it directly visualizes the ground-truth points.

What is wrong with best_eval.pth?

ZENGXH commented 1 year ago

I don't think you need to train it for longer:

In general, the longer you train, the worse the reconstruction gets (as shown in the val EMD/CD curves), but the smoother the latent space becomes (i.e. the latent points get closer to N(0,1), which makes training the diffusion model easier). We need to find a good trade-off between the two. In the figure you show, the latent points are already very smooth, so I think the model could be stopped earlier.

yufeng9819 commented 1 year ago

best_eval.pth was saved at epoch 199.

During this training run, I did not manage to launch TensorBoard, because I found that in utils/utils.py, `USE_TFB = int(os.environ.get('USE_TFB', 0))` (line 27) is always False, so it fails to 'init TFB'.
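The gate quoted above is just an environment-variable check, so the flag can be flipped by exporting the variable before the training script starts. A minimal sketch of the check (only the line cited from utils/utils.py; the surrounding logging setup is not reproduced here):

```python
import os

# The gate quoted from utils/utils.py (line 27): TensorBoard ("TFB") logging
# is only initialized when USE_TFB parses to a non-zero integer, and the
# environment variable defaults to 0, i.e. disabled.
USE_TFB = int(os.environ.get('USE_TFB', 0))
print('TFB enabled:', bool(USE_TFB))
```

So launching with the variable set, e.g. `USE_TFB=1 bash script/train_vae_all.sh`, should make the flag truthy and allow 'init TFB' to run.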

Following your training settings, I only have checkpoints at epochs 1999 and 3999, plus best_eval.pth and the snapshots. I also visualized the checkpoint at epoch 1999; the results are as follows: image

The results also seem to be bad.

Another problem: trainer.eval_sample requires loading refval%s.pt (in fact, it loads ref_val_all.pt for all categories of the PointFlow data). But in README.md, I can only find PF2_val_all.pt. Is that the right reference file for evaluating all categories of the PointFlow dataset? @ZENGXH

yufeng9819 commented 1 year ago

I found that in the training log, the model computes CD and EMD every 200 epochs, so I manually plotted these results: CD EMD. CD and EMD increase over almost the entire training process.

yufeng9819 commented 1 year ago

Furthermore, I would like to understand the advantage of LION.

To my knowledge, one advantage of latent diffusion models is that they speed up the training process. However, LION consists of two latent codes whose dimensions are 128 and 8192 (= 2048*4, almost the same as the point dimension), which means LION cannot speed up the overall training process. Is that a correct understanding? If so, can you explain why you designed LION this way (to improve the quality of the generated results, or for some other purpose)?
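To make the size comparison concrete, here is the arithmetic behind the numbers above (dimensions are quoted from this thread; this is an illustration, not LION's actual configuration code):

```python
# Per-shape sizes discussed above (illustration only; the dims 128 and
# 2048*4 are quoted from the thread, not read from LION's config files).
n_points = 2048                   # points per input cloud
input_size = n_points * 3         # xyz coordinates per cloud
global_latent_size = 128          # 1D shape latent
point_latent_size = n_points * 4  # per-point latents = 8192 values

# The point-level latent alone is larger than the raw input cloud, so the
# diffusion model here does not operate in a smaller space than the data.
assert point_latent_size > input_size
print(global_latent_size + point_latent_size)  # total latent values per shape
```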

Looking forward to your reply. @ZENGXH

ZENGXH commented 1 year ago
  1. It's weird that your EMD and CD grow so much faster; are you using script/train_vae_all.sh?
  2. The trend of EMD and CD continuing to increase is expected: the KL weight increases over the first 1/2 of the total epochs, and as the KL weight increases, the model keeps smoothing / compressing the latent space (which makes learning the diffusion model easier) while sacrificing reconstruction quality.
  3. Re: ref_val_all.pt vs. PF2_val_all.pt: the difference is that PF2 is normalized with norm_box. If you are using script/train_vae_all.sh they are equivalent for evaluation, since train_vae_all uses norm_box and normalizes the val data.
  4. eval_sample is used to evaluate samples from the prior: are you training the prior now?
  5. For E1999, it seems the better checkpoint happens before that. How long does it take to train 1 epoch? Could you re-launch the job with (1) a smaller snapshot period, like 200 (viz.save_freq 200); (2) a reduced vis freq (viz.viz_freq -200); and (3) training for 1K epochs? If you re-launch the job, I suggest logging to comet_ml (setting up an account is free); it will be much easier to monitor progress. Or you can try turning on TensorBoard/wandb (but I haven't used them in a long time, so I am not sure whether they are still compatible).
  6. The advantage of LION is the increase in model expressivity & capacity. A latent diffusion model is not always (trained) faster than an output-space diffusion model: it depends on the goal of the model design. For example, if you want to speed up training / inference, you may want to make the latent space smaller. But in LION our goal is to have a high-performance point cloud generator. Our latent space is designed so that the 1D latent can capture some global information about the shape, and such a hierarchical latent space improves performance, especially when the model is trained on large-scale & multi-class data.
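The KL-weight behaviour described in point 2 can be sketched as a linear ramp over the first half of training (a hypothetical schedule for illustration only; LION's actual anneal function and target weight may differ):

```python
def kl_weight(epoch, total_epochs, max_weight=1.0):
    """Linearly increase the KL weight over the first half of training,
    then hold it constant. Illustrative only: the exact ramp shape and
    max_weight used by LION may differ."""
    ramp_epochs = max(1, total_epochs // 2)
    return max_weight * min(1.0, epoch / ramp_epochs)

# As the weight grows, the VAE trades reconstruction quality (rising val
# CD/EMD) for a smoother latent space, which is the trend described above.
```

For example, with 8000 total epochs this gives weight 0 at epoch 0, 0.5 at epoch 2000, and the full weight from epoch 4000 onward.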
yufeng9819 commented 1 year ago

Re:

  1. The results above were trained with script/train_vae_all.sh.

  2. I re-trained the VAE with script/train_vae_all.sh, changing the lr from 1e-4 to 1e-3 and the batch size to 24. I also visualize the training results with TensorBoard every day (about 300 epochs per day). The results are as follows:

    • CD and EMD: the upward curve is steeper than in the previous training run. Screenshot 2023-04-28 13:39:20
    • epoch 633: image
    • epoch 843: image
    • epoch 1158: image. In this training run, the visualization results seem normal: the model reconstructs well while the latent gradually contains more and more noise. But the CD and EMD curves still seem abnormal. @ZENGXH
  3. Got it; I comment out eval_sample when I analyze the VAE training results.

One more question: have you ever applied LION to generate more complex models? (For example, generating models with more than 2048 points, even 30k points, or applying LION to generate scene-level models.)

ZENGXH commented 1 year ago
yufeng9819 commented 1 year ago
  1. The MMD-CD plot is the result of my second run.
  2. I manually changed the save freq to 200 by stopping training and then resuming the model: Screenshot 2023-04-28 17:31:40. Before that, I manually saved a snapshot every time I visualized the VAE results in TensorBoard: Screenshot 2023-04-28 17:33:01. I hope I can get reasonable results in the end!
  3. Thanks for your kind response! @ZENGXH