Closed: yufeng9819 closed this issue 1 year ago.
I further visualized best_eval.pth, and the results are as follows:
best_eval.pth seems to save the wrong information, since visualizing it directly reproduces the gt points.
What is wrong with best_eval.pth?
I don't think you need to train it for longer. Which epoch is best_val.pth saved at? (It should be recorded in the pth file.) It might be a very early epoch: since our latent points are initialized as the gt points, and the VAE is initialized as an identity mapping, you will see such a figure at the beginning. In general, the longer you train, the worse the reconstruction gets (as shown in the val EMD/CD curves), but the smoother the latent space becomes (i.e. the latent points move closer to N(0,1), which makes training the diffusion model easier). We need to find a good trade-off between the two. In the figure you show, the latent points are super smooth; I feel the model could be stopped earlier.
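The trade-off above can be made concrete with the standard closed-form KL term that measures how close a diagonal-Gaussian latent is to N(0,1); this is a generic illustration of the "smoothness" criterion, not code from the LION repository:

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return sum(
        0.5 * (math.exp(lv) + m * m - 1.0 - lv)
        for m, lv in zip(mu, logvar)
    )

# A perfectly "smooth" latent matches the prior exactly:
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))   # 0.0
# Latents far from N(0,1) pay a larger KL cost:
print(kl_to_standard_normal([2.0, -1.0], [0.0, 0.0]))  # 2.5
```

As training pushes this KL down, sampling from the prior becomes easier, but the decoder has less information per latent, which is why reconstruction EMD/CD tends to degrade.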
best_val.pth is saved at epoch 199.
In this training run, I could not launch TensorBoard, because in utils/utils.py, USE_TFB = int(os.environ.get('USE_TFB', 0)) (line 27) is always False, so it fails to 'init TFB'.
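For what it's worth, that line reads an environment variable with a default of 0, so TensorBoard stays off unless the variable is exported before launching the job. A minimal sketch of the pattern (the variable name USE_TFB is from the snippet above; the rest is generic):

```python
import os

# USE_TFB defaults to 0 (disabled) unless set in the environment,
# so it must be exported before launch, e.g. `USE_TFB=1 bash script/...`.
os.environ["USE_TFB"] = "1"   # simulate exporting the variable

USE_TFB = int(os.environ.get("USE_TFB", 0))
print(USE_TFB)  # 1 -> TensorBoard logging would be enabled
```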
Following your training setting, I only have checkpoints saved at epochs 1999 and 3999, plus best_eval.pth and a snapshot. I also visualized the checkpoint at epoch 1999; the results are as follows:
The results also seem to be bad.
Another problem is in trainer.eval_sample: it requires loading refval%s.pt (in fact, it loads ref_val_all.pt for all cats of the PointFlow data). But in README.md, I can only find PF2_val_all.pt. Is that the right reference file for evaluating all cats of the PointFlow dataset? @ZENGXH
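If the two names do refer to the same reference data (an assumption the maintainer should confirm), a symlink avoids editing the loader. A hedged sketch, with hypothetical paths:

```python
import os

def link_reference(src: str, dst: str) -> bool:
    """Create dst as a symlink to src if src exists and dst does not yet."""
    if os.path.exists(src) and not os.path.exists(dst):
        os.symlink(os.path.abspath(src), dst)
        return True
    return False

# Hypothetical layout; adjust to where the downloaded data actually lives:
# link_reference("datasets/test_data/PF2_val_all.pt",
#                "datasets/test_data/ref_val_all.pt")
```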
I find in the training log that the model computes CD and EMD every 200 epochs. I manually visualized these results: CD and EMD increase throughout almost the whole training process.
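A small helper like the following can pull those periodic CD/EMD values out of the log for plotting; the log-line format here is a guess, so the regex will need to be adapted to the actual output:

```python
import re

# Hypothetical log-line format; adapt the pattern to the real training log.
METRIC_RE = re.compile(
    r"epoch\s+(\d+).*?CD[=:\s]+([\d.eE+-]+).*?EMD[=:\s]+([\d.eE+-]+)"
)

def parse_metrics(lines):
    """Return [(epoch, cd, emd), ...] for every line reporting both metrics."""
    out = []
    for line in lines:
        m = METRIC_RE.search(line)
        if m:
            out.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return out

sample = [
    "epoch 200 | val CD=0.0012 EMD=0.0034",
    "epoch 400 | val CD=0.0015 EMD=0.0041",
]
print(parse_metrics(sample))
```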
Furthermore, I want to understand the advantage of LION.
To my knowledge, one advantage of latent diffusion models is speeding up training. However, LION uses two latent codes whose dimensions are 128 and 8192 (= 2048×4, almost the same as the raw point dimensionality), which means LION cannot speed up the whole training process. Is this understanding correct? If so, can you explain why you designed LION this way (to improve the quality of generation results, or for some other purpose)?
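For concreteness, here is the dimensionality comparison the question is making; the 128 and 2048×4 figures come from the message above, and the rest is simple arithmetic:

```python
num_points = 2048          # points per shape in the PointFlow setup
xyz_dim = num_points * 3   # raw xyz coordinate dimensionality

global_latent_dim = 128            # LION's small global shape latent
point_latent_dim = num_points * 4  # point-structured latent, 4 channels/point

print(xyz_dim)           # 6144
print(point_latent_dim)  # 8192: comparable to the raw point dimensionality
```

So the point-structured latent is not a compression in the dimensionality sense, which is what prompts the question about whether the design targets sample quality rather than training speed.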
Looking forward to your reply. @ZENGXH
Are you using script/train_vae_all.sh? Regarding norm_box: if you are using script/train_vae_all.sh, they are the same for evaluation, since train_vae_all uses norm_box and it normalizes the val data. eval_sample is used to evaluate the samples from the prior: are you training the prior now? In my run I (1) set the checkpoint save freq (viz.save_freq 200); (2) reduced the vis freq (viz.viz_freq -200); and (3) trained for 1K epochs. If you are going to re-launch the job, I suggest logging to comet_ml (setting up the account is free); it will be much easier to monitor the progress. Or you can try turning on TensorBoard/wandb (but I haven't used them for a long time, so I am not sure whether they are still compatible). Re:
The results above were trained with script/train_vae_all.sh.
I trained the VAE with script/train_vae_all.sh, changing the lr from 1e-4 to 1e-3 and the batch size to 24. I also visualize the training results in TensorBoard every day (about 300 epochs per day). The results are as follows:
I got it; I commented out eval_sample while analyzing the VAE training results.
One more question: have you ever applied LION to generate more complex models (for example, models with more than 2048 points, even up to 30k points, or scene-level models)?
Hey,
I visualized my VAE training results and found they do not perform as well as the results you show in TensorBoard.
My visualization results are as follows:
The results above were obtained after training for 4000 epochs with train_prior_clip, batch size 12, and lr 1e-4 (I did not change any other parameters) on 8 V100s.
I wanted to check whether the VAE model had been trained enough, so I also visualized the training results at around 5000 epochs.
But the results did not get better.
I want to know why I cannot get reasonable results after 20 days of training. (Do I need to train further? In general, this is an unacceptable training length.)
Looking forward to your reply. @ZENGXH