nv-tlabs / LION

Latent Point Diffusion Models for 3D Shape Generation

The VAE training loss is very high at the early stage of training. #53

Closed OswaldoBornemann closed 11 months ago

OswaldoBornemann commented 1 year ago

@ZENGXH Hi, I am trying to train LION on my own dataset. It contains nearly 800k point clouds, and each point has 6 dimensions: xyz coordinates plus RGB color. The number of points per cloud varies from 1,000 to 3,000. Given the training time implied by a dataset of this size, I reduced trainer.epochs from 8,000 to 25 and the learning rate from 1e-3 to 1e-4. I trained the VAE on 4 A6000 GPUs with a batch size of 32. However, the training loss is very high at the early stage. Is this normal? Thanks for sharing your insight.

2023-08-26 21:43:22.176 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[2829/4110] | [Loss] 21827.68 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  2829 | [url]
2023-08-26 21:44:22.966 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[2877/4110] | [Loss] 22176.30 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  2877 | [url]
2023-08-26 21:44:52.837 | INFO     | trainers.base_trainer:train_epochs:340 - [vis_recon] step is 2900
2023-08-26 21:44:54.732 | INFO     | trainers.base_trainer:train_epochs:343 - [vis_sample] step is 2900
2023-08-26 21:45:02.185 | INFO     | trainers.common_fun:validate_inspect_noprior:117 - writer:
2023-08-26 21:45:23.076 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[2915/4110] | [Loss] 22456.15 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  2915 | [url]
2023-08-26 21:46:23.381 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[2964/4110] | [Loss] 22821.78 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  2964 | [url]
2023-08-26 21:47:09.460 | INFO     | trainers.base_trainer:train_epochs:340 - [vis_recon] step is 3000
2023-08-26 21:47:11.429 | INFO     | trainers.base_trainer:train_epochs:343 - [vis_sample] step is 3000
2023-08-26 21:47:22.454 | INFO     | trainers.common_fun:validate_inspect_noprior:117 - writer:
2023-08-26 21:47:23.566 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[3001/4110] | [Loss] 23088.34 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  3001 | [url]
2023-08-26 21:48:24.627 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[3049/4110] | [Loss] 23436.57 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  3049 | [url]
2023-08-26 21:49:24.946 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[3099/4110] | [Loss] 23802.02 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  3099 | [url]
2023-08-26 21:49:26.044 | INFO     | trainers.base_trainer:train_epochs:340 - [vis_recon] step is 3100
2023-08-26 21:49:28.004 | INFO     | trainers.base_trainer:train_epochs:343 - [vis_sample] step is 3100
2023-08-26 21:49:35.415 | INFO     | trainers.common_fun:validate_inspect_noprior:117 - writer:
2023-08-26 21:50:25.896 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[3138/4110] | [Loss] 24084.43 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  3138 | [url]
2023-08-26 21:51:26.173 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[3190/4110] | [Loss] 24470.79 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  3190 | [url]
2023-08-26 21:51:39.437 | INFO     | trainers.base_trainer:train_epochs:340 - [vis_recon] step is 3200
2023-08-26 21:51:44.995 | INFO     | trainers.base_trainer:train_epochs:343 - [vis_sample] step is 3200
2023-08-26 21:51:52.331 | INFO     | trainers.common_fun:validate_inspect_noprior:117 - writer:
2023-08-26 21:52:26.832 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[3227/4110] | [Loss] 24743.04 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  3227 | [url]
2023-08-26 21:53:27.083 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[3276/4110] | [Loss] 25101.06 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  3276 | [url]
2023-08-26 21:53:56.654 | INFO     | trainers.base_trainer:train_epochs:340 - [vis_recon] step is 3300
2023-08-26 21:53:58.537 | INFO     | trainers.base_trainer:train_epochs:343 - [vis_sample] step is 3300
2023-08-26 21:54:05.733 | INFO     | trainers.common_fun:validate_inspect_noprior:117 - writer:
2023-08-26 21:54:28.263 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[3317/4110] | [Loss] 25405.83 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  3317 | [url]
2023-08-26 21:55:28.448 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[3367/4110] | [Loss] 25780.41 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  3367 | [url]
2023-08-26 21:56:10.038 | INFO     | trainers.base_trainer:train_epochs:340 - [vis_recon] step is 3400
2023-08-26 21:56:15.462 | INFO     | trainers.base_trainer:train_epochs:343 - [vis_sample] step is 3400
2023-08-26 21:56:22.744 | INFO     | trainers.common_fun:validate_inspect_noprior:117 - writer:
2023-08-26 21:56:29.347 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[3405/4110] | [Loss] 26059.93 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  3405 | [url]
2023-08-26 21:57:29.583 | INFO     | trainers.base_trainer:train_epochs:313 - [R0] | E0 iter[3455/4110] | [Loss] 26434.48 | [exp] ./exp/0826/own_datas/77303ch_hvae_lion_B32N3564 | [step]  3455 | [url]
[Screenshot 2023-08-27 at 02:00:49 (attached image)]
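As a side note on the data setup described above: below is a minimal sketch (not LION's actual dataloader) of one way to bring variable-size, 6-D point clouds (xyz + RGB) to a fixed point count and a normalized range before feeding them to the VAE. The target count `N_POINTS` and the normalization choices are assumptions for illustration only.

```python
# A minimal sketch of preparing variable-size 6-D point clouds (xyz + rgb)
# for a model that expects a fixed point count. Not LION's dataloader;
# N_POINTS and the normalization scheme are illustrative assumptions.
import numpy as np

N_POINTS = 2048  # hypothetical fixed size; use whatever the config expects


def prepare_cloud(points: np.ndarray, n_points: int = N_POINTS) -> np.ndarray:
    """points: (M, 6) array with columns [x, y, z, r, g, b], M in [1000, 3000]."""
    # Sample with replacement only if the cloud is smaller than the target size.
    replace = points.shape[0] < n_points
    idx = np.random.choice(points.shape[0], n_points, replace=replace)
    cloud = points[idx].astype(np.float32)

    # Center and scale coordinates to roughly unit range (a common convention;
    # LION's own normalization may differ).
    xyz = cloud[:, :3]
    xyz -= xyz.mean(axis=0, keepdims=True)
    xyz /= np.abs(xyz).max() + 1e-8
    cloud[:, :3] = xyz

    # Map colors from [0, 255] to [0, 1] if they are stored as bytes.
    if cloud[:, 3:].max() > 1.0:
        cloud[:, 3:] /= 255.0
    return cloud


# Usage: batch = np.stack([prepare_cloud(pc) for pc in list_of_clouds])
```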
ZENGXH commented 12 months ago

I think that is possible. Since the number of points is large, the overall loss scale can be high. A better check of whether training is going well is the reconstructed points: ideally, the point clouds are well reconstructed (especially at the early iterations), and the latent points look similar to the input point cloud at the beginning and get smoothed out as training goes on.
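For a more quantitative check than eyeballing the [vis_recon] renders, one could compute a symmetric Chamfer distance between input and reconstructed coordinates. This is only a sketch: the `model.reconstruct` call in the usage comment is a placeholder, not LION's actual API.

```python
# A rough sketch of measuring reconstruction quality with a symmetric Chamfer
# distance on coordinates only, as a complement to the visual checks above.
import torch


def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """a: (B, N, 3), b: (B, M, 3) point clouds; returns one value per batch element."""
    # Pairwise squared distances: (B, N, M)
    d = torch.cdist(a, b, p=2.0) ** 2
    # For each point, squared distance to its nearest neighbor in the other cloud.
    return d.min(dim=2).values.mean(dim=1) + d.min(dim=1).values.mean(dim=1)


# Hypothetical usage inside a validation loop:
# with torch.no_grad():
#     recon = model.reconstruct(batch)               # placeholder call, not LION's API
#     cd = chamfer_distance(batch[..., :3], recon[..., :3])
#     print("mean Chamfer distance:", cd.mean().item())
```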

OswaldoBornemann commented 12 months ago

I think that is possible. Since the number of points is large, the overall loss scale can be high. A better check of whether training is going well is the reconstructed points: ideally, the point clouds are well reconstructed (especially at the early iterations), and the latent points look similar to the input point cloud at the beginning and get smoothed out as training goes on.

Yes, it does get smoothed out. I also found that my dataset is very sensitive to kl_weight: when kl_weight approaches 0.5, the latent points become totally different from the input point clouds. Based on this observation, I decreased kl_weight, aiming for a setting where the latent points differ only slightly from the input points, i.e., are just slightly smoothed out.
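For reference, in a plain VAE the kl_weight simply scales the KL term against the reconstruction term, which is why a large value pushes the latents toward the prior and away from the input cloud. The snippet below is a generic illustration of that trade-off, not LION's actual hierarchical-VAE loss.

```python
# A generic, simplified VAE objective to illustrate the role of kl_weight;
# LION's hierarchical VAE loss has more terms, so treat this only as a sketch.
import torch


def vae_loss(recon_loss: torch.Tensor,
             mu: torch.Tensor,
             logvar: torch.Tensor,
             kl_weight: float) -> torch.Tensor:
    # Standard Gaussian KL divergence, summed over latent dims, averaged over the batch.
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    # A large kl_weight pulls the posterior toward the prior: latent points get
    # smoother, but they resemble the input cloud less, matching the behaviour
    # described above.
    return recon_loss + kl_weight * kl
```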

ZENGXH commented 11 months ago

This makes sense to me. kl_weight can be an important hyperparameter to tune when the dataset changes. Hope the experiments go well!
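One practical pattern (purely a sketch; the commented-out helper names are not from the LION codebase) is a short sweep over candidate kl_weight values, keeping the one whose latent points are only slightly smoothed versions of the inputs:

```python
# Hypothetical kl_weight sweep: run a brief training job per candidate value and
# compare reconstruction quality. The helpers below are placeholders, not LION code.
candidates = [0.5, 0.1, 0.01, 1e-3]

for w in candidates:
    print(f"candidate kl_weight = {w}")
    # model = train_short_run(kl_weight=w, epochs=1)   # placeholder: brief VAE run
    # score = eval_recon(model, val_loader)            # placeholder: e.g. mean Chamfer distance
    # print(f"  reconstruction score: {score:.4f}")
```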

OswaldoBornemann commented 11 months ago

This makes sense to me. kl_weight can be an important hyperparameter to tune when the dataset changes. Hope the experiments go well!

Thanks a lot. I really appreciate it.