nv-tlabs / LION

Latent Point Diffusion Models for 3D Shape Generation

NaN loss while training stage 1 VAE #47

Open supriya-gdptl opened 1 year ago

supriya-gdptl commented 1 year ago

Hi @ZENGXH,

Thank you for sharing the code.

I am training the VAE (stage 1) on the ShapeNet15k dataset by following the instructions in the README.md file. I am using the default config, except that the batch size is 16 (batch size 32 gave a CUDA out-of-memory error). The loss started increasing and eventually became NaN. So I trained again with a lower learning rate of 1e-4 (originally 1e-3). This time the loss again decreased, then increased, and became NaN.

Please see the contents of the log file below:

2023-06-13 21:50:53.148 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[ 70/153] | [Loss] 335.14 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]    70 | [url] none
2023-06-13 21:51:53.192 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[152/153] | [Loss] 233.48 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   152 | [url] none
2023-06-13 21:51:53.251 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E0 iter[152/153] | [Loss] 233.48 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   152 | [url] none | [time] 2.0m (~267h) |[best] 0 -100.000x1e-2
2023-06-13 21:52:53.518 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[ 81/153] | [Loss] 108.90 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   234 | [url] none
2023-06-13 21:53:45.658 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E1 iter[152/153] | [Loss] 100.31 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   305 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 21:54:46.026 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[ 81/153] | [Loss] 79.69 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   387 | [url] none
2023-06-13 21:55:38.097 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E2 iter[152/153] | [Loss] 76.43 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   458 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 21:56:38.487 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E3 iter[ 81/153] | [Loss] 66.25 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   540 | [url] none
2023-06-13 21:57:30.785 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E3 iter[152/153] | [Loss] 63.98 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   611 | [url] none | [time] 1.9m (~250h) |[best] 0 -100.000x1e-2
2023-06-13 21:58:31.106 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E4 iter[ 81/153] | [Loss] 58.29 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   693 | [url] none
2023-06-13 21:59:23.191 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E4 iter[152/153] | [Loss] 57.15 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   764 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:00:23.558 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E5 iter[ 81/153] | [Loss] 55.49 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   846 | [url] none
2023-06-13 22:01:15.726 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E5 iter[152/153] | [Loss] 55.84 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   917 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:02:16.029 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E6 iter[ 81/153] | [Loss] 58.48 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   999 | [url] none
2023-06-13 22:03:08.117 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E6 iter[152/153] | [Loss] 59.70 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1070 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:04:08.409 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E7 iter[ 81/153] | [Loss] 64.31 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1152 | [url] none
2023-06-13 22:05:00.592 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E7 iter[152/153] | [Loss] 65.85 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1223 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:06:00.953 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E8 iter[ 81/153] | [Loss] 70.98 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1305 | [url] none
2023-06-13 22:06:53.085 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E8 iter[152/153] | [Loss] 72.55 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1376 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:07:53.497 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E9 iter[ 81/153] | [Loss] 77.83 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1458 | [url] none
2023-06-13 22:08:45.652 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E9 iter[152/153] | [Loss] 79.42 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1529 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:08:45.776 | INFO     | utils.exp_helper:get_evalname:94 - git hash: 13b1c
2023-06-13 22:08:47.341 | INFO     | trainers.base_trainer:eval_nll:743 - eval: 1/36
2023-06-13 22:08:51.946 | INFO     | trainers.base_trainer:eval_nll:743 - eval: 31/36
2023-06-13 22:09:00.621 | INFO     | utils.eval_helper:compute_NLL_metric:65 - best 10: tensor([ 57,   1, 349, 131, 113, 282, 271, 201, 108, 182], device='cuda:0')
2023-06-13 22:09:00.621 | INFO     | utils.eval_helper:compute_NLL_metric:72 - MMD-CD: 5.0256807604398546e-09
2023-06-13 22:09:00.622 | INFO     | utils.eval_helper:compute_NLL_metric:72 - MMD-EMD: 1.9488379621179774e-05
2023-06-13 22:09:00.622 | INFO     | utils.eval_helper:compute_NLL_metric:77 -
------------------------------------------------------------
../../output/lion_output/0613/car/cb9303h_hvae_lion_B16/recont_1529noemas1H13b1c.pt |
MMD-CD=0.000x1e-2 MMD-EMD=0.002x1e-2  step=1529
 none
 ------------------------------------------------------------
2023-06-13 22:09:00.622 | INFO     | trainers.base_trainer:eval_nll:814 - add: MMD-CD
2023-06-13 22:09:00.622 | INFO     | trainers.base_trainer:eval_nll:814 - add: MMD-EMD
2023-06-13 22:09:00.634 | INFO     | trainers.base_trainer:save:106 - save model as : ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16/checkpoints/best_eval.pth
2023-06-13 22:09:10.367 | INFO     | trainers.common_fun:validate_inspect_noprior:104 - writer: none
2023-06-13 22:09:46.203 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E10 iter[ 49/153] | [Loss] 83.91 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1579 | [url] none
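For reference, this is the kind of per-step check I could add to localize where the loss first turns non-finite. It is only a minimal sketch using standard PyTorch calls; `model`, `optimizer`, and `batch` are placeholders, not the actual LION trainer objects:

```python
import torch

# Surface the first backward op that produces NaN/Inf (slow; debugging only).
torch.autograd.set_detect_anomaly(True)

def training_step(model, optimizer, batch, step):
    loss = model(batch)  # placeholder for the stage-1 VAE loss computation
    # Stop as soon as the loss stops being finite, so the offending step/batch
    # can be inspected instead of silently propagating NaN into the weights.
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss {loss.item()} at step {step}")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```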

I looked at previous issues #9, #17, #18, #22, #35, but did not find any solution. Could you please tell me how to resolve this issue?

Also, could you please share the checkpoint you mentioned in this section?

Thank you, Supriya

ZENGXH commented 1 year ago

Hi Supriya, I don't see the NaN loss in the log; does it happen after epoch 10?

I can think of several hyper-parameters that could help stabilize the training:

For the checkpoints, sorry, we are still going through the company approval process to release them (a release this week is unlikely; I will follow up on the process next week).
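As a rough, generic illustration of this kind of stabilization (standard PyTorch calls only, not the specific hyper-parameters or config options of the LION trainer; `vae` and `opt` are hypothetical placeholders for the stage-1 model and its optimizer):

```python
import torch

# Hypothetical placeholders: `vae` is the stage-1 model, `opt` its optimizer.
def clipped_step(vae, opt, loss, max_grad_norm=1.0):
    opt.zero_grad()
    loss.backward()
    # Clip gradients so a single bad batch cannot blow the weights up to NaN.
    torch.nn.utils.clip_grad_norm_(vae.parameters(), max_norm=max_grad_norm)
    opt.step()

# Lowering the learning rate further (e.g. 1e-4 -> 1e-5) can be done in place:
def scale_lr(opt, factor=0.1):
    for group in opt.param_groups:
        group["lr"] *= factor
```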