nv-tlabs / LION

Latent Point Diffusion Models for 3D Shape Generation

Multi GPU Training Problem #38

Closed yufeng9819 closed 1 year ago

yufeng9819 commented 1 year ago

Hey! Thanks again for your wonderful work @ZENGXH.

However, I have run into another problem: the training process appears to be unstable, and I would like to understand why.

I am training the VAE model on all categories (`bash ./scripts/train_vae_all.sh`) with a batch size of 12 on 8 V100 16GB GPUs.

At the start of training, the loss decreases (from about 167):

```
2023-04-05 14:08:18.381 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[ 53/372] | [Loss] 167.43 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 53 | [url] none
2023-04-05 14:09:18.824 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[129/372] | [Loss] 88.55 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 129 | [url] none
2023-04-05 14:10:19.573 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[205/372] | [Loss] 62.51 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 205 | [url] none
2023-04-05 14:11:19.673 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[280/372] | [Loss] 49.84 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 280 | [url] none
2023-04-05 14:12:20.123 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[355/372] | [Loss] 42.31 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 355 | [url] none
```

However, the loss starts to increase once it has decreased to around 14:

```
2023-04-05 14:13:33.545 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[ 71/372] | [Loss] 14.14 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 443 | [url] none
2023-04-05 14:14:34.110 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[147/372] | [Loss] 14.47 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 519 | [url] none
2023-04-05 14:15:34.777 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[223/372] | [Loss] 14.91 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 595 | [url] none
2023-04-05 14:16:35.255 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[299/372] | [Loss] 15.37 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 671 | [url] none
2023-04-05 14:17:32.903 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E1 iter[371/372] | [Loss] 15.86 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 743 | [url] none | [time] 5.0m (~665h) |[best] 0 -100.000x1e-2
2023-04-05 14:18:32.966 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[ 70/372] | [Loss] 19.02 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 814 | [url] none
2023-04-05 14:19:33.599 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[144/372] | [Loss] 19.69 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 888 | [url] none
2023-04-05 14:20:34.311 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[217/372] | [Loss] 20.36 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 961 | [url] none
2023-04-05 14:21:34.365 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[290/372] | [Loss] 21.03 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1034 | [url] none
2023-04-05 14:22:35.093 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[364/372] | [Loss] 21.72 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1108 | [url] none
2023-04-05 14:22:41.203 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E2 iter[371/372] | [Loss] 21.78 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1115 | [url] none | [time] 5.1m (~684h) |[best] 0 -100.000x1e-2
2023-04-05 14:23:41.649 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E3 iter[ 72/372] | [Loss] 25.93 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1188 | [url] none
```

I want to know why the training process is so unstable and how to fix this problem.

Looking forward to your reply!

ZENGXH commented 1 year ago

This is the log from my previous experiment on 55 classes:

[Image: training loss curve from the previous 55-class experiment]

It shows similar behavior, so I think this is expected. As for the reason why the loss increases, you can check here.

I think that in the very early iterations, the KL weight is relatively small, so the model focuses mainly on optimizing the reconstruction loss and the overall loss decreases. After some steps, the KL term starts to dominate the overall loss, so the overall loss tends to increase.
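For intuition, here is a minimal sketch of how an annealed KL weight can produce exactly this shape. This is not LION's actual training code; the schedule, function names, and numbers are illustrative assumptions only.

```python
# Minimal sketch (illustrative, not LION's actual implementation):
# a VAE loss whose KL weight is linearly warmed up over training.

def kl_weight(step, warmup_steps=10_000, beta_max=1.0):
    """KL weight ramps linearly from ~0 to beta_max over warmup_steps."""
    return beta_max * min(step / warmup_steps, 1.0)

def total_loss(recon_loss, kl_loss, step):
    """Total loss = reconstruction + annealed_weight * KL."""
    return recon_loss + kl_weight(step) * kl_loss

# Toy numbers: reconstruction keeps improving while the KL term stays
# roughly flat, yet the weighted total first falls and then climbs
# as the KL weight ramps up.
for step in [100, 500, 2_000, 5_000, 10_000]:
    recon = 100.0 / (1 + step / 200)   # shrinking reconstruction loss
    kl = 300.0                         # roughly constant KL divergence
    print(f"step {step:6d}  total {total_loss(recon, kl, step):8.2f}")
```

The point is only that the reported total loss is not directly comparable across steps while the KL weight is still ramping up, so an increase by itself does not mean training is diverging.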

yufeng9819 commented 1 year ago

I got it!

Thanks for your reply!