nv-tlabs / GET3D

Training instability with motorcycle #138

Open · pallgeuer opened this issue 11 months ago

pallgeuer commented 11 months ago

Hi, I am trying to use this code to train on the motorcycle data, but the training is proving to be unstable. I have done the Blender renders as described and have all 337 models with 96 renders per model. I train as follows:

python train_3d.py --outdir=OUTPATH --data=RENDER/img/03790512 --camera_path=RENDER/camera --gpus=2 --batch=32 --batch-gpu=16 --mbstd-group=4 --gamma=80 --data_camera_mode=shapenet_motorbike --img_res=1024 --dmtet_scale=1.0 --use_shapenet_split=1 --one_3d_generator=1 --fp32=0 --workers=4

These should essentially be the default documented training parameters, except that I'm running on 2x A100 instead of 8x A100.
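For reference, here is the batch bookkeeping I am assuming for that command (a sketch only; the divisibility rules are what StyleGAN2-ADA-style launchers typically enforce, and GET3D may differ):

```python
# Hypothetical sanity check of the numbers in the command above.
gpus, batch, batch_gpu, mbstd_group = 2, 32, 16, 4

assert batch % (gpus * batch_gpu) == 0, "--batch should be a multiple of --gpus * --batch-gpu"
assert batch_gpu % mbstd_group == 0, "--mbstd-group should divide --batch-gpu"

accum_rounds = batch // (gpus * batch_gpu)  # 1 here, i.e. no gradient accumulation
print(f"effective batch {batch}, {batch_gpu} per GPU, {accum_rounds} accumulation round(s)")
```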

My issue is that the FID50k only decreases from ~250 initially to ~85 (higher than the 50-65 expected from the paper), and at around 2000-3000 kimg (out of the planned 20000 kimg) the training diverges and never recovers. What parameters should I use so that training with your code on your data actually runs to completion?

I would also be interested to know what the differences are between the code and training commands provided in this GitHub repo and those used to train the pretrained motorcycle model. For one, volume subdivision isn't implemented, but what else differs (e.g. R1 regularization, SDF regularization, single vs. two discriminators)? The paper also says Adam beta = 0.9, but the code uses (0, 0.99) (!), which is puzzling.

SteveJunGao commented 11 months ago

Hi @pallgeuer, thanks for the great questions!

I think the divergence might be because the GAN is simply unstable to train. There are several options you can try to increase stability:

Re volume subdivision: unfortunately, we didn't include the volume subdivision code in this codebase. Our paper has an ablation study with volume subdivision removed (Table 2), and the released pre-trained model was trained without volume subdivision.

Re R1 regularization: the gamma we used is 80 for motorbikes.

Re SDF regularization: which regularization do you mean exactly? We do have one regularization term in the paper (Eq. 2), and its hyperparameter is fixed in the code for all experiments.
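For reference, a rough sketch of the kind of DMTet-style SDF regularizer that Eq. 2 describes (a paraphrase in code, not the implementation in this repo; tensor names and shapes are illustrative):

```python
# Hedged sketch: for tetrahedral-grid edges whose two endpoints currently have
# opposite SDF signs, a binary cross-entropy term pushes each endpoint's sign
# towards its neighbour's, scaled by the weight mu (0.01 in the paper).
import torch
import torch.nn.functional as F

def sdf_reg_loss(sdf, edges, mu=0.01):
    """sdf: (V,) predicted SDF values; edges: (E, 2) vertex index pairs."""
    s_a, s_b = sdf[edges[:, 0]], sdf[edges[:, 1]]
    crossing = (s_a > 0) != (s_b > 0)        # only sign-crossing edges contribute
    if not crossing.any():
        return sdf.new_zeros(())
    s_a, s_b = s_a[crossing], s_b[crossing]
    loss = F.binary_cross_entropy_with_logits(s_a, (s_b > 0).float()) \
         + F.binary_cross_entropy_with_logits(s_b, (s_a > 0).float())
    return mu * loss
```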

Re single vs. two discriminators: we always use two discriminators in all experiments (except in the ablation studies comparing two discriminators against a single one).

Re Adam beta: we apologize for the typo in the paper; the (0, 0.99) in the code is correct.
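In other words, the optimizer is constructed with beta1 = 0, along these lines (a minimal hedged sketch; the learning rate below is only a placeholder, not necessarily the value used in this repo):

```python
import torch

net = torch.nn.Linear(512, 512)  # stand-in for the actual generator/discriminator
opt = torch.optim.Adam(net.parameters(), lr=0.002, betas=(0, 0.99), eps=1e-8)
```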

pallgeuer commented 11 months ago

Hi, many thanks for the detailed answer.

My original diverging training runs used a batch size of 64, which is close to the most I can fit on 2x A100 (96 fits, but starts showing symptoms of hitting the GPU memory limit). Is gradient accumulation by any chance implemented, to allow larger batch sizes?
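To be clear, by gradient accumulation I mean the usual pattern below (a generic PyTorch sketch, not a claim about what this repo implements; StyleGAN2-ADA-derived loops often derive the number of rounds as batch // (gpus * batch_gpu)):

```python
# Generic gradient accumulation: the effective batch is
# accum_rounds * batch_gpu * num_gpus, while peak memory only depends on batch_gpu.
import torch

def accumulated_step(loss_fn, opt, data_iter, accum_rounds):
    opt.zero_grad(set_to_none=True)
    for _ in range(accum_rounds):
        chunk = next(data_iter)               # one --batch-gpu sized chunk
        loss = loss_fn(chunk)
        (loss / accum_rounds).backward()      # average gradients across rounds
    opt.step()
```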

Was the pretrained model trained with --fp32=1?

Oddly, a training run with gamma=40 instead of gamma=80 was the first to make it to 6000 kimg, and it is currently still converging.

I got the name "SDF regularization" from the paper:

We follow StyleGAN2 [35] and use lazy regularization, which applies R1 regularization to discriminators only every 16 training steps. Finally, we set the hyperparameter µ that controls the SDF regularization to 0.01 in all the experiments.

But yes, this is exactly the loss contribution described in Eqs. 2 and 3, as you said.
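For concreteness, the lazy R1 pattern that quote refers to looks roughly like this, as I understand it from StyleGAN2 (a sketch, not code from this repo):

```python
# R1 gradient penalty on real images, evaluated only every r1_interval
# discriminator steps and scaled by the interval so the time-averaged strength
# stays at gamma / 2.
import torch

def lazy_r1_penalty(discriminator, real_img, step, gamma=80.0, r1_interval=16):
    if step % r1_interval != 0:
        return real_img.new_zeros(())
    real_img = real_img.detach().requires_grad_(True)
    logits = discriminator(real_img)
    grads, = torch.autograd.grad(logits.sum(), real_img, create_graph=True)
    r1 = grads.square().sum(dim=[1, 2, 3]).mean()
    return (gamma / 2) * r1 * r1_interval
```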

Okay, so this GitHub repo uses two discriminators by default when called with parameters like the ones I specified?

Was the choice of Adam betas just inherited from another project, or did initial tests with beta1 >= 0.5 show a detrimental effect? Was training with it unstable? Has a learning rate scheduler that reduces the learning rate over time been tested?
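To illustrate the last question, I mean something along these lines (purely hypothetical; nothing here is from the repo):

```python
import torch

net = torch.nn.Linear(512, 512)                    # placeholder network
opt = torch.optim.Adam(net.parameters(), lr=0.002, betas=(0, 0.99))
# decay the learning rate over the run, stepping the scheduler once per optimizer update
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=20_000)
```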

jingyang2017 commented 10 months ago

Hi, I am running into a similar issue when trying to use this code to train on Chair:

python train_3d.py --outdir='./results/' --data='/home/XXX/projects/XXX/Datasets/GET3D/ShapeNet/img/03001627' --camera_path /home/XXX/projects/XXX/Datasets/GET3D/ShapeNet/camera/ --gpus=8 --batch=32 --gamma=400 --data_camera_mode shapenet_chair --dmtet_scale 0.8 --use_shapenet_split 1 --one_3d_generator 1 --fp32 0

The following are the fid scores during training:

{"results": {"fid50k": 243.90716463299503}, "metric": "fid50k", "total_time": 226.7174837589264, "total_time_str": "3m 47s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1693757250.5896306}
{"results": {"fid50k": 85.12321071542893}, "metric": "fid50k", "total_time": 210.83970594406128, "total_time_str": "3m 31s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000204.pkl", "timestamp": 1693764915.7036355}
{"results": {"fid50k": 47.563969561269744}, "metric": "fid50k", "total_time": 233.97653555870056, "total_time_str": "3m 54s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000409.pkl", "timestamp": 1693772572.6469076}
{"results": {"fid50k": 42.0378087379054}, "metric": "fid50k", "total_time": 211.6967008113861, "total_time_str": "3m 32s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000614.pkl", "timestamp": 1693780185.751803}
{"results": {"fid50k": 40.741863425884134}, "metric": "fid50k", "total_time": 211.70320200920105, "total_time_str": "3m 32s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000819.pkl", "timestamp": 1693787801.4226646}
{"results": {"fid50k": 36.727746342948834}, "metric": "fid50k", "total_time": 211.39422988891602, "total_time_str": "3m 31s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-001024.pkl", "timestamp": 1693795420.8913603}
{"results": {"fid50k": 35.36935289811818}, "metric": "fid50k", "total_time": 211.84103798866272, "total_time_str": "3m 32s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-001228.pkl", "timestamp": 1693803039.0559103}
{"results": {"fid50k": 34.56291491733728}, "metric": "fid50k", "total_time": 213.0910336971283, "total_time_str": "3m 33s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-001433.pkl", "timestamp": 1693810644.6885796}
{"results": {"fid50k": 231.98312384110938}, "metric": "fid50k", "total_time": 228.0975947380066, "total_time_str": "3m 48s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-001638.pkl", "timestamp": 1693818536.069743}
{"results": {"fid50k": 218.79605704254513}, "metric": "fid50k", "total_time": 220.4409465789795, "total_time_str": "3m 40s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-001843.pkl", "timestamp": 1693826533.7842581}
{"results": {"fid50k": 183.94286614489346}, "metric": "fid50k", "total_time": 233.28316688537598, "total_time_str": "3m 53s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-002048.pkl", "timestamp": 1693834372.6499712}

The fid scores are unstable and the model tends to collapse after longer training.
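For reference, this is how I extract the curve above from the metric log (the metric-fid50k.jsonl file name is an assumption based on StyleGAN2-ADA-style runs; adjust the path to whatever the run directory actually contains):

```python
# Parse one JSON record per snapshot and print kimg vs. FID.
import json

def load_fid(path="metric-fid50k.jsonl"):
    curve = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            kimg = int(rec["snapshot_pkl"].rsplit("-", 1)[-1].split(".")[0])
            curve.append((kimg, rec["results"]["fid50k"]))
    return curve

for kimg, fid in load_fid():
    print(f"{kimg:6d} kimg   fid50k {fid:8.2f}")
```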

I have also evaluated the checkpoint provided at https://drive.google.com/drive/folders/1oJ-FmyVYjIwBZKDAQ4N1EEcE9dJjumdW:

CUDA_VISIBLE_DEVICES=0 python train_3d.py --outdir=save_inference_results/shapenet_chair --gpus=1 --batch=32 --gamma=400 --data_camera_mode shapenet_chair --dmtet_scale 0.8 --use_shapenet_split 1 --one_3d_generator 1 --fp32 0 --inference_vis 1 --resume_pretrain weights/shapenet_chair.pt --inference_compute_fid 1 --data='/home/XXX/projects/XXX/Datasets/GET3D/ShapeNet/img/03001627' --camera_path /home/XXX/projects/XXX/Datasets/GET3D/ShapeNet/camera/

{"results": {"fid50k": 22.706035931177578}, "metric": "fid50k", "total_time": 1566.5149657726288, "total_time_str": "26m 07s", "num_gpus": 1, "snapshot_pkl": "weights/shapenet_chair.pt", "timestamp": 1693843925.6001537}

The best model I achieved is network-snapshot-001433.pkl, which gets:

{"results": {"fid50k": 28.708304000589685}, "metric": "fid50k", "total_time": 1631.6883997917175, "total_time_str": "27m 12s", "num_gpus": 1, "snapshot_pkl": "../../../results/00001-stylegan2-03001627-gpus8-batch32-gamma400/network-snapshot-001433.pt", "timestamp": 1693845837.4513173}

Is there any problem with my training settings?