microsoft / MeshTransformer

Research code for CVPR 2021 paper "End-to-End Human Pose and Mesh Reconstruction with Transformers"
https://arxiv.org/abs/2012.09760
MIT License

Training on single dataset #72

Open fmx789 opened 1 year ago

fmx789 commented 1 year ago

Hi, thank you for the great work! In the results section of your paper, you report results from training on mixed datasets for 200 epochs. I tried training from scratch on the 3DPW dataset alone, but got unexpected results (as shown in the log below). I'd appreciate any advice on how to solve this problem.

Thanks in advance.

2022-09-27 10:27:33,273 METRO INFO: Using 1 GPUs
2022-09-27 10:27:37,447 METRO INFO: Update config parameter num_hidden_layers: 12 -> 4
2022-09-27 10:27:37,447 METRO INFO: Update config parameter hidden_size: 768 -> 1024
2022-09-27 10:27:37,447 METRO INFO: Update config parameter num_attention_heads: 12 -> 4
2022-09-27 10:27:38,310 METRO INFO: Init model from scratch.
2022-09-27 10:27:38,310 METRO INFO: Update config parameter num_hidden_layers: 12 -> 4
2022-09-27 10:27:38,310 METRO INFO: Update config parameter hidden_size: 768 -> 256
2022-09-27 10:27:38,310 METRO INFO: Update config parameter num_attention_heads: 12 -> 4
2022-09-27 10:27:38,486 METRO INFO: Init model from scratch.
2022-09-27 10:27:38,486 METRO INFO: Update config parameter num_hidden_layers: 12 -> 4
2022-09-27 10:27:38,486 METRO INFO: Update config parameter hidden_size: 768 -> 128
2022-09-27 10:27:38,486 METRO INFO: Update config parameter num_attention_heads: 12 -> 4
2022-09-27 10:27:38,569 METRO INFO: Init model from scratch.
2022-09-27 10:27:40,009 METRO INFO: => loading hrnet-v2-w64 model
2022-09-27 10:27:40,012 METRO INFO: Transformers total parameters: 102256646
2022-09-27 10:27:40,016 METRO INFO: Backbone total parameters: 128059944
2022-09-27 10:27:40,216 METRO INFO: Training parameters Namespace(data_dir='datasets', train_yaml='pw3d_tsv_reproduce/train.yaml', val_yaml='pw3d_tsv_reproduce/test.yaml', num_workers=4, img_scale_factor=1, model_name_or_path='metro/modeling/bert/bert-base-uncased/', resume_checkpoint=None, output_dir='output/', config_name='', per_gpu_train_batch_size=20, per_gpu_eval_batch_size=30, lr=0.0001, num_train_epochs=30, vertices_loss_weight=100.0, joints_loss_weight=1000.0, vloss_w_full=0.33, vloss_w_sub=0.33, vloss_w_sub2=0.33, drop_out=0.1, arch='hrnet-w64', num_hidden_layers=4, hidden_size=128, num_attention_heads=4, intermediate_size=-1, input_feat_dim='2051,512,128', hidden_feat_dim='1024,256,128', legacy_setting=True, run_eval_only=False, logging_steps=1000, device=device(type='cuda'), seed=88, local_rank=0, num_gpus=1, distributed=False)
2022-09-27 10:37:39,084 METRO INFO: eta: 5:30:01 epoch: 0 iter: 1000 max mem : 19359 loss: 43.8094, 2d joint loss: 0.0363, 3d joint loss: 0.0242, vertex loss: 0.1603, compute: 0.5986, data: 0.0054, lr: 0.000100
2022-09-27 10:44:41,439 METRO INFO: Validation epoch: 1 mPVE: 216.89, mPJPE: 163.97, PAmPJPE: 110.12, Data Count: 35515.00
2022-09-27 10:53:16,153 METRO INFO: eta: 6:50:32 epoch: 1 iter: 2000 max mem : 19359 loss: 32.0019, 2d joint loss: 0.0250, 3d joint loss: 0.0167, vertex loss: 0.1277, compute: 0.7678, data: 0.1754, lr: 0.000100
2022-09-27 11:01:39,414 METRO INFO: Validation epoch: 2 mPVE: 213.65, mPJPE: 161.82, PAmPJPE: 105.72, Data Count: 35515.00
2022-09-27 11:08:53,971 METRO INFO: eta: 7:07:05 epoch: 2 iter: 3000 max mem : 19359 loss: 26.3174, 2d joint loss: 0.0201, 3d joint loss: 0.0134, vertex loss: 0.1088, compute: 0.8245, data: 0.2321, lr: 0.000100
2022-09-27 11:18:37,952 METRO INFO: Validation epoch: 3 mPVE: 204.17, mPJPE: 154.88, PAmPJPE: 102.00, Data Count: 35515.00
2022-09-27 11:24:28,939 METRO INFO: eta: 7:07:11 epoch: 3 iter: 4000 max mem : 19359 loss: 22.8643, 2d joint loss: 0.0172, 3d joint loss: 0.0115, vertex loss: 0.0963, compute: 0.8521, data: 0.2601, lr: 0.000100
2022-09-27 11:36:15,641 METRO INFO: Validation epoch: 4 mPVE: 182.91, mPJPE: 147.08, PAmPJPE: 96.03, Data Count: 35515.00
2022-09-27 11:36:17,768 METRO INFO: Save checkpoint to output/checkpoint-4-4544
2022-09-27 11:41:03,895 METRO INFO: eta: 7:06:50 epoch: 4 iter: 5000 max mem : 19359 loss: 20.4471, 2d joint loss: 0.0152, 3d joint loss: 0.0102, vertex loss: 0.0874, compute: 0.8807, data: 0.2837, lr: 0.000100
......
2022-09-27 18:08:18,140 METRO INFO: Validation epoch: 27 mPVE: 156.67, mPJPE: 136.94, PAmPJPE: 89.08, Data Count: 35515.00
2022-09-27 18:08:20,040 METRO INFO: Save checkpoint to output/checkpoint-27-30672
2022-09-27 18:11:35,319 METRO INFO: eta: 0:46:05 epoch: 27 iter: 31000 max mem : 19359 loss: 7.0103, 2d joint loss: 0.0049, 3d joint loss: 0.0030, vertex loss: 0.0350, compute: 0.8979, data: 0.3039, lr: 0.000010
2022-09-27 18:25:21,996 METRO INFO: Validation epoch: 28 mPVE: 157.29, mPJPE: 137.61, PAmPJPE: 88.35, Data Count: 35515.00
2022-09-27 18:25:23,883 METRO INFO: Save checkpoint to output/checkpoint-28-31808
2022-09-27 18:27:18,918 METRO INFO: eta: 0:31:10 epoch: 28 iter: 32000 max mem : 19359 loss: 6.8592, 2d joint loss: 0.0048, 3d joint loss: 0.0029, vertex loss: 0.0344, compute: 0.8993, data: 0.3053, lr: 0.000010
2022-09-27 18:42:27,939 METRO INFO: Validation epoch: 29 mPVE: 158.00, mPJPE: 137.04, PAmPJPE: 88.93, Data Count: 35515.00
2022-09-27 18:43:01,725 METRO INFO: eta: 0:16:12 epoch: 29 iter: 33000 max mem : 19359 loss: 6.7176, 2d joint loss: 0.0047, 3d joint loss: 0.0029, vertex loss: 0.0338, compute: 0.9006, data: 0.3065, lr: 0.000010
2022-09-27 18:53:01,465 METRO INFO: eta: 0:01:11 epoch: 29 iter: 34000 max mem : 19359 loss: 6.5830, 2d joint loss: 0.0046, 3d joint loss: 0.0028, vertex loss: 0.0333, compute: 0.8918, data: 0.2977, lr: 0.000010
2022-09-27 18:53:49,660 METRO INFO: eta: 0:00:00 epoch: 30 iter: 34080 max mem : 19359 loss: 6.5728, 2d joint loss: 0.0046, 3d joint loss: 0.0028, vertex loss: 0.0333, compute: 0.8911, data: 0.2970, lr: 0.000001
2022-09-27 18:59:31,358 METRO INFO: Validation epoch: 30 mPVE: 158.38, mPJPE: 137.46, PAmPJPE: 88.50, Data Count: 35515.00

imabackstabber commented 1 year ago

It looks like you are facing an overfitting problem. I also tried training on a hand dataset, using FreiHAND alone, and unfortunately it overfitted as well. I'm also wondering how to fix it.
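
One way to check the overfitting reading against the log above is to pull out the validation lines and see where each metric bottoms out. The following is a minimal sketch, not part of the MeshTransformer code: the regular expression and the helper names `parse_validation` and `best_epoch` are assumptions made for illustration, and the `LOG` string only contains the validation lines quoted in this thread (epochs 5 to 26 are elided there).

```python
import re

# Validation lines copied from the log in this thread.
# Epochs 5-26 are elided in the original post, so they are missing here too.
LOG = """\
2022-09-27 10:44:41,439 METRO INFO: Validation epoch: 1 mPVE: 216.89, mPJPE: 163.97, PAmPJPE: 110.12, Data Count: 35515.00
2022-09-27 11:01:39,414 METRO INFO: Validation epoch: 2 mPVE: 213.65, mPJPE: 161.82, PAmPJPE: 105.72, Data Count: 35515.00
2022-09-27 11:18:37,952 METRO INFO: Validation epoch: 3 mPVE: 204.17, mPJPE: 154.88, PAmPJPE: 102.00, Data Count: 35515.00
2022-09-27 11:36:15,641 METRO INFO: Validation epoch: 4 mPVE: 182.91, mPJPE: 147.08, PAmPJPE: 96.03, Data Count: 35515.00
2022-09-27 18:08:18,140 METRO INFO: Validation epoch: 27 mPVE: 156.67, mPJPE: 136.94, PAmPJPE: 89.08, Data Count: 35515.00
2022-09-27 18:25:21,996 METRO INFO: Validation epoch: 28 mPVE: 157.29, mPJPE: 137.61, PAmPJPE: 88.35, Data Count: 35515.00
2022-09-27 18:42:27,939 METRO INFO: Validation epoch: 29 mPVE: 158.00, mPJPE: 137.04, PAmPJPE: 88.93, Data Count: 35515.00
2022-09-27 18:59:31,358 METRO INFO: Validation epoch: 30 mPVE: 158.38, mPJPE: 137.46, PAmPJPE: 88.50, Data Count: 35515.00
"""

# Hypothetical pattern matching the "Validation epoch: ..." lines shown above.
VAL_RE = re.compile(
    r"Validation epoch: (\d+) mPVE: ([\d.]+), mPJPE: ([\d.]+), PAmPJPE: ([\d.]+)"
)


def parse_validation(log_text):
    """Return a list of (epoch, mPVE, mPJPE, PAmPJPE) tuples found in the log."""
    rows = []
    for m in VAL_RE.finditer(log_text):
        rows.append(
            (int(m.group(1)), float(m.group(2)), float(m.group(3)), float(m.group(4)))
        )
    return rows


def best_epoch(rows, column):
    """Epoch with the lowest value in a column (1=mPVE, 2=mPJPE, 3=PAmPJPE)."""
    return min(rows, key=lambda r: r[column])[0]


rows = parse_validation(LOG)
for name, col in (("mPVE", 1), ("mPJPE", 2), ("PAmPJPE", 3)):
    print(name, "is lowest at epoch", best_epoch(rows, col))
```

On the lines quoted here, the validation errors stay roughly flat over epochs 27 to 30 while the training loss in the same log keeps falling (7.0103 to 6.5728), which is consistent with the overfitting reading. Because epochs 5 to 26 are elided, the log as posted does not show when the plateau began.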