Hello,
(1) Using a different number of GPUs during model training might lead to different performance. (2) Please check whether all training datasets are correctly downloaded. (3) Please check whether all training settings are the same as ours.
Note that a substantial number of noisy pseudo annotations might introduce instability during model training, as also observed in METRO and MeshGraphormer; this can be alleviated with the more accurate pseudo annotations provided by NeuralAnnot or EFT. Despite such instability, we have never observed such a large performance drop in our experiments, so please double-check (1), (2), and (3) above.
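For (2), a quick sanity check along these lines may help (just a sketch; the datasets/ directory name and layout are assumptions and may differ in your setup):

# sketch only: list the training yaml(s) and the size of each dataset folder,
# to spot anything missing or only partially downloaded (paths are assumed)
ls datasets/*/train.yaml
du -sh datasets/*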
Thanks for your interest in our work!!
Please reopen this issue if you need more help regarding this.
In experiment.md, training on H3.6M requires the argument --train_yaml Tax-H36m-coco40k-Muco-UP-Mpii/train.yaml.
However, I can't find Tax-H36m-coco40k-Muco-UP-Mpii/train.yaml in the root directory or in the dataset folder, so I am using the train.yaml in the provided H3.6M folder as a substitute. Could this be the cause of the mismatch? Could you provide the contents of Tax-H36m-coco40k-Muco-UP-Mpii/train.yaml? Thanks.
Hmm, is it correct that in order to reproduce the evaluation results on H3.6M, I will need to train the model on H3.6M + coco_smpl + muco + up3d + mpii?
Yes, as described in Section 5.1 of the paper, you should use the mixed datasets for model training. It seems that you used only the Human3.6M dataset for training, which leads to the low performance.
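For reference, a minimal sketch of what the mixed-dataset training command could look like; it simply adds the --train_yaml argument from experiment.md to your training command, and the exact location of Tax-H36m-coco40k-Muco-UP-Mpii/train.yaml depends on your local dataset layout:

# sketch only: same hyper-parameters as in this thread, plus the mixed-dataset yaml
python3.8 -m torch.distributed.launch --nproc_per_node=8 --master_port=29502 \
    src/tools/run_fastmetro_bodymesh.py \
    --train_yaml Tax-H36m-coco40k-Muco-UP-Mpii/train.yaml \
    --arch hrnet-w64 \
    --model_name FastMETRO-L \
    --num_workers 4 \
    --per_gpu_train_batch_size 16 \
    --per_gpu_eval_batch_size 16 \
    --lr 1e-4 \
    --num_train_epochs 60 \
    --output_dir FastMETRO-L-H64_h36m/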
OK, I downloaded and used all 5 datasets for training, and now the metrics look correct. I do recommend changing the title in experiment.md, though, as it says training on Human3.6M but in fact uses the config file for mixed-dataset training; this could cause confusion.
Anyway, thanks for your timely reply; I will close this issue.
Hi, I trained FastMETRO-L-H64 on H3.6M but only got this performance:
INFO:FastMETRO:Best Results: (PA-MPJPE) \ 0.00 \ 75.36 \ 47.05 at Epoch 60.00
I tried evaluating the official checkpoint and got the same performance as published:
INFO:FastMETRO:Validation Epoch: 0 MPVPE: 0.00, MPJPE: 52.95, PA-MPJPE: 33.58
I didn't alter any hyperparameters, except that I am using 8 V100 GPUs:

python3.8 -m torch.distributed.launch --nproc_per_node=8 --master_port=29502 \
    src/tools/run_fastmetro_bodymesh.py \
    --arch hrnet-w64 \
    --model_name FastMETRO-L \
    --num_workers 4 \
    --per_gpu_train_batch_size 16 \
    --per_gpu_eval_batch_size 16 \
    --lr 1e-4 \
    --num_train_epochs 60 \
    --output_dir FastMETRO-L-H64_h36m/
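(For what it's worth: with 8 GPUs and --per_gpu_train_batch_size 16, the effective batch size is 8 × 16 = 128. I don't know how many GPUs the released checkpoint was trained with, so whether this difference matters is just a guess on my part.)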
I did modify run_fastmetro_bodymesh.py by deleting all mesh visualization code. I am using the hrnetv2_w64_imagenet_pretrained.pth backbone.
Any clue on what I could be doing wrong? Thanks.