Training on H36M - Githubissues

xuyanyu-shh commented 5 years ago

Thank you very much for your work and for preparing the data.

I am now re-training your Fully Supervised + resnet50 model on H36M, starting from the pre-trained weights (MPII Integral | resnet50 | 88.5), using the following command: python scripts/train.py --cfg experiments/h36m/train.yaml

I have some questions. First, what does 'OCCLUSION' in train.yaml mean? It is originally set to 'True', but I cannot train successfully with that setting because the related files are missing, so I changed it to 'False'. Could you explain its effect and whether you used it for the provided model? Also, could you add the related files?

Second, the evaluation results on the validation set are poor. At the first epoch, Validation-hm36_17j is 77.43, 83.9, and 89.3 (I trained the model three times). At the fifth epoch, it has risen to 392.8, 349.4, and 287.1. Is this normal? Should I simply train longer, or is something wrong, for example because OCCLUSION is turned off?

Thanks very much! :)

mkocabas commented 5 years ago

Hi @xuyanyu-shh,

The OCCLUSION parameter indicates whether to use synthetic-occlusion augmentation during training. If you download Pascal VOC and update the VOC parameter in train.yaml, you can use it. It increased the MPJPE by around 2-3 mm in some of our experiments, and our pretrained models were trained with synthetic occlusion. You are right, we should mention that in the README.

The result of the 1st epoch seems normal, but there may be a problem with the 5th epoch. MPJPE should decrease consistently during training. If you send the log file, I can take a closer look.
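For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joints, which is why a lower value is better. The snippet below is a minimal sketch of that definition only; the function name `mpjpe` is illustrative, and the repository's own evaluation code may apply extra steps (for example root alignment) that are not shown here.

```python
import math


def mpjpe(pred, gt):
    """Average Euclidean distance between matching predicted and ground-truth joints.

    pred, gt: sequences of (x, y, z) joint coordinates in the same unit (e.g. mm).
    Illustrative helper, not the repository's evaluation code.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and have the same number of joints")
    total = 0.0
    for p, g in zip(pred, gt):
        # Euclidean distance for one joint.
        total += math.sqrt(sum((pc - gc) ** 2 for pc, gc in zip(p, g)))
    return total / len(pred)


# Two joints: the first is exact, the second is off by a 3-4-5 triangle (5 units).
example_error = mpjpe([(0, 0, 0), (3, 4, 0)], [(0, 0, 0), (0, 0, 0)])
```

Here `example_error` is the mean of 0 and 5, so a validation value that jumps from about 80 to several hundred means the predicted joints have moved far from the ground truth.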

xuyanyu-shh commented 5 years ago

Thank you very much for your reply!

I have turned on OCCLUSION and am retraining a new model.

By the way, here is the log file: train_2019-03-08-07-13_train.log

Thanks very much! :)

mkocabas commented 5 years ago

Something is definitely wrong. I will check it out.

Could you please send your current configuration, i.e. your Python and PyTorch versions?

xuyanyu-shh commented 5 years ago

My current configuration: Python 3.6.7, PyTorch 0.4.1, CUDA 9.0

I think the PyTorch version might be the main reason. I will update PyTorch to v1.0 and re-train.

Thanks very much! :)

mkocabas commented 5 years ago

Yes, I assume that the PyTorch version is the main problem. The cuDNN implementation of BatchNorm is somewhat problematic in PyTorch 0.4 versions. You should upgrade to v1.0.

Otherwise, you should follow the instructions at https://github.com/Microsoft/human-pose-estimation.pytorch to disable cuDNN's implementation of the BatchNorm layer.
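As a rough illustration of the version check discussed above, the sketch below detects a pre-1.0 PyTorch and, as a coarse fallback, turns cuDNN off globally through `torch.backends.cudnn.enabled`. This is an assumption on my part and not the procedure from the linked instructions, which target BatchNorm only; disabling cuDNN everywhere will likely slow training. The helper name `needs_cudnn_workaround` is hypothetical.

```python
import re


def needs_cudnn_workaround(version):
    """Return True for PyTorch versions older than 1.0 (e.g. "0.4.1").

    Hypothetical helper: the 1.0 threshold follows the advice in this thread.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (1, 0)


try:
    import torch
except ImportError:  # Keeps the sketch importable on machines without PyTorch.
    torch = None

if torch is not None and needs_cudnn_workaround(torch.__version__):
    # Coarse workaround (assumption): disables cuDNN for ALL layers, not just BatchNorm.
    torch.backends.cudnn.enabled = False
```

Upgrading to v1.0 remains the simpler fix; the global switch is only worth considering when an upgrade is not possible.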

mkocabas commented 5 years ago

@xuyanyu-shh, I updated the train-fs.pkl file under data/h36m with the correct one (~49K samples). You should download data.zip again. This will shorten your epoch times without changing the final accuracy.

I suggest using the new one, because the older one is unnecessarily large (~80K).

Also, can you download the pretrained MPII model again? I uploaded a new one.

mkocabas commented 5 years ago

I am closing the issue since the problem seems to be solved. Feel free to reopen it if you encounter any problems.

xuyanyu-shh commented 5 years ago

Here is a new log, which looks more normal, with the trend (84.919, 77.237, 73.315, 125.37). The only difference is OCCLUSION, which I switched to true. train_2019-03-08-11-27_train.log

I also trained a new model under PyTorch 1.0, with the trend (83.813, 75.4081, 77.281, 70.795, 72.753, 68.220, ..., 57.63 (epoch 48)). train_2019-03-08-12-34_train.log

You may be right that the PyTorch version is the main problem. If I get new results, I will post them here. Thank you very much from the bottom of my heart! :)
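The two trends reported above can be compared with a small check that flags the first point where the validation error jumps well above the best value seen so far. This is only a sketch: the function name `first_divergence` and the 1.25 tolerance are arbitrary choices for illustration, not part of the repository, and only the values actually listed in this thread are used (the elided middle of the second run is left out).

```python
def first_divergence(errors, tolerance=1.25):
    """Return the 1-based position of the first error exceeding tolerance * best-so-far.

    Returns None when no such jump occurs. Hypothetical helper with an arbitrary threshold.
    """
    best = float("inf")
    for position, err in enumerate(errors, start=1):
        if err > best * tolerance:
            return position
        best = min(best, err)
    return None


# Values copied from the two logs reported in this thread.
occlusion_run = [84.919, 77.237, 73.315, 125.37]
pytorch_1_0_run = [83.813, 75.4081, 77.281, 70.795, 72.753, 68.220]

occlusion_jump = first_divergence(occlusion_run)
pytorch_1_0_jump = first_divergence(pytorch_1_0_run)
```

With these inputs the first run is flagged at its fourth value (125.37 against a best of 73.315), while the small fluctuations in the PyTorch 1.0 run stay under the threshold.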

mkocabas / EpipolarPose

Training on H36M #1