michalfaber / keras_Realtime_Multi-Person_Pose_Estimation

Keras version of Realtime Multi-Person Pose Estimation project

Training time-consuming problem #103

Open xujiafree opened 5 years ago

xujiafree commented 5 years ago

Hi @michalfaber, thanks for your work, but I encountered some problems during training. I did not modify any parameters from the original code and used the same coco2017 dataset. However, even after 50 epochs I did not reach the loss you reported after epoch 0 (the loss is still about 870 after 50 epochs). My GPUs are two Titan XPs, if that helps for reference. Thank you very much for any help and time here!

rosivagyok commented 5 years ago

Hi @michalfaber & @xujiafree, I am encountering the same issue on the latest release of the code. I didn't modify anything, except for changing the batch size to 8, following a previous issue here.

Even after 10 epochs my training loss is around 900:

[screenshot: training loss]

...and my validation loss keeps increasing, which suggests the model starts to overfit after the first epoch:

[screenshot: validation loss]

To validate my findings, I ran the COCO evaluation module on a model I had previously trained for 20 epochs with very similar loss values, and I only got 0.15 mAP.
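For context, the evaluation boils down to a standard pycocotools run along these lines (a minimal sketch only; the annotation and results file paths are placeholders, not the repo's actual evaluation script):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

ann_file = "annotations/person_keypoints_val2017.json"  # assumed path to COCO ground truth
res_file = "predictions_keypoints.json"                 # assumed model output in COCO results format

coco_gt = COCO(ann_file)
coco_dt = coco_gt.loadRes(res_file)

coco_eval = COCOeval(coco_gt, coco_dt, iouType="keypoints")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints the AP/AR table; the first AP line is the mAP quoted above
```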

Did you encounter the same issue? Since the only thing that changed heavily after the previous release is the data augmentation and dataflow with tensorpack, I'm guessing the bug could be there (or maybe I'm wrong). I will try to debug the dataflow and see if I can find something.
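A minimal sketch of how such a dataflow can be inspected (`build_dataflow()` is a hypothetical stand-in for however the repo actually constructs its training dataflow):

```python
import numpy as np

def inspect_dataflow(df, n=5):
    """Print the component shapes of the first n datapoints of a tensorpack DataFlow."""
    df.reset_state()                               # must be called once before iterating
    for i, datapoint in enumerate(df.get_data()):  # newer tensorpack also allows `for dp in df`
        if i >= n:
            break
        # a datapoint is a list of arrays (e.g. image, masks, heatmaps, PAFs)
        shapes = [np.asarray(x).shape for x in datapoint]
        print("sample %d: component shapes = %s" % (i, shapes))

# inspect_dataflow(build_dataflow())  # build_dataflow() is a hypothetical constructor
```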

Thank you very much for your kind reply!

michalfaber commented 5 years ago

Hi @rosivagyok, @xujiafree. I made two errors when migrating the training code from C++ to tensorpack: images should be scaled only by the scale factor of the main person, not by those of the other (smaller) persons, and the validation set files should be included in the training set. I've fixed these problems and the first 5 epochs now look like this:
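For illustration only, a rough sketch of the scaling idea described above, not the repo's actual augmentation code (`target_dist` and the 368 crop size loosely follow the original CMU parameters and are placeholders here):

```python
import cv2

def scale_by_main_person(img, main_person_height, target_dist=0.6, crop_size=368):
    # pick the scale factor from the MAIN annotated person only;
    # smaller secondary persons in the image are ignored
    scale_self = main_person_height / float(crop_size)
    scale = target_dist / scale_self
    return cv2.resize(img, (0, 0), fx=scale, fy=scale)
```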

[screenshot from 2018-11-16 07-34-07: training loss for the first 5 epochs]

rosivagyok commented 5 years ago

Hi @michalfaber, thanks for your quick response! Indeed, the training looks good now.

One question: is there any particular reason you included the validation data for training as well? If we wanted a validation step in fit_generator, it would still be possible to create a separate dataflow just for validation, as in the sketch below.
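A minimal sketch of what such a setup could look like, assuming `model`, `train_df`, and `val_df` already exist, each datapoint is an (inputs, targets) pair, and the step counts are placeholders:

```python
def df_to_generator(df):
    """Wrap a tensorpack DataFlow as an endless Keras generator."""
    df.reset_state()
    while True:
        for inputs, targets in df.get_data():
            yield inputs, targets

model.fit_generator(
    df_to_generator(train_df),
    steps_per_epoch=5000,                     # placeholder value
    epochs=100,
    validation_data=df_to_generator(val_df),  # separate dataflow built only from held-out images
    validation_steps=100,                     # placeholder value
)
```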

michalfaber commented 5 years ago

Hi @rosivagyok, the more unique training data, the better. In the original implementation they also use the validation set for training, except for the first 2645 images. I think that for this kind of network the loss values themselves are a better metric than accuracy computed on the validation set, unless you have a large validation set.
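For illustration, a rough sketch of that kind of split using pycocotools (the annotation path is an assumption, and the original implementation's split was made on an older COCO release):

```python
from pycocotools.coco import COCO

coco_val = COCO("annotations/person_keypoints_val2017.json")  # assumed path
val_img_ids = sorted(coco_val.getImgIds())

held_out = set(val_img_ids[:2645])                                  # kept purely for validation
merged_into_train = [i for i in val_img_ids if i not in held_out]   # folded into training
```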

rosivagyok commented 5 years ago

Hi @michalfaber, for me it doesn't make sense to use the validation split for training, because I'm calculating the main metric (mAP) on a 500-image subset of the coco val2017 split after every epoch. Using the validation set for training in this case would bias the results, since I would be feeding the network samples that it has already "seen" in some form before.
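A minimal sketch of how that per-epoch check can be hooked in as a Keras callback (`evaluate_map_on_subset` is a hypothetical helper that runs inference on the 500-image subset and returns mAP):

```python
from keras.callbacks import Callback

class EpochMapEvaluator(Callback):
    """Run a small COCO-style evaluation at the end of every epoch."""

    def __init__(self, evaluate_fn):
        super(EpochMapEvaluator, self).__init__()
        self.evaluate_fn = evaluate_fn  # hypothetical: runs inference on the val subset, returns mAP

    def on_epoch_end(self, epoch, logs=None):
        map_value = self.evaluate_fn(self.model)
        print("epoch %d: mAP on val2017 subset = %.3f" % (epoch, map_value))

# model.fit_generator(..., callbacks=[EpochMapEvaluator(evaluate_map_on_subset)])
```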