michalfaber / keras_Realtime_Multi-Person_Pose_Estimation

Keras version of Realtime Multi-Person Pose Estimation project

Looks like the training process is broken (see second comment) #12

Open anatolix opened 6 years ago

anatolix commented 6 years ago

Hi.

I trained the model a bit further; it is still training.

[screenshot: training progress, 2017-10-23]

Any idea what the target loss should be? (I haven't found it in the paper, only COCO results.)

anatolix commented 6 years ago

Hi.

I tried to get an answer to this myself: I took the converted model model/keras/model.h5, renamed it to weights.best.h5, and loaded it for the next stage of training. The numbers show that this model's loss is actually worse than my trained model's, i.e. 11 vs 1.75, which means the Keras training process or the data preparation is broken. Since the testing process obviously works, I think it could be the data preparation.

loss: 11.8681 -

weight_stage1_L1_loss: 0.0017 - weight_stage1_L2_loss: 0.0012 -
weight_stage2_L1_loss: 0.0015 - weight_stage2_L2_loss: 0.0011 -
weight_stage3_L1_loss: 0.0014 - weight_stage3_L2_loss: 0.0010 -
weight_stage4_L1_loss: 0.0014 - weight_stage4_L2_loss: 0.0010 -
weight_stage5_L1_loss: 0.0014 - weight_stage5_L2_loss: 0.0010 -
weight_stage6_L1_loss: 0.0014 - weight_stage6_L2_loss: 0.0010 -

weight_stage1_L1_acc: 0.0422 - weight_stage1_L2_acc: 0.9749 -
weight_stage2_L1_acc: 0.0495 - weight_stage2_L2_acc: 0.9761 -
weight_stage3_L1_acc: 0.0479 - weight_stage3_L2_acc: 0.9771 -
weight_stage4_L1_acc: 0.8540 - weight_stage4_L2_acc: 0.9776 -
weight_stage5_L1_acc: 0.8352 - weight_stage5_L2_acc: 0.9771 -
weight_stage6_L1_acc: 0.0500 - weight_stage6_L2_acc: 0.9774 -

val_loss: 11.8669 -

val_weight_stage1_L1_loss: 0.0015 - val_weight_stage1_L2_loss: 0.0012 -
val_weight_stage2_L1_loss: 0.0013 - val_weight_stage2_L2_loss: 0.0011 -
val_weight_stage3_L1_loss: 0.0013 - val_weight_stage3_L2_loss: 0.0010 -
val_weight_stage4_L1_loss: 0.0013 - val_weight_stage4_L2_loss: 0.0010 -
val_weight_stage5_L1_loss: 0.0013 - val_weight_stage5_L2_loss: 0.0010 -
val_weight_stage6_L1_loss: 0.0013 - val_weight_stage6_L2_loss: 0.0010 -

val_weight_stage1_L1_acc: 0.0454 - val_weight_stage1_L2_acc: 0.9742 -
val_weight_stage2_L1_acc: 0.0501 - val_weight_stage2_L2_acc: 0.9752 -
val_weight_stage3_L1_acc: 0.0504 - val_weight_stage3_L2_acc: 0.9756 -
val_weight_stage4_L1_acc: 0.8673 - val_weight_stage4_L2_acc: 0.9758 -
val_weight_stage5_L1_acc: 0.8612 - val_weight_stage5_L2_acc: 0.9759 -
val_weight_stage6_L1_acc: 0.0516 - val_weight_stage6_L2_acc: 0.9759

michalfaber commented 6 years ago

@anatolix Thanks for trying this path. Indeed, there are some issues with the training process in Keras. I will update the code once I find the problem.

michalfaber commented 6 years ago

@anatolix Problem solved.

anatolix commented 6 years ago

Thanks, will try today.

anatolix commented 6 years ago

It definitely works very well now. Thank you!

Could you please explain a bit about 'The network never sees the same image twice which was a problem in previous approach (tool rmpe_dataset_transformer)'?

Do you mean that with rmpe_dataset_transformer we generated enough data for one generation of training and then fed exactly the same images in all later generations, while the server re-transforms everything each generation?

michalfaber commented 6 years ago

Exactly. The server performs a random transformation on every image.
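
A minimal sketch of the difference (the names here are illustrative, not the actual server code): instead of reading a fixed, pre-transformed dataset, the generator re-applies a fresh random transform on every pass.

    import numpy as np

    def augmented_batches(images, annotations, transform, batch_size=10):
        """Endless generator: every epoch re-shuffles and re-applies a random
        transform (scale / rotate / crop / flip), so the network never sees
        exactly the same pixels twice."""
        n = len(images)
        while True:
            order = np.random.permutation(n)
            for start in range(0, n - batch_size + 1, batch_size):
                idx = order[start:start + batch_size]
                pairs = [transform(images[i], annotations[i]) for i in idx]
                x, y = zip(*pairs)
                yield np.stack(x), np.stack(y)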

ulzee commented 6 years ago

I have been running a similar experiment, and I am not sure if what I'm seeing is intended. When I train for an epoch on top of the downloadable Keras-trained model model.h5, this is the printout:

loss: 669.5428 - 

weight_stage1_L1_loss: 85.6600 - weight_stage1_L2_loss: 28.8309 - 
weight_stage2_L1_loss: 83.0361 - weight_stage2_L2_loss: 34.7485 - 
weight_stage3_L1_loss: 81.8874 - weight_stage3_L2_loss: 28.2328 - 
weight_stage4_L1_loss: 78.9033 - weight_stage4_L2_loss: 27.0499 - 
weight_stage5_L1_loss: 78.8889 - weight_stage5_L2_loss: 26.4100 - 
weight_stage6_L1_loss: 77.8298 - weight_stage6_L2_loss: 26.2129 - 

weight_stage1_L1_acc: 0.0615 - weight_stage1_L2_acc: 0.9690 - 
weight_stage2_L1_acc: 0.0684 - weight_stage2_L2_acc: 0.9697 - 
weight_stage3_L1_acc: 0.0681 - weight_stage3_L2_acc: 0.9701 - 
weight_stage4_L1_acc: 0.0682 - weight_stage4_L2_acc: 0.9701 - 
weight_stage5_L1_acc: 0.0681 - weight_stage5_L2_acc: 0.9703 - 
weight_stage6_L1_acc: 0.0685 - weight_stage6_L2_acc: 0.9703 - 

The losses line up with the screenshot in the README as well as the Caffe loss charts, but how should one interpret the L1 accuracies? Why are they so low compared to the L2 accuracies?

This was even odder considering that when training the model from the ground up, I saw 50%+ L1 accuracies within the first epoch, leading me to believe something is inconsistent.

weight_stage1_L1_acc: 0.5193 - weight_stage1_L2_acc: 0.4426 - 
weight_stage2_L1_acc: 0.5447 - weight_stage2_L2_acc: 0.4446 - 
weight_stage3_L1_acc: 0.5456 - weight_stage3_L2_acc: 0.4404 - 
weight_stage4_L1_acc: 0.5447 - weight_stage4_L2_acc: 0.4436 - 
weight_stage5_L1_acc: 0.5490 - weight_stage5_L2_acc: 0.4427 - 
weight_stage6_L1_acc: 0.5466 - weight_stage6_L2_acc: 0.4427 

EDIT: (1) I did some more training and the resulting model does seem to work fine. (2) I'm still not sure what the accuracies mean, but I could not recreate the 50%+ accuracies, so maybe it is supposed to be a small number?

piperod commented 6 years ago

Hi, sorry to bother you. I am getting 'nan' in the loss function. Any idea what I am doing wrong? Accuracy is reporting fine. Any help is very welcome.

loss: nan - weight_stage1_L1_loss: nan - weight_stage1_L2_loss: nan - weight_stage2_L1_loss: nan - weight_stage2_L2_loss: nan - weight_stage3_L1_loss: nan - weight_stage3_L2_loss: nan - weight_stage4_L1_loss: nan - weight_stage4_L2_loss: nan - weight_stage5_L1_loss: nan - weight_stage5_L2_loss: nan - weight_stage6_L1_loss: nan - weight_stage6_L2_loss: nan -

ksaluja15 commented 6 years ago

@piperod Assuming everything else is correct, did you increase the base learning rate? Increasing the base lr leads to the loss becoming nan.
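
For reference, a minimal sketch of keeping the base learning rate small in Keras (the exact values and optimizer settings in the repo may differ; these numbers are illustrative):

    from keras.optimizers import SGD

    # Keep the base learning rate small; bumping it by an order of magnitude is
    # the kind of change that typically sends this network's loss to nan.
    sgd = SGD(lr=4e-5, momentum=0.9, decay=0.0, nesterov=False)
    # model.compile(loss=..., optimizer=sgd)  # 'model' = the multi-stage training model built elsewhere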

piperod commented 6 years ago

@ksaluja15 Thanks for replying. I didn't touch the learning rate. However, I did modify the parts, since I don't have 17 annotations. Is the code at line 94 of generate_hdf5.py related, with x in [part, 0], y in [part, 1], and visibility in [part, 2]? I also think I might have to modify the augmenter too.

pers["joint"] = np.zeros((17, 3))
                for part in range(17):
                    pers["joint"][part, 0] = anno[part * 3]
                    pers["joint"][part, 1] = anno[part * 3 + 1]
                    if anno[part * 3 + 2] == 2:
                        pers["joint"][part, 2] = 1
                    elif anno[part * 3 + 2] == 1:
                        pers["joint"][part, 2] = 0
                    else:
                        pers["joint"][part, 2] = 2
anatolix commented 6 years ago

@piperod I've made a fork with my custom augmenter https://github.com/anatolix/keras_Realtime_Multi-Person_Pose_Estimation and got stuck on the NaN problem too. Trying to solve it now.

The idea of the augmentation is the following: if you just delete some parts, you can simply delete the corresponding heatmaps and PAFs; but if you add or change parts, you need new heatmaps and PAFs.

piperod commented 6 years ago

@anatolix Thanks a lot, I will try your code ASAP. I was able to get rid of the NaNs, but I think it is a very naive solution: I basically put several copies of my 5 annotations into the 17 slots available in generate_hdf5.py lines 96-100. I still don't have results, but I guess making the code more flexible in the number of annotations could be a good feature to add in the future.
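
A minimal sketch of what that flexibility might look like, assuming the per-person annotation stays a flat [x, y, v] triplet per part (NUM_PARTS, the function name, and the layout are my assumptions, not the repo's actual code):

    import numpy as np

    NUM_PARTS = 5  # e.g. a custom skeleton with 5 keypoints instead of COCO's 17

    def parse_joints(anno, num_parts=NUM_PARTS):
        """Turn a flat [x1, y1, v1, x2, y2, v2, ...] list into an (num_parts, 3) array,
        remapping visibility the same way generate_hdf5.py does:
        2 (visible) -> 1, 1 (labeled, not visible) -> 0, 0 (not labeled) -> 2."""
        joints = np.zeros((num_parts, 3))
        for part in range(num_parts):
            x, y, v = anno[part * 3], anno[part * 3 + 1], anno[part * 3 + 2]
            joints[part, 0] = x
            joints[part, 1] = y
            joints[part, 2] = {2: 1, 1: 0}.get(v, 2)
        return joints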

anatolix commented 6 years ago

About accuracy: Keras uses a binary classification accuracy if you just ask for 'accuracy', so this metric is not relevant here. I think it converts the floats in the PAFs and heatmaps to booleans and compares them.

It is nothing like the COCO or MPII accuracies mentioned in the original paper.
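
A tiny numpy sketch of that reading (illustrative only; which metric Keras actually picks for 'accuracy' depends on the loss and output configuration):

    import numpy as np

    def boolean_style_accuracy(y_true, y_pred):
        # Round the predicted floats and compare elementwise with the target map.
        return np.mean(np.round(y_pred) == np.round(y_true))

    # Heatmaps/PAFs are mostly near-zero floats, so after rounding almost every
    # pixel "matches" the background and the number says little about pose quality.
    y_true = np.zeros((46, 46, 19))                     # background-dominated target
    y_pred = np.random.uniform(0.0, 0.3, y_true.shape)  # arbitrary low-confidence prediction
    print(boolean_style_accuracy(y_true, y_pred))       # ~1.0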

piperod commented 6 years ago

It is weird: it went well for about 500 epochs, but then nan is reported again. Right before it happens, the loss explodes to infinity. [screenshots: 2017-11-13]

anatolix commented 6 years ago

@piperod I've fixed problem with NaNs in my code https://github.com/anatolix/keras_Realtime_Multi-Person_Pose_Estimation/commit/9e5adc4d4af64b642562882cedf9e30cbf00ed05

and I think it could be caused by exactly the same mechanism in the C++ code. Look at this code from https://github.com/michalfaber/rmpe_dataset_server/blob/master/DataTransformer.cpp:

  float norm_bc = sqrt(bc.x*bc.x + bc.y*bc.y);
  bc.x = bc.x /norm_bc;
  bc.y = bc.y /norm_bc;

If norm_bc is 0 (the body parts at the start and end of the PAF are in the same place, i.e. a limb perpendicular to the image plane, for example pointing at the camera), it produces a NaN which kills the whole neural network instantly.

To figure it out, just put an assert for NaNs in the input into the training code.
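
A minimal Python sketch of both ideas (names are mine, not from the repo; the same kind of guard is what the C++ normalization above needs around norm_bc):

    import numpy as np

    def safe_unit_vector(vec, eps=1e-8):
        """Guarded version of the normalization above: if the two keypoints
        coincide (norm ~ 0), return a zero vector instead of dividing by zero."""
        vec = np.asarray(vec, dtype=float)
        norm = np.sqrt(vec[0] ** 2 + vec[1] ** 2)
        return vec / norm if norm > eps else np.zeros_like(vec)

    def assert_finite(name, arr):
        """Drop this into the batch generator / training loop to catch poisoned input early."""
        assert np.isfinite(arr).all(), "NaN or inf detected in %s" % name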

piperod commented 6 years ago

@anatolix You were right. The fact that I was hardcoding the new parts incorrectly was causing these divisions by 0. However, I was able to modify the joints and the skeleton mapping (DataTransformer.cpp) to match my annotations, and so far it is working. The loss has dropped from 2000 to 145 in the first 1600 epochs. I haven't tested the model yet, but it seems like a good start. I'm still figuring out what the error with the Python code you provided was; as soon as I have results I will let you know. Thanks for the patience and the help.

piperod commented 6 years ago

Just a little update: it's working great! The loss function is working now, and predictions are close to what is shown in the demo. [screenshot: 2017-11-14]