anatolix opened this issue 7 years ago
Hi.
I tried to answer this myself: I took the converted model model/keras/model.h5, renamed it to weights.best.h5, and loaded it in the next stage of training. The numbers show this model is actually worse than the model I trained myself (a loss of 11 vs 1.75), which suggests the Keras training process or data preparation is broken. Since the testing process obviously works, I suspect the data preparation.
loss: 11.8681 - weight_stage1_L1_loss: 0.0017 - weight_stage1_L2_loss: 0.0012 - weight_stage2_L1_loss: 0.0015 - weight_stage2_L2_loss: 0.0011 - weight_stage3_L1_loss: 0.0014 - weight_stage3_L2_loss: 0.0010 - weight_stage4_L1_loss: 0.0014 - weight_stage4_L2_loss: 0.0010 - weight_stage5_L1_loss: 0.0014 - weight_stage5_L2_loss: 0.0010 - weight_stage6_L1_loss: 0.0014 - weight_stage6_L2_loss: 0.0010 - weight_stage1_L1_acc: 0.0422 - weight_stage1_L2_acc: 0.9749 - weight_stage2_L1_acc: 0.0495 - weight_stage2_L2_acc: 0.9761 - weight_stage3_L1_acc: 0.0479 - weight_stage3_L2_acc: 0.9771 - weight_stage4_L1_acc: 0.8540 - weight_stage4_L2_acc: 0.9776 - weight_stage5_L1_acc: 0.8352 - weight_stage5_L2_acc: 0.9771 - weight_stage6_L1_acc: 0.0500 - weight_stage6_L2_acc: 0.9774 - val_loss: 11.8669 - val_weight_stage1_L1_loss: 0.0015 - val_weight_stage1_L2_loss: 0.0012 - val_weight_stage2_L1_loss: 0.0013 - val_weight_stage2_L2_loss: 0.0011 - val_weight_stage3_L1_loss: 0.0013 - val_weight_stage3_L2_loss: 0.0010 - val_weight_stage4_L1_loss: 0.0013 - val_weight_stage4_L2_loss: 0.0010 - val_weight_stage5_L1_loss: 0.0013 - val_weight_stage5_L2_loss: 0.0010 - val_weight_stage6_L1_loss: 0.0013 - val_weight_stage6_L2_loss: 0.0010 - val_weight_stage1_L1_acc: 0.0454 - val_weight_stage1_L2_acc: 0.9742 - val_weight_stage2_L1_acc: 0.0501 - val_weight_stage2_L2_acc: 0.9752 - val_weight_stage3_L1_acc: 0.0504 - val_weight_stage3_L2_acc: 0.9756 - val_weight_stage4_L1_acc: 0.8673 - val_weight_stage4_L2_acc: 0.9758 - val_weight_stage5_L1_acc: 0.8612 - val_weight_stage5_L2_acc: 0.9759 - val_weight_stage6_L1_acc: 0.0516 - val_weight_stage6_L2_acc: 0.9759
@anatolix Thanks for trying this path. Indeed, there are some issues with the training process in Keras. I will update the code once I find the problem.
@anatolix Problem solved.
Thanks, will try today.
It definitely works very well now. Thank you!
Could you please explain a bit about 'The network never sees the same image twice which was a problem in previous approach (tool rmpe_dataset_transformer)'?
Do you mean that with rmpe_dataset_transformer we generated enough data for one generation of training and then fed exactly the same images in all subsequent generations, while the server re-transforms everything each generation?
Exactly. The server performs a random transformation on every image.
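To make the difference concrete, here is a minimal sketch (with made-up helper names, not the server's actual code) of the on-the-fly approach: instead of iterating over a fixed, pre-transformed dataset, a fresh random transformation is applied every time an image is drawn, so no two passes produce identical samples.

```python
import random

def augment(image, rng):
    """Stand-in for the server's random transform (scale, rotation,
    flip, crop); here we just pair the image with a random scale."""
    return (image, rng.uniform(0.8, 1.2))

def endless_samples(images, seed=0):
    """Yield samples forever, re-augmenting on every pass, so the
    network never sees exactly the same transformed image twice."""
    rng = random.Random(seed)
    while True:
        for img in images:
            yield augment(img, rng)

gen = endless_samples(["img0", "img1"])
first_pass = [next(gen) for _ in range(2)]
second_pass = [next(gen) for _ in range(2)]
# Same underlying images on each pass, but different random transforms.
```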
I have been running a similar experiment and I am not sure whether what I'm seeing is intended. When I train an epoch on top of the downloadable Keras-trained model model.h5, this is the printout:
loss: 669.5428 -
weight_stage1_L1_loss: 85.6600 - weight_stage1_L2_loss: 28.8309 -
weight_stage2_L1_loss: 83.0361 - weight_stage2_L2_loss: 34.7485 -
weight_stage3_L1_loss: 81.8874 - weight_stage3_L2_loss: 28.2328 -
weight_stage4_L1_loss: 78.9033 - weight_stage4_L2_loss: 27.0499 -
weight_stage5_L1_loss: 78.8889 - weight_stage5_L2_loss: 26.4100 -
weight_stage6_L1_loss: 77.8298 - weight_stage6_L2_loss: 26.2129 -
weight_stage1_L1_acc: 0.0615 - weight_stage1_L2_acc: 0.9690 -
weight_stage2_L1_acc: 0.0684 - weight_stage2_L2_acc: 0.9697 -
weight_stage3_L1_acc: 0.0681 - weight_stage3_L2_acc: 0.9701 -
weight_stage4_L1_acc: 0.0682 - weight_stage4_L2_acc: 0.9701 -
weight_stage5_L1_acc: 0.0681 - weight_stage5_L2_acc: 0.9703 -
weight_stage6_L1_acc: 0.0685 - weight_stage6_L2_acc: 0.9703 -
The losses line up with the screenshot in the README as well as the Caffe loss charts, but how should one interpret the L1 accuracies? Why are they so low compared to the L2 accuracies?
This was even more odd considering that when training the model from scratch, I saw 50%+ L1 accuracies within the first epoch, leading me to believe something is inconsistent.
weight_stage1_L1_acc: 0.5193 - weight_stage1_L2_acc: 0.4426 -
weight_stage2_L1_acc: 0.5447 - weight_stage2_L2_acc: 0.4446 -
weight_stage3_L1_acc: 0.5456 - weight_stage3_L2_acc: 0.4404 -
weight_stage4_L1_acc: 0.5447 - weight_stage4_L2_acc: 0.4436 -
weight_stage5_L1_acc: 0.5490 - weight_stage5_L2_acc: 0.4427 -
weight_stage6_L1_acc: 0.5466 - weight_stage6_L2_acc: 0.4427
EDIT: (1) Did some more training, and the produced model does seem to work fine. (2) Still not sure what the accuracies mean, but I could not recreate the 50%+ accuracies, so maybe it is supposed to be a small number?
Hi, sorry to bother. I am getting a 'nan' in the loss function. Any idea what I am doing wrong? The accuracy is reporting fine. Any help is very welcome.
loss: nan - weight_stage1_L1_loss: nan - weight_stage1_L2_loss: nan - weight_stage2_L1_loss: nan - weight_stage2_L2_loss: nan - weight_stage3_L1_loss: nan - weight_stage3_L2_loss: nan - weight_stage4_L1_loss: nan - weight_stage4_L2_loss: nan - weight_stage5_L1_loss: nan - weight_stage5_L2_loss: nan - weight_stage6_L1_loss: nan - weight_stage6_L2_loss: nan -
@piperod Assuming everything else is correct, did you increase the base learning rate? Increasing the base LR can lead to the loss becoming NaN.
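As a toy illustration (not this repo's actual training loop) of why a too-large learning rate ends in NaN: on even a simple quadratic loss, a step size above the stability threshold makes the iterates diverge, overflow to infinity, and the very next update computes inf - inf = NaN.

```python
import numpy as np

def sgd_step(w, grad, lr):
    """One vanilla gradient-descent update."""
    return w - lr * grad

# Toy loss L(w) = w^2 with gradient 2w. Stability requires lr < 1.0;
# with lr = 1.5 each step multiplies w by -2, so |w| doubles until it
# overflows to inf, and the next update yields inf - inf = NaN.
w = np.float64(1.0)
for _ in range(1100):
    w = sgd_step(w, 2 * w, lr=1.5)
# w is now NaN: the "loss: nan" symptom in miniature.
```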
@ksaluja15 Thanks for replying. I didn't touch the learning rate. However, I did modify the parts, since I don't have 17 annotations. Is this related to the code at line 94 of generate_hdf5.py, with x at [part, 0], y at [part, 1], and visibility at [part, 2]? I also think my problem might be that I have to modify the augmenter too.
pers["joint"] = np.zeros((17, 3))
for part in range(17):
    pers["joint"][part, 0] = anno[part * 3]      # x
    pers["joint"][part, 1] = anno[part * 3 + 1]  # y
    # Remap the COCO visibility flag: 2 (labeled and visible) -> 1,
    # 1 (labeled but not visible) -> 0, 0 (not labeled) -> 2
    if anno[part * 3 + 2] == 2:
        pers["joint"][part, 2] = 1
    elif anno[part * 3 + 2] == 1:
        pers["joint"][part, 2] = 0
    else:
        pers["joint"][part, 2] = 2
@piperod I've made a fork with my custom augmenter, https://github.com/anatolix/keras_Realtime_Multi-Person_Pose_Estimation, and have run into the NaN problem too. Trying to solve it now.
The idea of the augmentation is the following: if you just delete some parts, you also have to delete the corresponding heatmaps and PAFs. If you add or change parts, you need new heatmaps and PAFs.
@anatolix Thanks a lot, I will try your code ASAP. I was able to get rid of the NaNs, but I think it is a very naive solution: I basically put several copies of my 5 annotations into the 17 slots available in generate_hdf5.py, lines 96-100. I still don't have results, but making the code more flexible in the number of annotations could be a good feature to add in the future.
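A sketch of what such a flexible version might look like: this is a hypothetical rewrite (my own function name and signature, not the repo's code) that parameterizes the hardcoded 17 while keeping the same COCO visibility remapping as generate_hdf5.py.

```python
import numpy as np

def parse_joints(anno, num_parts=17):
    """Parse a flat [x0, y0, v0, x1, y1, v1, ...] annotation list into
    a (num_parts, 3) array, remapping COCO visibility flags the same
    way generate_hdf5.py does: 2 (visible) -> 1, 1 (occluded) -> 0,
    anything else (not labeled) -> 2."""
    joints = np.zeros((num_parts, 3))
    visibility_map = {2: 1, 1: 0}
    for part in range(num_parts):
        joints[part, 0] = anno[part * 3]        # x
        joints[part, 1] = anno[part * 3 + 1]    # y
        joints[part, 2] = visibility_map.get(anno[part * 3 + 2], 2)
    return joints

# Works with a 5-part skeleton just as well as with 17:
j = parse_joints([10, 20, 2] * 5, num_parts=5)
```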
About accuracy: Keras uses binary classification accuracy if you just ask for 'accuracy'. This metric is not relevant here; I think it converts the floats in the PAFs and heatmaps to booleans and compares them.
It is nothing like the COCO or MPII accuracies mentioned in the original paper.
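Roughly, this is what that metric computes on float heatmap values. The sketch below is a NumPy re-implementation for illustration (Keras's actual binary_accuracy rounds only the predictions; I round both sides here, and the arrays are made-up examples):

```python
import numpy as np

def binary_accuracy(y_true, y_pred):
    """Keras-style binary accuracy: threshold values at 0.5 and
    compare element-wise, then average."""
    return np.mean(np.round(y_true) == np.round(y_pred))

# Heatmap/PAF targets are mostly small floats well below 0.5, so both
# sides round to 0 almost everywhere and the "accuracy" is dominated
# by background pixels, regardless of how good the peaks are.
y_true = np.array([0.0, 0.1, 0.9, 0.2])
y_pred = np.array([0.05, 0.2, 0.8, 0.1])
acc = binary_accuracy(y_true, y_pred)  # 1.0 despite real regression error
```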
It is weird: it went well for about 500 epochs, but then NaN is reported again. Right before it happens, the loss explodes to infinity.
@piperod I've fixed the problem with NaNs in my code: https://github.com/anatolix/keras_Realtime_Multi-Person_Pose_Estimation/commit/9e5adc4d4af64b642562882cedf9e30cbf00ed05
I think it could be caused by exactly the same mechanism in the C++ code. Look at this code from https://github.com/michalfaber/rmpe_dataset_server/blob/master/DataTransformer.cpp:
float norm_bc = sqrt(bc.x * bc.x + bc.y * bc.y);
bc.x = bc.x / norm_bc;
bc.y = bc.y / norm_bc;
If norm_bc is 0 (the start and end body parts of the PAF are in the same place, i.e. the limb is perpendicular to the image plane, for example pointing at the camera), it produces NaN, which kills the whole neural network instantly.
To figure it out, just put an assert for NaNs in the input into the training code.
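The zero-norm guard and the suggested input check can be sketched like this (a NumPy re-implementation of the C++ logic above; the function names are mine, not from either repo):

```python
import numpy as np

def unit_vector(bc):
    """Normalize a 2-D PAF direction vector, guarding the zero-length
    case that otherwise produces 0/0 = NaN and poisons the network."""
    norm = np.sqrt(bc[0] ** 2 + bc[1] ** 2)
    if norm < 1e-8:                # both keypoints at the same pixel
        return np.zeros_like(bc)   # emit a zero PAF instead of NaN
    return bc / norm

def assert_finite(batch):
    """Cheap check to place in the training loop on every input batch."""
    assert np.all(np.isfinite(batch)), "NaN/Inf in training input"

v = unit_vector(np.array([0.0, 0.0]))  # [0. 0.] instead of [nan nan]
assert_finite(v)
```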
@anatolix You were right: wrongly hardcoding the new parts was causing these divisions by 0. I was able to modify the joints and the skeleton mapping (DataTransformer.cpp) to match my annotations, and so far it is working. The loss has dropped from 2000 to 145 in the first 1600 epochs. I haven't tested the model yet, but it seems like a good start. I am still figuring out what the error was with the Python code you provided; as soon as I have results, I will let you know. Thanks for the patience and the help.
Just a little update: it's working great! The loss function is working now, and the predictions are close to what is shown in the demo.
Hi.
I trained the model a bit further; it is still training.
Any idea what the target loss should be? (I haven't found it in the paper, only COCO results.)