una-dinosauria / 3d-pose-baseline

A simple baseline for 3d human pose estimation in tensorflow. Presented at ICCV 17.
MIT License

Replicating GT/SH in Table 1 #60

Open matteorr opened 6 years ago

matteorr commented 6 years ago

I have a question about the experiment GT/SH described in Table 1. If I understand correctly, the setup is: train on ground-truth (GT) 2D projections and test on Stacked Hourglass (SH) 2D detections.

To replicate this setup, I changed the following lines in your code from:

# Read stacked hourglass 2D predictions if use_sh, otherwise use groundtruth 2D projections
if FLAGS.use_sh:
  train_set_2d, test_set_2d, data_mean_2d, data_std_2d, dim_to_ignore_2d, dim_to_use_2d = data_utils.read_2d_predictions(actions, FLAGS.data_dir)
else:
  train_set_2d, test_set_2d, data_mean_2d, data_std_2d, dim_to_ignore_2d, dim_to_use_2d = data_utils.create_2d_data( actions, FLAGS.data_dir, rcams )

to

# Use GT for train and SH for test
train_set_2d, _, data_mean_2d, data_std_2d, dim_to_ignore_2d, dim_to_use_2d = data_utils.create_2d_data( actions, FLAGS.data_dir, rcams )
_, test_set_2d, _, _, _, _  = data_utils.read_2d_predictions(actions, FLAGS.data_dir)

so I can train on GT and test on SH (i.e. GT/SH) instead of SH/SH or GT/GT.

However, I'm having a hard time getting 60.52: at the first epoch the error is 63.95, and it grows instead of decreasing, e.g. at epoch 5 it is 66.58.

I used the following command to train: python src/predict_3dpose.py --camera_frame --residual --batch_norm --dropout 0.5 --max_norm --evaluateActionWise --use_sh --epochs 100

Thanks a lot for any help on this matter!

una-dinosauria commented 6 years ago
  1. Sorry, when you say that "it starts at 63.95", are you referring to the Average error in the validation set?
  2. What OS and TF version do you have?
matteorr commented 6 years ago

Thanks for your prompt reply!

1) I'm referring to the average error printed at line 250. Everything else in the code is unchanged, so I believe the error is computed on the test set.

2) I'm using TensorFlow 1.8.0 on Ubuntu 16.04.3 LTS (Xenial).

una-dinosauria commented 6 years ago

I think you forgot to add --procrustes to the last command. I am getting 63.8 in the first epoch and I can confirm that the error is increasing. I'll investigate this further.
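
For reference, that command with the flag appended is:

python src/predict_3dpose.py --camera_frame --residual --batch_norm --dropout 0.5 --max_norm --evaluateActionWise --use_sh --epochs 100 --procrustes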

matteorr commented 6 years ago

Yeah, I didn't include it in the command I posted, but I can confirm that I used it in the run I launched. Thanks a lot again!

una-dinosauria commented 6 years ago

Hi @matteorr,

I ran this by my colleagues, and @rayat137 noticed that when you call

_, test_set_2d, _, _, _, _  = data_utils.read_2d_predictions(actions, FLAGS.data_dir)

the 2D test set comes back normalized with SH statistics. In other words, the training data is normalized with GT statistics while the test set is normalized with SH statistics. This might explain the increase in error as training progresses.
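
As a quick way to see the mismatch, here is a rough sketch of a sanity check. It assumes it runs inside predict_3dpose.py (where actions, rcams, and FLAGS are already defined) and that both loaders return full-length mean/std vectors, as in the snippets above:

# Rough sanity check (sketch only): compare the normalization statistics
# returned by the two loaders. Large differences mean the training and test
# sets are expressed in different coordinates.
import numpy as np  # redundant if numpy is already imported in the script
_, _, gt_mean_2d, gt_std_2d, _, _ = data_utils.create_2d_data( actions, FLAGS.data_dir, rcams )
_, _, sh_mean_2d, sh_std_2d, _, _ = data_utils.read_2d_predictions( actions, FLAGS.data_dir )
print("max |mean_GT - mean_SH|:", np.max(np.abs(gt_mean_2d - sh_mean_2d)))
print("max |std_GT  - std_SH |:", np.max(np.abs(gt_std_2d - sh_std_2d)))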

Could you please give it a try after correcting that error?

Cheers,

matteorr commented 6 years ago

Hi, and thanks for looking into this.

I followed your suggestion and replaced these lines (from my post above):

train_set_2d, _, data_mean_2d, data_std_2d, dim_to_ignore_2d, dim_to_use_2d = data_utils.create_2d_data( actions, FLAGS.data_dir, rcams )
_, test_set_2d, _, _, _, _  = data_utils.read_2d_predictions(actions, FLAGS.data_dir)

with the following ones:

train_set_2d, _, _, _, _, _ = data_utils.create_2d_data( actions, FLAGS.data_dir, rcams )
_, test_set_2d, data_mean_2d, data_std_2d, dim_to_ignore_2d, dim_to_use_2d  = data_utils.read_2d_predictions(actions, FLAGS.data_dir)

Now, if I understand correctly, the variables data_mean_2d, data_std_2d, dim_to_ignore_2d, and dim_to_use_2d, which contain all the normalization information used in the evaluate_batches function, will come from the SH data.

However, even though I get slightly different numbers, I still observe the same trend: in both cases the average error starts at around 64mm and converges to around 75mm after 200 epochs.

Here is what the test losses look like in both cases (no smoothing):

[Screenshots: test error curves over training for both cases]

I didn't change anything else in the repo code, and I am able to get the numbers from the readme when running the demo, so I think the code base is correct.

una-dinosauria commented 6 years ago

Hi @matteorr,

Please note that the functions data_utils.create_2d_data and data_utils.read_2d_predictions return the data already normalized -- I should've probably chosen better names for them.

So with the changes that you just posted, the training set is still normalized with GT statistics and the test set is still normalized with SH statistics; only the statistics passed to the evaluation code changed, so the mismatch remains.

To correct the mistake, you probably want to change data_utils.read_2d_predictions to return the data unnormalized, and then normalize it with the GT statistics that create_2d_data returns.
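
For illustration, here is a minimal sketch of that fix using a hypothetical renormalize helper (not part of data_utils.py). It assumes the normalized arrays have shape (n, len(dim_to_use)), that the same dim_to_use applies to both sets of statistics, and that gt_* / sh_* stand for the statistics returned by create_2d_data and read_2d_predictions respectively:

# Hypothetical helper (not in data_utils.py): undo one normalization and apply another.
def renormalize(data_dict, old_mean, old_std, new_mean, new_std, dim_to_use):
  out = {}
  for key, vals in data_dict.items():
    pixels = vals * old_std[dim_to_use] + old_mean[dim_to_use]        # back to pixel coordinates
    out[key] = (pixels - new_mean[dim_to_use]) / new_std[dim_to_use]  # re-express with GT statistics
  return out

# Usage sketch: test set was read with SH statistics, re-express it in GT statistics.
test_set_2d = renormalize(test_set_2d, sh_mean_2d, sh_std_2d, gt_mean_2d, gt_std_2d, dim_to_use_2d)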

Sorry again for the misunderstanding.

EDIT: Typo. Changed SH for GT.

una-dinosauria commented 6 years ago

Hi @matteorr,

I tried to reproduce the GT/SH values myself, and I've realized that this is far from trivial for someone not familiar with the codebase. I've added the commit https://github.com/una-dinosauria/3d-pose-baseline/commit/22659b62b0ad187db9ff1b8e202c375140ca40d0 to a branch for this issue. As you can see, I've changed the way the data is loaded to account for the normalization differences.

However -- and crucially -- I am also using the fine-tuned SH detections.

With python src/predict_3dpose.py --camera_frame --residual --batch_norm --dropout 0.5 --max_norm --evaluateActionWise --use_sh --epochs 100 --procrustes, I am getting 54 mm of error in the first epoch, and converging toward ~51 after 5 epochs.

With python src/predict_3dpose.py --camera_frame --residual --batch_norm --dropout 0.5 --max_norm --evaluateActionWise --use_sh --epochs 100 --procrustes --predict_14 I get around 57 mm of error in the first epoch and 55 in the second one -- perhaps because the 14-joint subset has more variance than the 17-joint set.

Sorry again for this. We clearly should have made it easier to reproduce those results.

I hope you find the code useful, and please let me know if you notice any issues.

Cheers,

matteorr commented 6 years ago

No reason to apologize. Actually, thanks for the new code (I'll take a look at it over the next few days) and for being so responsive.

I have a couple more questions just to make sure I'm doing an apples-to-apples comparison:

  • Is there a reason you're using the fine-tuned SH detections, and do you remember if those are the ones you used in Table 1?
  • The use_sh flag is not needed any longer in the new branch, correct?
  • For the Table 1 experiment the predict_14 flag should be False, correct?

Thanks a lot again.

una-dinosauria commented 6 years ago

Hi @matteorr,

  • Is there a reason you're using the fine-tuned SH detections, and do you remember if those are the ones you used in Table 1?

When trying to reproduce the number, I failed to achieve good results with the SH detections without fine-tuning (probably the same thing you observed). The detections do not give very different results under other protocols, so there is probably a subtle bug here that I'm missing.

I don't remember using the FT detections. We reported the original results in the first arxiv version of our paper, and we had not fine-tuned the detector back then. Probably this is why we are getting slightly better results now with FT detections.

  • The use_sh flag is not needed any longer in the new branch, correct?

I think so. I actually believe it is not necessary after the change that you suggested either.

  • For the Table 1 experiment the predict_14 flag should be False, correct?

IIRC, when we were running those experiments we should have used 14 joints, because that is the protocol that Moreno-Noguer used. 1:1 comparisons should control for protocol as much as possible IMO.

Aside: I just tried to reproduce the GT+Noise experiments (https://github.com/una-dinosauria/3d-pose-baseline/tree/issue-60/table-1-noise if you are interested) and this seems to be the case.

Thanks again for checking the reproducibility of our work.

meijieru commented 5 years ago

@una-dinosauria Have you solved this problem now? I am also interested in the result.

When trying to reproduce the number I failed to achieve good results with the SH detections without fine-tuning

Could you please tell me what the current value for this setting is?

una-dinosauria commented 5 years ago

Hi @meijieru,

No, I haven't solved this problem yet -- I haven't had time to debug that one number in our paper. I also cannot run these things right now (I'm rushing towards a deadline too!), but maybe @matteorr's BMVC paper has the number that you want.

Good luck with the deadline.

meijieru commented 5 years ago

Thanks for your kindness. Good luck to you too.

matteorr commented 5 years ago

@una-dinosauria, thanks for the pointer to my work.

@meijieru, Fig. 4(c) in our paper summarizes the performance when training on GT and testing on SH detections.

Here's the GitHub page if you'd like to find the code and additional details.

Nicholasli1995 commented 5 years ago

I have a similar problem. I tried to replicate Ours (SH detections) (MA) in Table 2 (67.5mm) but failed; I can only get 72mm. The same model can replicate 45.5mm for Ours (GT detections) (MA), so I think the problem lies in the 2D data. I'm using the provided SH detection data, which is not fine-tuned on Human3.6M.

I also noticed that there is no spine joint in the provided SH detections; is that OK? It seems data_utils.py assumes there is a spine joint in the SH detections:

SH_NAMES = ['']*16
SH_NAMES[0]  = 'RFoot'
SH_NAMES[1]  = 'RKnee'
SH_NAMES[2]  = 'RHip'
SH_NAMES[3]  = 'LHip'
SH_NAMES[4]  = 'LKnee'
SH_NAMES[5]  = 'LFoot'
SH_NAMES[6]  = 'Hip'
SH_NAMES[7]  = 'Spine'
SH_NAMES[8]  = 'Thorax'
SH_NAMES[9]  = 'Head'
SH_NAMES[10] = 'RWrist'
SH_NAMES[11] = 'RElbow'
SH_NAMES[12] = 'RShoulder'
SH_NAMES[13] = 'LShoulder'
SH_NAMES[14] = 'LElbow'
SH_NAMES[15] = 'LWrist'

I have checked the provided SH detections and found the joint configuration is actually MPII:

0: right_foot, 1: right_knee, 2: right_hip, 3: left_hip, 4: left_knee, 5: left_foot, 6: pelvis, 7: thorax, 8: upper_neck, 9: head_top, 10: r_wrist, 11: r_elbow, 12: r_shoulder, 13: l_shoulder, 14: l_elbow, 15: l_wrist

una-dinosauria commented 5 years ago

@Nicholasli1995 Oh, maybe this could explain the discrepancy. Our code checks for name correspondences with the H3.6M skeleton, as it needs those to permute the data -- have you tried correcting the correspondence and seeing if this fixes the issue?
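
To illustrate the kind of check being suggested, here is a sketch only; the canonical SH_NAMES and H36M_NAMES lists live in data_utils.py, and the H36M_NAMES below is abbreviated to the named joints (an assumption on my part):

# Sketch: rebuild the name-based permutation that maps SH joints to the H3.6M order.
import numpy as np

SH_NAMES = ['RFoot', 'RKnee', 'RHip', 'LHip', 'LKnee', 'LFoot', 'Hip', 'Spine',
            'Thorax', 'Head', 'RWrist', 'RElbow', 'RShoulder', 'LShoulder',
            'LElbow', 'LWrist']
H36M_NAMES = ['Hip', 'RHip', 'RKnee', 'RFoot', 'LHip', 'LKnee', 'LFoot', 'Spine',
              'Thorax', 'Neck/Nose', 'Head', 'LShoulder', 'LElbow', 'LWrist',
              'RShoulder', 'RElbow', 'RWrist']

# For each named H3.6M joint that also appears in SH_NAMES, take the index of the
# matching SH joint. If SH detection slots 7/8 actually hold thorax/upper_neck
# (MPII order), then 'Spine' and 'Thorax' point at the wrong detections even
# though the permutation itself builds without errors.
sh_to_gt_perm = np.array([SH_NAMES.index(h) for h in H36M_NAMES if h in SH_NAMES])
print(sh_to_gt_perm)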

una-dinosauria commented 5 years ago

@Nicholasli1995 Upon re-reading your comment, I think you are referring to another table and another experiment. Could you please open another issue for that?

Cheers,

Nicholasli1995 commented 5 years ago

@Nicholasli1995 Upon re-reading your comment, I think you are referring to another table and another experiment. Could you please open another issue for that?

Cheers,

Thanks for the reply. I did not open another issue because I'm using the linked PyTorch implementation, and I just want to ask a few questions:

  1. What is the expected error (in mm) using the provided stacked hourglass detections (not fine-tuned on Human3.6M)? In the paper it was 67.5mm.

  2. In the provided 2D detections it seems there is no "spine" joint, but data_utils.py assumes there is one. In fact, I found "thorax", "neck" and "head" joints in the provided detections, which is consistent with the MPII joint format. I'm confused about this discrepancy.

I'm using a PyTorch implementation and can replicate the 45.5mm error for ground-truth data. However, I cannot reproduce the 67.5mm error (I got 72-73mm), so I think there might be a data-processing problem.

Thank you again for your help.

una-dinosauria commented 5 years ago

@Nicholasli1995 I do not maintain the Pytorch implementation. Please open an issue in the corresponding repository.

The expected number should be the one in the paper.