syguan96 / DynaBOA

[T-PAMI 2022] Out-of-Domain Human Mesh Reconstruction via Dynamic Bilevel Online Adaptation

Results Using Predicted 2D Keypoints #13

Closed juxuan27 closed 2 years ago

juxuan27 commented 2 years ago

Thank you so much for your excellent work! But I ran into some problems while testing the model on predicted 2D keypoints (using AlphaPose Fast Pose, the same backbone mentioned in the README) on the 3DPW dataset. This is how I tried:

The final results are as follows (plus MPJPE on the X, Y, and Z axes):

| metric | DynaBOA w/ GT 2D | DynaBOA w/ pred 2D |
| --- | --- | --- |
| MPJPE | 65.56 | 186.74 |
| PA-MPJPE | 40.92 | 77.57 |
| PVE | 83.11 | 195.09 |
| MPJPE (X axis) | 21.06 | 67.5 |
| MPJPE (Y axis) | 25.58 | 57.8 |
| MPJPE (Z axis) | 50.43 | 140.7 |
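For reference, the per-axis numbers were computed roughly like this (a minimal sketch, not the exact evaluation code; inputs are assumed to be aligned 3D joints in millimetres):

```python
import numpy as np

def mpjpe_per_axis(pred, gt):
    """Mean per-joint position error, overall and per axis.

    pred, gt: (N, J, 3) arrays of 3D joints in millimetres,
    already aligned (e.g. Procrustes-aligned for PA-MPJPE).
    Returns (overall_mpjpe, per_axis_mpjpe), where per_axis_mpjpe
    is a length-3 array for the X, Y, Z components.
    """
    diff = pred - gt                                  # (N, J, 3)
    overall = np.linalg.norm(diff, axis=-1).mean()    # Euclidean error per joint
    per_axis = np.abs(diff).mean(axis=(0, 1))         # |dx|, |dy|, |dz| averaged
    return overall, per_axis
```

Note that the per-axis errors are absolute coordinate differences, so they do not sum to the overall MPJPE (which is a Euclidean norm).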

I was quite confused about why the results were so bad, so I tried adding a Gaussian perturbation to the ground-truth 2D keypoints and ran the 3DPW baseline. The code I changed is as follows.

https://github.com/syguan96/DynaBOA/blob/b8d2bbe9d8e827a36e72bb324a9a6e43f421ae31/boa_dataset/pw3d.py#L58

changed to (e.g. sigma=1):

```python
self.smpl_j2ds.append(smpl_j2ds + np.random.normal(0, 1, size=tuple(smpl_j2ds.shape)))
```

And here is the result (Gaussian perturbation on ground-truth 2D):

Furthermore, I calculated the mean variance between the ground-truth 2D and the AlphaPose-predicted 2D keypoints, and the result is 12.65. Under the assumption that the detected 2D is the ground-truth 2D plus Gaussian noise, the result is expected to be worse.
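Concretely, the mean-variance statistic treats the detections as noisy ground truth; it can be sketched like this (not the exact code I ran; the `valid` mask for filtering missed joints is an assumption):

```python
import numpy as np

def keypoint_noise_stats(gt2d, pred2d, valid):
    """Treat pred2d as gt2d + noise and report the noise mean and variance.

    gt2d, pred2d: (N, J, 2) keypoints in pixels; valid: (N, J) boolean
    mask that filters out missed or mismatched joints beforehand.
    """
    noise = (pred2d - gt2d)[valid]   # (M, 2) residuals of valid joints only
    return noise.mean(), noise.var()
```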

So does that mean DynaBOA is not compatible with detected 2D keypoints? Or is it because of some improper operation on my part?

Thank you so much for your patience in reading my issue.

syguan96 commented 2 years ago

Hi @juxuan27, sorry for the late reply; I just finished a deadline. The experiment is very insightful and inspiring, and I'm glad to discuss it with you.

In the results from AlphaPose, some joints are missing. Such results are not good supervision for network adaptation, especially for online adaptation. Top-down methods are indeed more appropriate. I also think the Gaussian-noise assumption is not very appropriate for the results of bottom-up methods.
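For example, one simple way to keep lost joints from supervising the adaptation is to weight the 2D loss by detector confidence (a rough sketch, not DynaBOA's actual loss; the threshold value is illustrative):

```python
import numpy as np

def masked_kp2d_loss(pred2d, det2d, conf, thresh=0.3):
    """2D keypoint loss that ignores missed or low-confidence detections.

    pred2d, det2d: (J, 2) projected vs detected joints; conf: (J,)
    detector confidence scores, 0 for joints the detector lost.
    `thresh` is an illustrative cut-off, not a value from the paper.
    """
    w = (conf > thresh) * conf                       # hard gate times soft weight
    err = np.linalg.norm(pred2d - det2d, axis=-1)    # per-joint pixel error
    denom = w.sum()
    return float((w * err).sum() / denom) if denom > 0 else 0.0
```

With this weighting, a frame in which the detector drops a limb simply contributes no gradient for those joints instead of pulling the mesh toward a bogus target.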

syguan96 commented 2 years ago

Also, the annotation gap will influence the evaluation results.

syguan96 commented 2 years ago

The hyperparameters should also be tuned.

syguan96 commented 2 years ago

> I calculate the mean-variance of ground truth 2d and Alphapose predicted 2d, and the result is 12.65.

Do you ignore a joint if it is missed?

juxuan27 commented 2 years ago

Thank you for your reply!

To calculate the mean variance between the ground-truth 2D and AlphaPose-predicted 2D keypoints, I filtered out the missed and mismatched joints. But the result may not be completely accurate, since AlphaPose sometimes outputs more than one person's annotation for a single-person image; in that situation I keep the detection with the minimum MPJPE. I also found that the mean of the difference between ground-truth 2D and AlphaPose-predicted 2D is about 0 (maybe -0.0xxx, I forget the exact number).

I agree with your idea that the results of AlphaPose should not be assumed to be Gaussian noise: if they were, the mean variance between ground-truth 2D and AlphaPose-predicted 2D should be between 1 and 1.5.

I also wonder whether, when the 2D ground truth of the current image is given as input, the model tends to overfit to the 2D annotation instead of the temporal information. Note that both the lower-level and the upper-level optimization steps have the 2D ground truth in the loss function.

What's more, I think it may be worth freezing the model after fine-tuning on the 3DPW train set, and then directly running inference on the 3DPW train set. This may help us further understand how it works 😄
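For reference, the minimum-error person selection I described can be sketched like this (a minimal sketch, not my exact code; shapes are illustrative):

```python
import numpy as np

def pick_best_detection(gt2d, detections):
    """When AlphaPose returns several people for a single-person frame,
    keep the detection whose mean joint distance to the GT 2D is smallest.

    gt2d: (J, 2) ground-truth keypoints; detections: list of (J, 2) arrays.
    """
    errs = [np.linalg.norm(d - gt2d, axis=-1).mean() for d in detections]
    return detections[int(np.argmin(errs))]
```

Note that this selection is itself slightly optimistic: it uses the ground truth to pick the detection, so the measured noise is a lower bound on what an unsupervised pipeline would see.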

juxuan27 commented 2 years ago

> The hyperparameters also should be tuned.
>
> Also, the annotation gap will also influence the evaluation results.

I agree hahhhh

syguan96 commented 2 years ago

> I don't know whether there is a possibility that when input 2d ground truth of the present image, the model tends to overfit on the 2d annotation instead of the temporal information. Note that the lower-level optimization step and upper-level optimization step all have 2d ground truth in the loss function. What's more, I think maybe there is a need to conduct experiments on freezing the model after fine-tuning on the 3DPW train set, then directly inference on the 3DPW train set. This may help us further understand how it works 😄

Hi, I just finished the Spring Festival holiday. For point 1, refer to Tab. 7 in our paper: with the temporal constraint, MPJPE and PVE show more significant improvement, and these metrics are tightly related to temporal correlation. So I think that, under bilevel optimization, the temporal and single-frame (GT 2D keypoint) terms are complementary constraints. But I agree the single-frame term is more important.

For point 2, if I fine-tune on the 3DPW training set, should the GT 3D mesh/joints be used? In Tab. 4, I fine-tuned SPIN on the 3DPW test set with GT 2D keypoints (termed *SPIN). I also compared with other baselines that are fine-tuned on the 3DPW training set; please refer to that table for more details.

syguan96 commented 2 years ago

As for analyzing the noise distribution of each joint, I think using the results detected by AlphaPose is not appropriate, since handling missed joints is hard. Top-down methods may be a more appropriate alternative; this is my guess. Maybe we can chat by email or WeChat (shuishiguanshanyan).

juxuan27 commented 2 years ago

Thank you for your answer! I've added your WeChat!