syguan96 / DynaBOA

[T-PAMI 2022] Out-of-Domain Human Mesh Reconstruction via Dynamic Bilevel Online Adaptation

Results Using Predicted 2D Keypoints #13

Closed juxuan27 closed 2 years ago

juxuan27 commented 2 years ago

Thank you so much for your excellent work! But I ran into some problems while testing the model on predicted 2D keypoints (using AlphaPose Fast Pose, the same backbone mentioned in the README) on the 3DPW dataset. This is how I tried:

The final results are as follows (plus MPJPE on the X, Y, and Z axes):

| metric | DynaBOA w/ GT 2D | DynaBOA w/ pred 2D |
| --- | --- | --- |
| MPJPE | 65.56 | 186.74 |
| PA-MPJPE | 40.92 | 77.57 |
| PVE | 83.11 | 195.09 |
| MPJPE (X axis) | 21.06 | 67.5 |
| MPJPE (Y axis) | 25.58 | 57.8 |
| MPJPE (Z axis) | 50.43 | 140.7 |
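For reference, the per-axis numbers were computed roughly like this (a minimal sketch, not the exact evaluation code; inputs are assumed to be aligned 3D joints in millimetres):

```python
import numpy as np

def mpjpe_per_axis(pred, gt):
    """Mean per-joint position error, overall and per axis.

    pred, gt: (N, J, 3) arrays of 3D joints in millimetres,
    already aligned (e.g. Procrustes-aligned for PA-MPJPE).
    Returns (overall_mpjpe, per_axis_mpjpe), where per_axis_mpjpe
    is a length-3 array for the X, Y, Z components.
    """
    diff = pred - gt                                  # (N, J, 3)
    overall = np.linalg.norm(diff, axis=-1).mean()    # Euclidean error per joint
    per_axis = np.abs(diff).mean(axis=(0, 1))         # |dx|, |dy|, |dz| averaged
    return overall, per_axis
```

Note that the per-axis errors are absolute coordinate differences, so they do not sum to the overall MPJPE (which is a Euclidean norm).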

I was quite confused about why the results were so bad, so I tried adding a Gaussian perturbation to the ground-truth 2D keypoints and ran the 3DPW baseline. The code I changed is as follows.

https://github.com/syguan96/DynaBOA/blob/b8d2bbe9d8e827a36e72bb324a9a6e43f421ae31/boa_dataset/pw3d.py#L58

changed to (e.g. sigma=1):

```python
self.smpl_j2ds.append(smpl_j2ds + np.random.normal(0, 1, size=tuple(smpl_j2ds.shape)))
```

And here is the result (Gaussian perturbation on ground-truth 2D):

Furthermore, I calculated the mean variance between the ground-truth 2D and the AlphaPose-predicted 2D keypoints, and the result is 12.65. Under the assumption that the detected 2D is the ground-truth 2D plus Gaussian noise, the result is expected to be worse.
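Concretely, the mean-variance statistic treats the detections as noisy ground truth; it can be sketched like this (not the exact code I ran; the `valid` mask for filtering missed joints is an assumption):

```python
import numpy as np

def keypoint_noise_stats(gt2d, pred2d, valid):
    """Treat pred2d as gt2d + noise and report the noise mean and variance.

    gt2d, pred2d: (N, J, 2) keypoints in pixels; valid: (N, J) boolean
    mask that filters out missed or mismatched joints beforehand.
    """
    noise = (pred2d - gt2d)[valid]   # (M, 2) residuals of valid joints only
    return noise.mean(), noise.var()
```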

So does that mean DynaBOA is not compatible with detected 2D keypoints? Or is it because of some improper operation on my part?

Thank you so much for your patience in reading my issue.

syguan96 commented 2 years ago

Hi @juxuan27, sorry for the late reply; I just finished a deadline. The experiment is very insightful and inspiring, and I'm glad to discuss it with you.

In the results from AlphaPose, some joints are missing. Such results are not good supervision for network adaptation, especially for online adaptation. Top-down methods are indeed more appropriate. I also think the Gaussian-noise assumption is not very appropriate for the results of bottom-up methods.
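For example, one simple way to keep lost joints from supervising the adaptation is to weight the 2D loss by detector confidence (a rough sketch, not DynaBOA's actual loss; the threshold value is illustrative):

```python
import numpy as np

def masked_kp2d_loss(pred2d, det2d, conf, thresh=0.3):
    """2D keypoint loss that ignores missed or low-confidence detections.

    pred2d, det2d: (J, 2) projected vs detected joints; conf: (J,)
    detector confidence scores, 0 for joints the detector lost.
    `thresh` is an illustrative cut-off, not a value from the paper.
    """
    w = (conf > thresh) * conf                       # hard gate times soft weight
    err = np.linalg.norm(pred2d - det2d, axis=-1)    # per-joint pixel error
    denom = w.sum()
    return float((w * err).sum() / denom) if denom > 0 else 0.0
```

With this weighting, a frame in which the detector drops a limb simply contributes no gradient for those joints instead of pulling the mesh toward a bogus target.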

syguan96 commented 2 years ago

Also, the annotation gap will influence the evaluation results.

syguan96 commented 2 years ago

The hyperparameters should also be tuned.

syguan96 commented 2 years ago

> I calculate the mean-variance of ground truth 2d and Alphapose predicted 2d, and the result is 12.65.

Do you ignore a joint if it is missed?

juxuan27 commented 2 years ago

Thank you for your reply!

To calculate the mean variance between the ground-truth 2D and AlphaPose-predicted 2D keypoints, I filtered out the missed and mismatched joints. But the result may not be completely accurate, since AlphaPose sometimes outputs more than one person's annotation for a single-person image; in that situation I keep the detection with the minimum MPJPE. I also found that the mean of the difference between ground-truth 2D and AlphaPose-predicted 2D is about 0 (maybe -0.0xxx, I forget the exact number).

I agree with your idea that the results of AlphaPose should not be assumed to be Gaussian noise: if they were, the mean variance between ground-truth 2D and AlphaPose-predicted 2D should be between 1 and 1.5.

I also wonder whether, when the 2D ground truth of the current image is given as input, the model tends to overfit to the 2D annotation instead of the temporal information. Note that both the lower-level and the upper-level optimization steps have the 2D ground truth in the loss function.

What's more, I think it may be worth freezing the model after fine-tuning on the 3DPW train set, and then directly running inference on the 3DPW train set. This may help us further understand how it works 😄
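For reference, the minimum-error person selection I described can be sketched like this (a minimal sketch, not my exact code; shapes are illustrative):

```python
import numpy as np

def pick_best_detection(gt2d, detections):
    """When AlphaPose returns several people for a single-person frame,
    keep the detection whose mean joint distance to the GT 2D is smallest.

    gt2d: (J, 2) ground-truth keypoints; detections: list of (J, 2) arrays.
    """
    errs = [np.linalg.norm(d - gt2d, axis=-1).mean() for d in detections]
    return detections[int(np.argmin(errs))]
```

Note that this selection is itself slightly optimistic: it uses the ground truth to pick the detection, so the measured noise is a lower bound on what an unsupervised pipeline would see.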

juxuan27 commented 2 years ago

> The hyperparameters also should be tuned.
>
> Also, the annotation gap will also influence the evaluation results.

I agree hahhhh

syguan96 commented 2 years ago

> I don't know whether there is a possibility that when input 2d ground truth of the present image, the model tends to overfit on the 2d annotation instead of the temporal information. Note that the lower-level optimization step and upper-level optimization step all have 2d ground truth in the loss function. What's more, I think maybe there is a need to conduct experiments on freezing the model after fine-tuning on the 3DPW train set, then directly inference on the 3DPW train set. This may help us further understand how it works 😄

Hi, I just finished the Spring Festival holiday. For point 1, refer to Tab. 7 in our paper: with the temporal constraint, MPJPE and PVE show more significant improvement, and these metrics are tightly related to temporal correlation. So I think that, under bilevel optimization, the temporal and single-frame (GT 2D keypoint) terms are complementary constraints. But I agree the single-frame term is more important.

For point 2, if I fine-tune on the 3DPW training set, should the GT 3D mesh/joints be used? In Tab. 4, I fine-tuned SPIN on the 3DPW test set with GT 2D keypoints (termed *SPIN). I also compared with other baselines that are fine-tuned on the 3DPW training set; please refer to that table for more details.

syguan96 commented 2 years ago

As for analyzing the noise distribution of each joint, I think using the results detected by AlphaPose is not appropriate, since handling missed joints is hard. Top-down methods may be a more appropriate alternative; this is my guess. Maybe we can chat by email or WeChat (shuishiguanshanyan).

juxuan27 commented 2 years ago

Thank you for your answer! I've added your WeChat!