Custom Dataset - Githubissues

seungyonglee0802 commented 2 months ago

Thanks for your nice work :) I'm trying to apply ScoreHMR to the SMPL-X model. In this case, I think I need to train the diffusion model on an SMPL-X dataset such as BEDLAM. (Since the pose parameters differ between SMPL and SMPL-X, it seems inevitable to redesign the FN model and train from scratch.) I couldn't find how you built the dataset used to train ScoreHMR. Could you explain how to make a custom dataset? Thanks again, Ryan.

++) I have one more question. I'm not sure what vitpose_2d[[0, 16, 15, 18, 17, 5, 2, 6, 3, 7, 4, 12, 9, 13, 10, 14, 11]] is intended to do in demo_image.py. I hope you can clarify this. Thanks

statho commented 2 months ago

Hello! Yes, to make it work with the SMPL-X model you would need to train the model from scratch. In our case we used model fits from SPIN and EFT. For Human3.6M we used fits from MoSh because they are more accurate. We used the fits that were publicly available (released by the authors of those papers or follow-up works).

To make ScoreHMR work with the SMPL-X model, you would need to create a dataset with (image, SMPL-X pseudo-GT) pairs through model fitting (to 2D keypoint detections). You should use a keypoint detector that also estimates the keypoint locations for joints in the hands and face. Please take a look at the SMPL-X fitting code from this repo. Another thing you could do is initialize the SMPL-X model fitting from a regression estimate, so have a look at SMPLer-X for this. An important step is to keep only the model fits that are accurate and discard invalid fits when creating your dataset. A proxy for discarding invalid fits is the reprojection error.
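As a rough illustration of using the reprojection error to discard invalid fits, here is a minimal NumPy sketch. It is not the filtering code used for the paper: the function names, the confidence threshold of 0.3, and the 10-pixel cutoff are all assumptions made for the example, and a real pipeline would likely normalize the error by the person's bounding-box size.

```python
import numpy as np

def reprojection_error(pred_2d, det_2d, conf_thresh=0.3):
    """Mean pixel distance between projected model joints and detections.

    pred_2d: (N, 2) projected model joints (hypothetical input).
    det_2d:  (N, 3) detected keypoints as (x, y, confidence).
    Only detections with confidence above conf_thresh are used.
    """
    pred_2d = np.asarray(pred_2d, dtype=np.float64)
    det_2d = np.asarray(det_2d, dtype=np.float64)
    valid = det_2d[:, 2] > conf_thresh
    if not valid.any():
        # No reliable detections: treat the fit as unverifiable.
        return float("inf")
    dist = np.linalg.norm(pred_2d[valid] - det_2d[valid, :2], axis=1)
    return float(dist.mean())

def keep_fit(pred_2d, det_2d, max_err_px=10.0):
    """Keep a fit only if its reprojection error is below the cutoff (assumed value)."""
    return reprojection_error(pred_2d, det_2d) < max_err_px
```

Fits for which `keep_fit` returns `False` would simply be left out of the (image, pseudo-GT) dataset.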

Finally, the line in the demo_image.py script is there to make sure that the detected keypoints from ViTPose have the correct format, which can be seen here.

seungyonglee0802 commented 2 months ago

Thanks for your reply @statho :) I have some questions about the reply.

Q1,2.

To make ScoreHMR work with the SMPL-X model, you would need to create a dataset with (image, SMPL-X pseudo-GT) pairs through model fitting (to 2D keypoint detection).

I'm asking because I've never trained models using SMPL (or SMPL-X) before. Why do I need pseudo ground truth? Don't the public datasets already provide images paired with SMPL(-X) model parameters?
I'm not sure why you mentioned 2D keypoint detection... Does it mean that I need to obtain 2D keypoints from the 3D SMPL-X joints?

Q3.

Finally, the line in the demo_image.py script is there to make sure that the detected keypoints from ViTPose have the correct format, which can be seen here.

For the 2D-keypoint-conditioned diffusion, from the paper, I need to calculate $$\|y - A(\hat{x}_0(x_t))\|_2^2$$, which means it's essential to get the 25 (2D projected) SMPL joints from the OpenPose or ViTPose detector you used. However, from that line, I think you got only 17 keypoints from ViTPose. I'm not sure what I missed.

P.S. I am new to the field of Human Mesh Recovery and have been getting into the related papers since this CVPR. Even if my questions seem strange, I hope you can understand and answer them. Thanks

statho commented 2 months ago

Hello! Below are the answers to your questions:

The datasets I used in the paper were originally released with 2D keypoint annotations (COCO, MPI) or 2D and 3D keypoint annotations (Human3.6M, MPI-INF-3DHP). SMPL p-GT are available from follow-up works (that performed model fitting to 2D keypoints, 3D keypoints, or markers, depending on what is available in each dataset). As a starting point you could explore Motion-X, which contains (image, SMPL-X p-GT) pairs.
What I was describing is that you would need to detect 2D keypoints with another system (e.g., we use OpenPose or ViTPose for SMPL), and then perform model fitting (run an iterative optimization procedure) to fit the parametric model to the 2D keypoint detections. The main component of your loss would be a term between the 2D keypoint detections and the SMPL-X joints, after reprojecting the 3D joints to 2D with a camera (that also needs to be optimized). However, I would suggest using an existing dataset (e.g., Motion-X) and skipping this dataset construction step.
Yes, ViTPose only detects 17 keypoints, and this line converts their format to the OpenPose format (yes, OpenPose uses a different definition for them and detects 25 keypoints). This is not an issue since we just need a correspondence between the SMPL joints and the detected joints. This correspondence has been established for the OpenPose joints, which is why we convert the keypoint detections to that format.
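To make the format conversion concrete, here is a minimal sketch of how the index list from the question can place 17 COCO-ordered keypoints into a 25-slot OpenPose (BODY_25) layout. It assumes the list is used as the destination index of an assignment into a zero-initialized (25, 3) array; the exact surrounding code in demo_image.py is not shown in this thread, and the function name is made up for the example.

```python
import numpy as np

# Entry i is the OpenPose BODY_25 slot that receives COCO keypoint i
# (e.g., COCO 1 = left eye goes to OpenPose slot 16).
COCO_TO_OPENPOSE = [0, 16, 15, 18, 17, 5, 2, 6, 3, 7, 4, 12, 9, 13, 10, 14, 11]

def coco17_to_openpose25(kps17):
    """Scatter (17, 3) keypoints (x, y, confidence) into a (25, 3) array.

    Slots without a COCO counterpart (neck, mid-hip, feet) stay at zero,
    so their confidence is 0 and they can be ignored by a weighted loss.
    """
    kps17 = np.asarray(kps17, dtype=np.float32)
    if kps17.shape != (17, 3):
        raise ValueError("expected keypoints of shape (17, 3)")
    out = np.zeros((25, 3), dtype=np.float32)
    out[COCO_TO_OPENPOSE] = kps17
    return out
```

Because the eight unfilled slots carry zero confidence, only the 17 detected joints contribute when the loss is weighted by detection confidence.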

seungyonglee0802 commented 2 months ago

Thanks again @statho. I hope I can share my results after I finish my project :) This is my last question. You said the OpenPose keypoints and the reprojected 2D SMPL joints have a correspondence. Then how about SMPL-X? The joints of SMPL and SMPL-X are similar but not the same. This means the OpenPose keypoints and the SMPL-X 2D joints don't have a correspondence, do they? To sum up, from Eq. 12 of your paper, I think $$y_{kp}$$ necessarily has to correspond to the SMPL-X joints, and for SMPL-X, I think OpenPose is not a candidate for $$y_{kp}$$.

statho commented 2 months ago

Hello! We used only 25 joints from OpenPose, but it can detect 135 joints. Please see the definition of SMPL-X and OpenPose joints here.

The mapping between the SMPL-X and OpenPose joints can be found here. Essentially, this function defines a mapping from SMPL-X joints to OpenPose joints, so that you can use them to compute the keypoint reprojection loss. A simplification of this for SMPL can be found here, where we map SMPL joints to the corresponding ones from OpenPose.
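The following sketch shows how such a joint mapping could feed a keypoint reprojection loss: select the mapped model joints, project them with a simple perspective camera, and compare them with the matching detections, weighted by detection confidence. The function names and the index arguments are hypothetical placeholders, not the actual SMPL-X to OpenPose mapping linked above, and in a real fitting loop this would be written in an autodiff framework so the pose and camera can be optimized.

```python
import numpy as np

def perspective_project(joints_3d, focal, center, trans):
    """Project (N, 3) joints to pixels with a pinhole camera.

    focal: focal length in pixels, center: (2,) principal point,
    trans: (3,) camera translation added to the joints (all assumed inputs).
    """
    pts = np.asarray(joints_3d, dtype=np.float64) + np.asarray(trans, dtype=np.float64)
    xy = pts[:, :2] / pts[:, 2:3]          # divide by depth
    return focal * xy + np.asarray(center, dtype=np.float64)

def keypoint_loss(joints_3d, det_2d, model_idx, det_idx, focal, center, trans):
    """Confidence-weighted squared 2D error over corresponding joints.

    model_idx[k] and det_idx[k] name the same anatomical joint in the
    model's ordering and the detector's ordering, respectively.
    """
    joints_3d = np.asarray(joints_3d, dtype=np.float64)
    det_2d = np.asarray(det_2d, dtype=np.float64)
    proj = perspective_project(joints_3d[model_idx], focal, center, trans)
    target = det_2d[det_idx]
    conf = target[:, 2]
    sq_err = ((proj - target[:, :2]) ** 2).sum(axis=1)
    return float((conf * sq_err).sum())
```

With this structure, switching from SMPL to SMPL-X only changes the index pairs passed as `model_idx` and `det_idx`; the loss itself stays the same.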

statho / ScoreHMR

Custom Dataset #30