mmahdavian / STPOTR

Human Pose and Hip Trajectory Prediction Using Transformers
GNU General Public License v3.0

Decoder #8

Open y1131388949 opened 1 month ago

y1131388949 commented 1 month ago

I noticed that in both training and validation, the decoder input is the ground-truth sequence to be predicted, but what should the decoder input be when using the trained STPoseTransformer for inference? Your paper states that the last frame of the input sequence is copied as the decoder input; what exactly does this look like? If I want to use my own recognized 3D human keypoints as input for prediction, what format should the decoder input have?

mmahdavian commented 1 month ago

Hi @y1131388949. Each skeleton frame has 17 joints, and each joint is 3 numbers: x, y, z. So the encoder input is 5x17x3. You need to copy the fifth frame 20 times to make a 20x17x3 tensor and feed that to the decoder. Depending on how you trained the model, you can subtract the first joint (hip) value from the rest of the joints in every frame, so all joint values are relative to the hip joint. You can also normalize the input and denormalize the output for better performance. These are the settings used for training the provided pre-trained model.
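The preprocessing described above can be sketched in NumPy. The function names and the normalization constants are mine for illustration, not from the repo; shapes follow the comment (5 input frames, 17 joints, x/y/z, 20 predicted frames):

```python
import numpy as np

def build_decoder_input(input_seq: np.ndarray, target_len: int = 20) -> np.ndarray:
    """Tile the last encoder frame as the decoder query sequence.

    input_seq: (5, 17, 3) array of 3D joint positions.
    Returns a (target_len, 17, 3) array, i.e. the fifth frame copied 20 times.
    """
    last_frame = input_seq[-1:]                 # shape (1, 17, 3)
    return np.repeat(last_frame, target_len, axis=0)

def to_hip_relative(seq: np.ndarray) -> np.ndarray:
    """Subtract the first joint (hip) from every joint in every frame."""
    return seq - seq[:, :1, :]

# Example: 5 input frames of a 17-joint skeleton.
encoder_input = np.random.randn(5, 17, 3)
encoder_input = to_hip_relative(encoder_input)
decoder_input = build_decoder_input(encoder_input)
print(decoder_input.shape)   # (20, 17, 3)
```

Normalization would then be applied to both tensors with statistics saved at training time, and the model's output denormalized with the same statistics before use.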

y1131388949 commented 1 month ago

Thank you very much for your answer, it was very useful for me. I noticed that in the H36MDataset_v3 file, the selected human-body keypoints are _MAJOR_JOINTS = [0, 1, 2, 5, 6, 7, 11, 12, 13, 14, 16, 17, 18, 24, 25, 26], which is a total of 16 keypoints, not the 17 you mentioned. Which parts of the human body do these points correspond to? The keypoint diagrams I've found online don't seem to match.

y1131388949 commented 1 month ago

(attached image: H36M skeleton)

mmahdavian commented 1 month ago

@y1131388949 As far as I remember, this is the skeleton structure: (see attached figure)

If there are 16 joints used in the data loading part, I guess it's removing the hip joint.