makeme-zgz closed this issue 2 years ago
During training, I add masks to the self-attention module so that the zero padding does not influence the calculation results. Therefore, the motion generated during inference will not contain zero padding.
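A minimal sketch of the masking idea described above, assuming a PyTorch `nn.MultiheadAttention`-style setup (the actual repository's architecture may differ); the lengths, dimensions, and batch here are made-up illustration values:

```python
import torch
import torch.nn as nn

# Hypothetical batch: 2 motion sequences padded to 6 frames,
# with true lengths 4 and 6; feature dimension 8.
torch.manual_seed(0)
batch, max_frames, dim = 2, 6, 8
x = torch.randn(batch, max_frames, dim)
lengths = torch.tensor([4, 6])

# Key-padding mask: True marks padded (invalid) frames.
frame_ids = torch.arange(max_frames).unsqueeze(0)     # (1, max_frames)
key_padding_mask = frame_ids >= lengths.unsqueeze(1)  # (batch, max_frames)

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=2, batch_first=True)
# Masked positions are excluded from the attention weights, so padded
# frames cannot influence the outputs at the valid positions.
out, _ = attn(x, x, x, key_padding_mask=key_padding_mask)
print(out.shape)  # torch.Size([2, 6, 8])
```

The mask only hides padded frames as attention *keys*; the output tensor keeps the padded shape, which is why the padded positions never leak into valid frames during training.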
Let's say the output shape of the transformer is batch x frames x ..., where 'frames' determines the length of the motion. Do you mean that no masking is applied during generation, so every generated frame will be valid?
Then how do we generate motions of various lengths?
We trained our model on varied-length motion sequences. Therefore, you can assign different motion lengths to generate different motions.
Let's say the output shape of the transformer is batch x frames x ..., where 'frames' determines the length of the motion. Do you mean that no masking is applied during generation, so every generated frame will be valid?
Yes, but we can assign different initial lengths (60, 100, 196, and so on). Although every frame is generated for the given motion length, we can still generate motions of different durations by changing that value.
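In other words, the caller picks the number of frames up front, and every frame in that window comes out valid. A toy sketch of this interface (the function name, text conditioning, and dimensions are hypothetical placeholders, not the repository's actual API):

```python
import torch

# Hypothetical stand-in for the trained model: at inference no padding mask
# is used, so every frame in the requested window is a valid frame.
def generate_motion(text: str, num_frames: int, dim: int = 8) -> torch.Tensor:
    # A real model would condition on the text prompt; here we return noise
    # of the requested length purely to illustrate the calling convention.
    return torch.randn(1, num_frames, dim)

# Different given lengths produce motions of different durations.
for n in (60, 100, 196):
    motion = generate_motion("a person walks forward", n)
    print(motion.shape)
```

Because no frames are padding, there is nothing to strip afterwards; the chosen `num_frames` is the motion's length.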
In practice, we normally give only the text to guide the generation. Are you suggesting that we first predict the motion length from the given text, and then supply both the text and the motion length to the model to generate motions?
Predicting the expected length is one solution; Generating Diverse and Natural 3D Human Motions from Text adopts this method. From my perspective, the ideal application would give the program a timeline, pointing out what action we expect in each time interval, and the program would then generate a long motion sequence under those constraints. In that situation, I think it is acceptable for users to give their expected motion lengths.
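For the length-prediction route, a toy regressor could map a pooled text embedding to an expected frame count. This is only a sketch under assumed dimensions, not the cited paper's architecture (which learns a distribution over lengths):

```python
import torch
import torch.nn as nn

# Toy sketch: regress an expected motion length from a pooled text
# embedding, then round to a whole number of frames. All dimensions
# (text_dim=32, max_frames=196) are illustrative assumptions.
class LengthPredictor(nn.Module):
    def __init__(self, text_dim: int = 32, max_frames: int = 196):
        super().__init__()
        self.max_frames = max_frames
        self.net = nn.Sequential(
            nn.Linear(text_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),  # predicted fraction of the maximum length
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        frac = self.net(text_emb).squeeze(-1)       # (batch,)
        return (frac * self.max_frames).round().clamp(min=1)

pred = LengthPredictor()
text_emb = torch.randn(4, 32)  # placeholder text embeddings
lengths = pred(text_emb)
print(lengths.shape)  # torch.Size([4])
```

The predicted length would then be passed, together with the text, as the generation window size discussed above.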
Thanks so much for the insights and suggestion!
Normally in the inference process, we provide only the text to guide the generation, and the generated motion can contain zero padding, since we add padding during training. My question is: how can we remove the predicted padding from the generated motion?