mingyuan-zhang / MotionDiffuse

MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
https://mingyuan-zhang.github.io/projects/MotionDiffuse.html

How to recognize the padding in the generation process #2

Closed makeme-zgz closed 2 years ago

makeme-zgz commented 2 years ago

Normally in the inference process, we only provide the text to guide the generation, and the generated motion can contain zero padding, since we add padding during training. My question is: how can we remove the predicted padding from the generated motion?

mingyuan-zhang commented 2 years ago

During training, I add masks to the self-attention module so that the zero padding does not influence the calculation results. Therefore, the generated motion will not contain zero padding during inference.
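
For context, here is a minimal sketch of how a key padding mask keeps zero-padded frames out of the attention computation. The tensor names, shapes, and the use of `torch.nn.MultiheadAttention` are illustrative, not the repository's actual module:

```python
import torch
import torch.nn as nn

batch, max_frames, dim = 2, 196, 512
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

# Per-sample valid lengths; frames beyond each length are zero padding.
lengths = torch.tensor([60, 196])
x = torch.randn(batch, max_frames, dim)

# True marks padded positions; attention then ignores them as keys,
# so padded frames cannot influence the valid ones.
padding_mask = torch.arange(max_frames)[None, :] >= lengths[:, None]

out, _ = attn(x, x, x, key_padding_mask=padding_mask)
```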

makeme-zgz commented 2 years ago

Let's say the output shape from the transformer is `batch x frames x ...`, where `frames` determines the length of the motion. Do you mean that during generation there is no masking applied, so every generated frame will be valid?

makeme-zgz commented 2 years ago

Then how do we generate motions of various lengths?

mingyuan-zhang commented 2 years ago

> Then how do we generate motions of various lengths?

We train our model on variable-length motion sequences. Therefore, you can assign different motion lengths to generate motions of different durations.

mingyuan-zhang commented 2 years ago

> Let's say the output shape from the transformer is `batch x frames x ...`, where `frames` determines the length of the motion. Do you mean that during generation there is no masking applied, so every generated frame will be valid?

Yes. But we can assign different initial lengths (60, 100, 196, and so on). Every frame is generated for the given motion length, so we can still generate different motions by varying that value.
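
As a rough illustration of this idea (a hypothetical DDPM-style sampler, not the repository's actual API), the motion length is fixed by the shape of the initial noise, so the caller can request any number of frames:

```python
import torch

pose_dim, num_steps = 263, 1000  # illustrative values (HumanML3D-style features)

def sample(model, text_emb, num_frames):
    """Hypothetical sampler: the motion length is fixed up front by the
    shape of the initial noise, so any length can be requested."""
    x = torch.randn(1, num_frames, pose_dim)       # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        x = model(x, torch.tensor([t]), text_emb)  # one denoising step
    return x

# The same text prompt can then be sampled at several lengths:
# for n in (60, 100, 196):
#     motion = sample(model, text_emb, n)  # shape (1, n, pose_dim)
```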

makeme-zgz commented 2 years ago

In application, we normally give only the text to guide the generation. Are you suggesting we first predict the motion length from the given text, and then supply both the text and the motion length to the model to generate motions?

mingyuan-zhang commented 2 years ago

> In application, we normally give only the text to guide the generation. Are you suggesting we first predict the motion length from the given text, and then supply both the text and the motion length to the model to generate motions?

Predicting the expected length is one solution; *Generating Diverse and Natural 3D Human Motions from Text* adopts this method. From my perspective, the ideal application is one where we give the program a timeline and specify which action we expect in each time interval. The program can then generate a long motion sequence under the given constraints. In that situation, I think it is acceptable for users to supply their expected motion lengths.
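
A sketch of what that timeline interface might look like; the data structure and the `generate_segment` helper are hypothetical, purely to make the idea concrete:

```python
import torch

# Hypothetical timeline: each entry pairs a text prompt with the number of
# frames the user expects that action to occupy.
timeline = [
    ("a person walks forward", 80),
    ("the person turns around", 40),
    ("the person sits down", 60),
]

# Generate each interval at its requested length and concatenate; a real
# system would also need to blend or condition across segment boundaries.
# segments = [generate_segment(text, length) for text, length in timeline]
# motion = torch.cat(segments, dim=0)  # total frames = sum of interval lengths
```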

makeme-zgz commented 2 years ago

Thanks so much for the insights and suggestion!