Closed: zhixinwang closed this issue 3 years ago
Hi @zhixinwang,
Thanks for your interest in our work. For your questions,
1/ The transformer we use corresponds to only the encoder
in a typical encoder-decoder transformer. So the Motion/Audio transformer takes in a sequence and outputs a sequence of the same length. The attention here is self-attention, which means Q, K, V all come from the same input sequence.
2/ You are right. Just concatenation along the sequence dimension.
3/ As mentioned in 1/, there is no decoder in the cross-modal transformer either. So it outputs a sequence with the same sequence length (240 + 120). In practice, we only supervise the first 20 frames and skip the others.
4/ The output is also rotation matrices plus root trajectories, in the same format as the input, so your understanding is correct. The root trajectories are absolute translations in world space. That paper claims the 6D representation is better, but we find that in our experiments the rotation matrix representation works just fine.
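The architecture described in points 1/-4/ can be summarized at the shape level. This is a sketch, not the authors' code: the two modality encoders are stand-in length-preserving maps, and the hidden width of 800 and the "supervise only the first 20 frames" rule follow the thread.

```python
import numpy as np

# Shape-level sketch of the pipeline described above (not the authors'
# implementation). Encoders are placeholder linear maps standing in for
# the self-attention Motion/Audio transformers.
SEED_LEN, MUSIC_LEN, SUPERVISE_LEN = 120, 240, 20
MOTION_DIM, AUDIO_DIM, HIDDEN = 219, 35, 800

rng = np.random.default_rng(0)

def encoder(x, w):
    """Stand-in for an encoder-only transformer: preserves sequence length."""
    return x @ w  # (T, in_dim) -> (T, HIDDEN)

w_motion = rng.standard_normal((MOTION_DIM, HIDDEN)) * 0.01
w_audio = rng.standard_normal((AUDIO_DIM, HIDDEN)) * 0.01
w_out = rng.standard_normal((HIDDEN, MOTION_DIM)) * 0.01

motion = rng.standard_normal((SEED_LEN, MOTION_DIM))  # rot-mats + root traj
audio = rng.standard_normal((MUSIC_LEN, AUDIO_DIM))

# Concatenate along the sequence dimension, then the cross-modal
# transformer (again a stand-in) outputs a same-length sequence.
h = np.concatenate([encoder(motion, w_motion),
                    encoder(audio, w_audio)], axis=0)  # (360, 800)
pred = h @ w_out                                       # (360, 219)
loss_target = pred[:SUPERVISE_LEN]                     # only first 20 supervised
print(h.shape, pred.shape, loss_target.shape)
```

The key point the shapes make explicit: concatenation happens along time, which only works because both encoders project their inputs to the same feature width first.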
@liruilong940607 Thank you very much! I have tried the new model, but now I find the 20 motion predictions are all the same after convergence. Have you encountered this problem?
I never met this kind of issue. Are you using causal attention or bi-directional attention? In our experiments, causal attention leads to freezing motion predictions after several autoregressive predicting steps. However, in the first several steps it still gives you smooth motion predictions.
BTW, we are working on releasing the model in the near future so stay tuned!
Hi @liruilong940607, I have some further questions based on the above discussion. As your paper says, the shape of the audio feature is (240, 35) and the shape of the motion feature is (120, 219). Since their feature dimensions differ, I guess you change the feature dimension before concatenation. Does that happen before or after the single-modality transformer computation?
@JoyMYZ Yes, that is the case. Both the motion and audio transformers output 800-dim features, so they can be concatenated.
I have solved the problem. I use pre-norm transformers instead of the original version. Thanks.
Hi @zhixinwang , I have some questions about the inference
@starlesschan For your questions: 1/ Yes, that is correct. 2/ At each step, it takes in a motion seed with 120 frames and music with 240 frames. Only the first frame of the output is preserved and appended to the input motion seed. Then both the input motion and music are shifted by 1 frame, so their lengths stay the same, to generate the next motion frame.
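The 1-frame-shift autoregressive loop described above can be sketched as follows. Here `model` is a placeholder for the trained network: any function mapping a (120, 219) motion seed plus a (240, 35) music window to a (360, 219) prediction.

```python
import numpy as np

# Sketch of the 1-frame-shift autoregressive inference loop described
# above. `model` is a dummy placeholder for the trained network.
SEED_LEN, MUSIC_LEN, D, A = 120, 240, 219, 35
rng = np.random.default_rng(0)

def model(motion_seed, music_window):
    return rng.standard_normal((SEED_LEN + MUSIC_LEN, D))  # dummy output

music = rng.standard_normal((1200, A))        # full-track music features
motion = rng.standard_normal((SEED_LEN, D))   # initial motion seed

generated = []
for t in range(100):
    pred = model(motion, music[t:t + MUSIC_LEN])
    next_frame = pred[0]                 # keep only the first output frame
    generated.append(next_frame)
    # shift by 1 frame: drop the oldest seed frame, append the new one
    motion = np.vstack([motion[1:], next_frame])

generated = np.stack(generated)
print(generated.shape)  # (100, 219)
```

Note that both the motion seed and the music window slide by exactly one frame per step, so the input lengths stay fixed at 120 and 240.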
Thanks for your reply, it helps a lot. I have tried the way you described, but I still have some questions:
@starlesschan 1/ and 2/ are both correct.
For 3/, it's quite strange as we didn't encounter such problems. 2D keypoints should be able to work, though we didn't try it. One suggestion I have is to check the training loss. If the training loss smoothly reduces to a reasonable value, the auto-regressive inference should not generate such abnormal results.
Hi @liruilong940607, I have some questions on three aspect from discussions above:
- training: a) The output (360 frames) is supervised only on the first 20 frames. I also encountered the "20 motion predictions are all the same" problem that @zhixinwang had. Do you use a warm-up stage when training the original (post-norm) transformer? BTW, I used a pre-norm transformer to alleviate this problem. b) I found that "smpl_trans" in the motion data is large compared with the rotation matrices converted from "smpl_poses". Do you directly regress the 3-dim absolute translations in world space, or do you pre-scale the translations by "smpl_scaling" during training?
- inference: you said "Only the first frame of the output is preserved and appended to the input motion seed". When I use the auto-regressive generation method you mentioned above, I find the motion changes quickly and is not stable (it seems to have been accelerated). However, when I use the first 20 output frames and shift the input music & dance by 20 frames, the motion is more natural. Why is there a difference between the training and inference stages (20 frames are supervised during training, while only 1 frame is used to shift during inference)?
- testing: when I test the model on complete music (e.g. the sFM sample), the generated motion is either not diverse (a single motion repeats) or freezes. Did you encounter this kind of problem when generating long sequences? And do you have any suggestions for generating smooth and diverse motion sequences?
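On point (b) above, one common normalization is to divide the root translations by the per-sequence scaling factor so they land on a numeric range comparable to the rotation entries. The field names `smpl_trans` and `smpl_scaling` follow the AIST++ annotation format; whether the authors apply exactly this preprocessing is not confirmed in this thread, so treat this as a sketch.

```python
import numpy as np

# Sketch of pre-scaling world-space root translations by the sequence's
# SMPL scaling factor (field names assume the AIST++ annotation format;
# not confirmed as the authors' preprocessing).
def normalize_translation(smpl_trans, smpl_scaling):
    """smpl_trans: (T, 3) world-space root positions; smpl_scaling: scalar."""
    return smpl_trans / smpl_scaling

trans = np.array([[900.0, 1200.0, 300.0],
                  [903.0, 1198.0, 301.0]])
scaling = 100.0
normalized = normalize_translation(trans, scaling)
print(normalized)  # values now on the order of ~10, not ~1000
```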
I encountered the same problem as you: the motion changes quickly and is unstable at the inference stage.
Yep. Did you try to regress the prediction in a 20-frame fashion, i.e. concatenating the first 20 frames of the 360-frame prediction and then shifting the input music & motion by 20 frames? I could generate smoother motion this way for short sequences (about 10 s), but could not generate longer sequences (about 40 s) well.
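The 20-frame-shift variant being discussed differs from the 1-frame loop only in how much of each prediction is kept and how far the windows slide. A sketch, again with `model` as a dummy placeholder for the trained network:

```python
import numpy as np

# Sketch of the 20-frame-shift inference variant: keep the first 20
# predicted frames each step instead of 1. `model` is a dummy stand-in.
SEED_LEN, MUSIC_LEN, D, A, CHUNK = 120, 240, 219, 35, 20
rng = np.random.default_rng(0)

def model(motion_seed, music_window):
    return rng.standard_normal((SEED_LEN + MUSIC_LEN, D))

music = rng.standard_normal((1200, A))
motion = rng.standard_normal((SEED_LEN, D))

generated = []
for step in range(10):                      # 10 steps -> 200 frames
    start = step * CHUNK
    pred = model(motion, music[start:start + MUSIC_LEN])
    chunk = pred[:CHUNK]                    # keep the first 20 frames
    generated.append(chunk)
    motion = np.vstack([motion[CHUNK:], chunk])  # shift the seed by 20

out = np.concatenate(generated)
print(out.shape)  # (200, 219)
```

This matches the training setup more closely (20 supervised frames per forward pass), which may explain the smoother short-sequence results reported above.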
Yes, I have tried this way. I could generate relatively natural results, but they are not smooth. I haven't had any good results yet.
Hi @liruilong940607, I have some questions about the input and output of the transformer network. The motion feature dimension is 24*3*3 + 3 = 219, but isn't it problematic to regress a 3x3 rotation matrix directly? And for the regression of the global translation, do you directly regress the absolute translation (x, y, z) in 3D space, or just the offset relative to the first frame of the seed motion? Thanks!
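For reference, the 6D rotation representation mentioned earlier in the thread maps a raw 6-number network output to a valid rotation matrix via Gram-Schmidt (from "On the Continuity of Rotation Representations in Neural Networks"). A sketch, using one common row-stacking convention:

```python
import numpy as np

# 6D -> rotation matrix via Gram-Schmidt (Zhou et al.). Guarantees the
# output is a valid rotation even when the raw network output is not.
def rot6d_to_matrix(x6):
    """x6: (6,) raw network output -> (3, 3) rotation matrix."""
    a1, a2 = x6[:3], x6[3:]
    b1 = a1 / np.linalg.norm(a1)            # normalize first vector
    b2 = a2 - np.dot(b1, a2) * b1           # remove b1 component
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)                   # complete right-handed frame
    return np.stack([b1, b2, b3], axis=0)

R = rot6d_to_matrix(np.array([1.0, 0.1, 0.0, 0.0, 1.0, 0.2]))
print(np.allclose(R @ R.T, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))
# -> True True
```

When regressing a 3x3 matrix directly instead (as the authors say worked fine for them), the output is not guaranteed to be orthonormal, so a projection such as SVD orthogonalization is typically needed before using it as a rotation.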