google / aistplusplus_api

API to support AIST++ Dataset: https://google.github.io/aistplusplus_dataset
Apache License 2.0

Questions about Transformer Network and Model Training #10

Closed: zhixinwang closed this issue 3 years ago

zhixinwang commented 3 years ago

Hi @liruilong940607, I have some questions about the input and output of the transformer network.

  1. As the paper says, during training the input to the Audio Transformer is the music feature Y with shape (240, 35), and the input to the Motion Transformer is the motion data X with shape (120, 219). So Q, K, V are all the same, including the query vector of the decoder of these transformers. Is this right?
  2. For the cross-modal transformer, assume I get H(X) with shape (120, 512) and H(Y) with shape (240, 512) after the motion and audio transformers. How are the music and motion features fused? Just concatenate both along the sequence dimension, resulting in features with shape (240 + 120, 512)?
  3. What is the query vector of the cross-modal transformer decoder? How do you get a motion output of 20 frames?
  4. What is the representation of the motion prediction? Is it the same as the input motion data, with 24*3*3 + 3 = 219 dimensions? I think it is not ideal to regress a 3x3 rotation matrix directly. And for the global translation, do you regress the absolute translation (x, y, z) in 3D space directly, or just the offset relative to the first frame of the seed motion? Thanks!
liruilong940607 commented 3 years ago

Hi @zhixinwang,

Thanks for your interest in our work. To answer your questions:

1/ The transformer we use corresponds to only the encoder in a typical encoder-decoder transformer, so the Motion/Audio Transformer takes in a sequence and outputs a sequence of the same length. The attention here is self-attention, which means Q, K, and V all come from the same input sequence.
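A minimal sketch of this encoder-only setup, written in PyTorch purely for illustration (the layer and head counts here are placeholders, not the released model's settings):

```python
# Encoder-only transformers for each modality: the output sequence has the
# same length as the input sequence, and attention is plain self-attention.
import torch
import torch.nn as nn

d_model = 800  # hidden size; inputs are assumed already embedded to this size

audio_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=10, batch_first=True),
    num_layers=2)
motion_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=10, batch_first=True),
    num_layers=2)

audio_seq = torch.randn(1, 240, d_model)   # 240 embedded music frames
motion_seq = torch.randn(1, 120, d_model)  # 120 embedded seed-motion frames

h_audio = audio_encoder(audio_seq)     # (1, 240, 800): same length as input
h_motion = motion_encoder(motion_seq)  # (1, 120, 800): same length as input
```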

2/ You are right. Just concatenation along the sequence dimension.

3/ As mentioned in 1/, there is no decoder in the cross-modal transformer either, so it outputs a sequence with the same length (240 + 120). In practice, we only supervise the first 20 frames and skip the others.
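A sketch of this cross-modal step, again in PyTorch with placeholder sizes; the concatenation order and the L2 loss are my assumptions, only the "concatenate along time, supervise the first 20 frames" idea comes from the answer above:

```python
# Concatenate the two encoded sequences along the time axis, run another
# encoder-only transformer, and compute the loss only on the first 20 frames.
import torch
import torch.nn as nn

d_model = 800
h_motion = torch.randn(1, 120, d_model)  # motion transformer output
h_audio = torch.randn(1, 240, d_model)   # audio transformer output

cross_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=10, batch_first=True),
    num_layers=12)                        # depth is a placeholder
out_head = nn.Linear(d_model, 219)        # project back to the motion format

fused = torch.cat([h_motion, h_audio], dim=1)  # (1, 120 + 240, 800)
pred = out_head(cross_encoder(fused))          # (1, 360, 219)

target = torch.randn(1, 20, 219)               # next 20 ground-truth frames
loss = nn.functional.mse_loss(pred[:, :20], target)  # only first 20 supervised
```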

4/ The output is also rotation matrices plus root trajectories, in the same format as the input, so your understanding is correct. The root trajectories are absolute translations in world space. It is claimed in this paper that the 6D rotation representation is better, but in our experiments the rotation matrix representation works just fine.
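For reference, a small sketch of the 219-dim per-frame representation described here (using scipy to convert the dataset's axis-angle smpl_poses into rotation matrices; the helper function name is mine):

```python
# 24 SMPL joint rotations as flattened 3x3 matrices (24*9) plus the 3-dim
# root translation gives 219 values per frame.
import numpy as np
from scipy.spatial.transform import Rotation as R

def encode_frame(smpl_pose_aa, smpl_trans):
    """smpl_pose_aa: (24, 3) axis-angle joint rotations; smpl_trans: (3,)."""
    rotmats = R.from_rotvec(smpl_pose_aa).as_matrix()          # (24, 3, 3)
    return np.concatenate([rotmats.reshape(-1), smpl_trans])   # (219,)

frame = encode_frame(np.zeros((24, 3)), np.zeros(3))
assert frame.shape == (24 * 3 * 3 + 3,)  # 219
```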

zhixinwang commented 3 years ago

@liruilong940607 Thank you very much! I have tried the new model, but now I find that the 20 predicted motion frames are all the same after convergence. Have you encountered this problem?

liruilong940607 commented 3 years ago

I have never met this kind of issue. Are you using causal attention or bi-directional attention? In our experiments, causal attention leads to frozen motion predictions after several autoregressive prediction steps; however, in the first several steps it still gives you smooth motion predictions.
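For clarity, a small PyTorch sketch of the difference between the two attention variants mentioned above (illustrative only):

```python
# Bi-directional attention uses no mask; causal attention masks out future
# positions with an upper-triangular -inf mask.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=800, nhead=10, batch_first=True)
x = torch.randn(1, 360, 800)

bi_directional = layer(x)  # every frame attends to every other frame

causal_mask = torch.triu(torch.full((360, 360), float('-inf')), diagonal=1)
causal = layer(x, src_mask=causal_mask)  # each frame attends only to the past
```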

BTW, we are working on releasing the model in the near future so stay tuned!

JoyMYZ commented 3 years ago

Hi @liruilong940607, I have some further questions based on the discussion above. As your paper says, the shape of the audio feature is (240, 35) and the shape of the motion feature is (120, 219). Since their feature dimensions differ, I guess you change the feature dimension before the concatenation. Does that happen before or after the single-modality transformers?

liruilong940607 commented 3 years ago

@JoyMYZ Yes, that is the case. Both the motion and audio transformers output 800-dim features, so they can be concatenated.
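One common way to get both modalities to a shared 800-dim size is a learned linear embedding per modality; where exactly that projection sits is not specified above, so treat this sketch as an assumption:

```python
# Per-modality linear embeddings that map the raw 35-dim music features and
# 219-dim motion features to a shared 800-dim space.
import torch
import torch.nn as nn

audio_embed = nn.Linear(35, 800)     # 35-dim music features per frame
motion_embed = nn.Linear(219, 800)   # 219-dim motion features per frame

audio_seq = audio_embed(torch.randn(1, 240, 35))     # (1, 240, 800)
motion_seq = motion_embed(torch.randn(1, 120, 219))  # (1, 120, 800)
```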

zhixinwang commented 3 years ago

I have solved the problem: I switched to pre-norm transformers instead of the original post-norm version. Thanks.
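For anyone unfamiliar with the distinction, a minimal pre-norm block looks roughly like the sketch below (PyTorch, illustrative sizes); recent PyTorch versions also expose this ordering via nn.TransformerEncoderLayer(..., norm_first=True):

```python
# Pre-norm: LayerNorm is applied before attention/FFN and the residual is
# added to the un-normalized input, instead of normalizing after the residual
# as in the original (post-norm) transformer.
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model=800, nhead=10):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # pre-norm attention
        x = x + self.ffn(self.norm2(x))                    # pre-norm FFN
        return x
```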

starlesschan commented 3 years ago

Hi @zhixinwang, I have some questions about inference:

  1. Is the model trained with (120 + 240) input frames and (120 + 240) output frames, but only the first 20 output frames are supervised?
  2. At inference time, how does the model run autoregressively? I don't really understand this part: given, say, 60 s of music, how does the autoregressive inference proceed?
liruilong940607 commented 3 years ago

@starlesschan To your questions: 1/ Yes, that is correct. 2/ At each step, the model takes in a motion seed of 120 frames and music of 240 frames. Only the first frame of the output is preserved and appended to the input motion seed. Then both the input motion and music are shifted by one frame (keeping the same lengths) to generate the next motion frame.
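A sketch of that one-frame autoregressive loop (numpy pseudocode; `model` is a hypothetical callable standing in for the trained network, not an API from this repo):

```python
# Shift-by-one autoregressive generation: at each step, keep only the first
# predicted frame and append it to the motion seed.
import numpy as np

def generate(model, motion_seed, music, num_frames):
    """motion_seed: (120, 219); music: (T, 35) with T >= 240 + num_frames."""
    motion = motion_seed.copy()
    outputs = []
    for t in range(num_frames):
        music_window = music[t:t + 240]            # music shifted by 1 per step
        pred = model(motion[-120:], music_window)  # (N, 219) predicted frames
        next_frame = pred[0]                       # keep only the first frame
        outputs.append(next_frame)
        motion = np.concatenate([motion, next_frame[None]], axis=0)
    return np.stack(outputs)                       # (num_frames, 219)
```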

starlesschan commented 3 years ago

Thanks for your reply, it helps a lot. I have tried the approach you described, but I still have some questions:

  1. So you split all the paired music and dance data into samples containing 120 motion frames and 240 music frames, and then use the next 20 motion frames as labels for training, without any autoregression during training, only at inference time, right?
  2. Do all three transformer models use only the encoder part of the original transformer?
  3. I use 2D keypoints for training. During autoregressive generation, the predicted keypoints sometimes do not conform to the human body structure, and the generated character's movements drift irregularly on screen and are not stable. Have you ever encountered such problems?
liruilong940607 commented 3 years ago

@starlesschan 1/ and 2/ are both correct.

For 3/, that is quite strange, as we did not encounter such problems. 2D keypoints should be able to work, though we did not try them. One suggestion is to check the training loss: if it smoothly decreases to a reasonable value, the autoregressive inference should not generate such abnormal results.

Jhc-china commented 3 years ago

Hi @liruilong940607, I have some questions on three aspects of the discussion above:

  1. Training: a) The output (360 frames) is supervised on the first 20 frames. I also ran into the "20 motion predictions are all the same" problem that @zhixinwang had. Do you use a warm-up stage when training the original (post-norm) transformer? BTW, I use a pre-norm transformer to alleviate this problem. b) I found that "smpl_trans" in the motion data is large compared with the rotation matrices converted from "smpl_poses". Do you directly regress the 3-dim absolute translations in world space, or do you pre-scale the translations by "smpl_scaling" during training?
  2. Inference: You said "Only the first frame of the output is preserved and appended to the input motion seed." When I use this autoregressive generation method, the motion changes quickly and is not stable (it seems to be sped up). However, when I use the first 20 output frames and shift the input music & dance by 20 frames, the motion is more natural. Why is there this difference between training and inference (20 frames are supervised during training, while only 1 frame is used to shift during inference)?
  3. Testing: When I test the model on a complete piece of music (e.g. the sFM sample), the generated motion is either not diverse (repeating a single motion) or frozen. Did you encounter this kind of problem when generating long sequences? Any suggestions for generating smooth and diverse long motion sequences?
starlesschan commented 3 years ago

@Jhc-china I encountered the same problem: the motion changes quickly and is not stable at inference time.

Jhc-china commented 3 years ago

@starlesschan Yep. Did you try regressing the prediction in a 20-frame fashion, i.e. keeping the first 20 frames of the 360-frame prediction and then shifting the input music & motion by 20 frames? I can generate smoother motion this way for short sequences (about 10 s), but it still does not work well for longer sequences (about 40 s).
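For concreteness, the 20-frame variant described here would look roughly like this (same hypothetical `model` as in the one-frame sketch above):

```python
# Shift-by-20 generation: keep the first 20 predicted frames per step and
# advance both the motion seed and the music window by 20 frames.
import numpy as np

def generate_20(model, motion_seed, music, num_steps):
    motion = motion_seed.copy()                    # (120, 219)
    outputs = []
    for s in range(num_steps):
        music_window = music[20 * s:20 * s + 240]
        pred = model(motion[-120:], music_window)  # (N, 219), N >= 20
        chunk = pred[:20]                          # keep the first 20 frames
        outputs.append(chunk)
        motion = np.concatenate([motion, chunk], axis=0)
    return np.concatenate(outputs, axis=0)         # (20 * num_steps, 219)
```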

starlesschan commented 3 years ago

@Jhc-china Yes, I have tried this approach. The results are relatively natural but not smooth; I have not gotten good results yet.