PKUFlyingPig opened this issue 4 years ago
Hi @PKUFlyingPig,
I'm not sure I understand your question. Since the gesture-generation system is driven by speech, speech is a necessary input to the system, and you need to have the speech track in order to generate output in the first place. Therefore you should already have the speech that goes along with the BVH motion you generated.
For our demonstration videos, we used a rendering pipeline to turn the BVH motion files into (silent) video clips of a moving avatar. I expect that the GENEA Workshop will release such a rendering pipeline in the near future, if you cannot set one up yourself. Once you have the video clips, you can use ffmpeg or similar to add the speech as an audio track to these videos.
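For reference, a minimal sketch of that last muxing step, assuming the rendered silent clip is called clip.mp4 and the matching speech is speech.wav (both file names are placeholders):

import subprocess

# Copy the video stream as-is, encode the speech as AAC, and stop at the
# shorter of the two inputs so the audio cannot run past the rendered motion.
subprocess.run([
    "ffmpeg", "-i", "clip.mp4", "-i", "speech.wav",
    "-c:v", "copy", "-c:a", "aac", "-shortest",
    "clip_with_audio.mp4",
], check=True)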
Thank you so much for your detailed explanation. Now I just want to test my pretrained model on my self-recorded audio, but I found that when I want to synthesize gestures, the code (i.e., the generate_sample function in trainer.py) only takes a test batch as input, which also includes the gesture ground truth. I noticed that on the paper website you also provided synthesized gestures for Obama, which were certainly generated without gesture ground truth. Could you share the code that supports this kind of generation, or tell me which part of your code I need to modify to run my own test?
@simonalexanderson knows the details of the synthesis implementation better than I do. But I am surprised to hear that synthesis needs the gesture ground-truth output specifically, and I don't think that's right. The whole point of gesture generation is that we don't know what the "correct" gestures are, so we have to create new gestures ourselves.
MoGlow and StyleGestures are autoregressive methods, and take previous poses as input when generating the next pose. They therefore require us to specify values for these previous poses for initialisation. However, these initial poses need not be taken from the ground truth. If you watched our MoGlow video on YouTube, you will have seen that initialising the past pose inputs from the mean pose also works well. Although the comment "Initialize the pose sequence with ground truth test data" in trainer.py suggests that the current code uses recorded ground-truth motion for initialisation, this is just one way to do it and can easily be changed.
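As an illustration only, here is a rough sketch of mean-pose initialisation. The names train_poses, n_samples and n_lookback are placeholders for whatever shapes the sampling code in trainer.py actually uses; the idea is simply to replace the ground-truth slice with tiled copies of the training-set mean pose.

import numpy as np

# train_poses: (clips, frames, pose_features) array of (standardised) training motion.
mean_pose = train_poses.reshape(-1, train_poses.shape[-1]).mean(axis=0)

# Tile the mean pose over the autoregressive history window the sampler expects,
# instead of copying recorded ground-truth frames into it.
autoreg_init = np.tile(mean_pose, (n_samples, n_lookback, 1))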
I hope this answers your question.
Thank you. I successfully ran the model on my own audio input, but I found that the output is not as smooth or good as on the original test set, whose speaker is the same person as in the training set. In the test output data you provided on the paper website, I found a folder called other_speakers, which contains the output for other speakers' audio input. The Obama output is not ideal, but the podcast output is quite promising. Did your pretrained model ever see these speakers' audio, or were these outputs generated by the model trained only on the Trinity dataset?
Good to hear that you managed to run the model successfully.
Our model was trained on the single-speaker Trinity Speech-Gesture Dataset only. The content in the other_speakers folder is specifically intended to demonstrate the results of applying the pre-trained model to speakers who were not in the training set, so none of those speakers were "seen by the model before".
Thank you so much for your advice and patient answers; the model runs quite well now with my own audio input! I am now trying to train the model on a new dataset that has a different bone hierarchy from the Trinity dataset. Do I need to change the preprocessing code, or even some parameters of the model, like the input size?
The relevant code was written by @simonalexanderson, so this is not something that I have any direct experience of. I suspect that there are parts of the code that make assumptions about the skeleton, the number of joints, etc., and that the preprocessing has to be adapted to fit the characteristics of each dataset, but I don't know for sure. I think the best way to find out is to try it.
It was quite convenient to adapt the code to a new bone hierarchy! But I am a bit confused about the code that adds the style control to the input. Why are the control values set to the 15%, 50% and 85% quantiles? And what do the hard-coded constants mean? For example, dev_ctrl is an ndarray of shape (samples, time steps, features); why is only the last feature filled, and why is it filled with a step size of three?
Also, there is a problem in the preprocessing code. When I tried to extract full-body motion data, the pipeline applied the pos_rot_deltas transform to the root, but that method is not actually implemented, even though the RootTransformer class claims to accept pos_rot_deltas as a method argument.
Why are the control values set to the 15%, 50% and 85% quantiles?
We used those three constant control-input values to assess and verify the efficacy of the control in the experiments of our paper, particularly Figure 4.
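For concreteness, here is a sketch of how such constant control values can be derived from the training data; style_feature_train is a placeholder for whatever per-frame style feature (e.g. hand height) your preprocessing extracts.

import numpy as np

# Per-frame style feature collected over the whole training set.
low, mid, high = np.quantile(style_feature_train, [0.15, 0.50, 0.85])

# Each value can then be tiled over time and used as a constant control input
# when sampling, giving "low", "typical" and "high" style conditions.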
If you implement a more capable and convenient control-input interface, please consider making a pull request! :)
What do the hard-coded constants mean? For example, dev_ctrl is an ndarray of shape (samples, time steps, features).
Unfortunately, I believe only @simonalexanderson can answer questions about the code at this level of detail.
Why is only the last feature filled, and why is it filled with a step size of three?
This is another question that only @simonalexanderson can answer with certainty. I have a vague suspicion that the step size of three might be related to the fact that we downsampled the original 60 fps motion-capture data to 20 fps for the modelling in the paper, but I could easily be wrong. This let us split, say, a one-second motion-capture clip at the original framerate (60 frames) into three one-second clips (20 frames each) at the lower framerate. The first clip would contain the frames [1, 4, 7, ...], the second the frames [2, 5, 8, ...], and the final clip the frames [3, 6, 9, ...].
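A tiny sketch of that kind of split, assuming motion_60fps is a (frames, features) array at the original frame rate:

import numpy as np

motion_60fps = np.random.randn(60, 10)   # placeholder: 1 s of 60 fps motion

# Keeping every third frame, with three different offsets, yields three
# 20 fps clips that together cover all of the original frames.
clips_20fps = [motion_60fps[offset::3] for offset in range(3)]
assert all(clip.shape[0] == 20 for clip in clips_20fps)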
When I tried to extract full-body motion data, the pipeline applied the pos_rot_deltas transform to the root, but that method is not actually implemented, even though the RootTransformer class claims to accept pos_rot_deltas.
Thanks for reporting this. I am not the coder on the team, so I hope @simonalexanderson will have time to look into this and see where the potential flaw sits and what to do about it. But if you resolve the issue on your own, a pull request would of course be very welcome!
I'm sorry to trouble you. I don't know how to test my pretrained model on my self-recorded audio. How can I preprocess the audio so that my pretrained model can infer gestures from it? Thank you very much if you could give me some guidance.
You should first understand the code in trainer.py, which generates the samples, and audio_features.py, which extracts the features from your audio input. If you don't want to write your own inference code, you can follow these steps (they might be a little clumsy); a rough sketch of steps 1 and 2 follows below:
1. Use ffmpeg to convert your audio to WAV format.
2. Put it into the data/trinity/source/audio directory and make a fake BVH file in the data/trinity/source/bvh directory; their names must be consistent.
3. Change the variable heldout in prepare_datasets.py to match your audio file name.
4. Run prepare_datasets.py; the test and validation datasets will now have your audio as the control input.
5. Run train_moglow.py.
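A rough sketch of steps 1 and 2, with hypothetical file names (my_recording.m4a, MyRecording) that you should replace with your own; whether the placeholder BVH needs to match the audio length is something to check in prepare_datasets.py.

import shutil
import subprocess

# Step 1: convert the recording to WAV and drop it where the pipeline expects audio.
subprocess.run(["ffmpeg", "-i", "my_recording.m4a",
                "data/trinity/source/audio/MyRecording.wav"], check=True)

# Step 2: reuse an existing Trinity BVH file (Recording_001.bvh here only as an example)
# as a placeholder with a matching name; its motion content is not what you evaluate,
# it only has to exist and parse.
shutil.copy("data/trinity/source/bvh/Recording_001.bvh",
            "data/trinity/source/bvh/MyRecording.bvh")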
Thank you. I really appreciate your patient answers. It works well now.
I found that in some sampled motions the hands spin unnaturally, even on the Trinity test dataset. Did you ever encounter this problem? Looking forward to your advice and guidance.
in some sampled motions the hands spin unnaturally, even on the Trinity test dataset. Did you ever encounter this problem?
I do not recall seeing a specific problem which would fit this description. Although this is not my area of expertise, two hypotheses are that it could be related to the breakpoint angle used for phase wrapping, or that it is a consequence of gimbal lock, which can also cause unphysical rotations to occur.
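If you want to check whether simple angle wrapping is the culprit, one quick diagnostic (not a fix for genuine gimbal-lock artefacts) is to unwrap the rotation channels of the generated BVH and see whether the spinning disappears. This sketch assumes angles is one per-frame Euler rotation channel in degrees for a single joint:

import numpy as np

# angles: shape (n_frames,), one Euler rotation channel in degrees.
# np.unwrap (numpy >= 1.21 for the period argument) removes artificial
# +/-360-degree jumps caused by wrapping; it will not repair rotations
# that are genuinely wrong.
unwrapped = np.unwrap(angles, period=360.0)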
You can find BVH files of motion samples generated by the StyleGestures model for the entire GENEA test set in this Zenodo repository. The StyleGestures system is coded as condition SC.
@PKUFlyingPig Many thanks for your kind help the other day with testing my pretrained model on my self-recorded audio. Sorry for disturbing you again; I am still reproducing the paper and have run into a puzzle that has been bothering me for a few days. I notice that when the global step reaches 'plot_gap', the model samples some '.bvh' files. How can I find the corresponding audio files? I assumed they would be in the 'visualization_dev' folder, but the number of audio files in that folder differs from the number of output '.bvh' files. For example, I selected 'Recording_002.wav' as the heldout variable in prepare_datasets.py; its total duration is 7 minutes and 10 seconds, and the model produces 20 sampled files, named 'sampled_0_temp100_xxk_0' to 'sampled_0_temp100_xxk_19'. I can't find the audio clips corresponding to these 20 output '.bvh' files, so I can't evaluate the consistency between the audio and the sampled gestures. I would be grateful for any guidance. Looking forward to hearing from you!
" I can't find the corresponding audio clips of the sampled 20 output 'bvh' files. So I can't evaluate the consistence between the audio files and corresponding sampled gestures." the audio it used to sample is in the visualization_test fold. You can look into the prepare_dataset.py and train_moglow.py to understand better.
Hi, @ghenter. Thank you for the advice to use the GENEA pipeline to render the video, but my output looks a little different from yours.
I think something is wrong with my camera view, but I didn't modify anything in the bvh_writer code. How can I solve this?
Dear @PKUFlyingPig,
I am not a coder nor a 3D modeller and have no hands-on experience of the visualiser, so it's difficult for me to advise you. I agree that the camera angle in the screenshot appears poor. Does this happen when you visualise recorded natural motion from the GENEA Challenge data you trained on? If yes, I would guess it's an issue with server settings. If no, the issue is probably in your BVH output files (although the root cause of the problem could sit anywhere in training or synthesis; it's hard to know). You might want to check if the root node in the BVH is rotated unfavourably with respect to the camera, because that seems easy (at least in principle) to fix using post-processing.
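As one possible starting point for such a post-processing check, here is a sketch using pymo (which the repository's own preprocessing builds on). The file names are hypothetical, and the exact root joint and channel column names depend on your skeleton, so treat this as an illustration rather than a recipe:

from pymo.parsers import BVHParser
from pymo.writers import BVHWriter

data = BVHParser().parse("sampled.bvh")   # hypothetical file name
root = data.root_name                     # e.g. "Hips"; skeleton-dependent

# Inspect the root rotation channels; if the avatar faces away from the camera,
# offsetting or zeroing the relevant rotation channel is a crude but simple fix.
print(data.values[[f"{root}_Xrotation",
                   f"{root}_Yrotation",
                   f"{root}_Zrotation"]].describe())

# Example crude fix (uncomment only after checking which channel needs it):
# data.values[f"{root}_Yrotation"] = 0.0

with open("sampled_fixed.bvh", "w") as f:
    BVHWriter().write(data, f)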
Dear @ghenter @PKUFlyingPig, when I run "python train_moglow.py 'hparams/paper_version/style_gestures.json' trinity", I encounter the following problem:
100%|██████████| 230/230 [03:24<00:00, 1.12it/s]
Loss: -226.62378/ Validation Loss: 8.28833
epoch 172
100%|██████████| 230/230 [03:30<00:00, 1.09it/s]
Loss: -231.71614/ Validation Loss: 8.28833
epoch 173
91%|█████████▏| 210/230 [03:12<00:18, 1.09it/s][Checkpoint]: remove results/GENEA/log_20210911_1556/checkpoints/save_28k0.pkg to keep 3 checkpoints
generate_sample
generate_sample
inverse_transform...
Traceback (most recent call last):
File "train_moglow.py", line 53, in <module>
trainer.train()
File "/data/hehaiyang/project/code/CrossModal/Gesture/StyleGestures/glow/trainer.py", line 218, in train
self.generator.generate_sample(self.graph, eps_std=1.0, step=self.global_step)
File "/data/hehaiyang/project/code/CrossModal/Gesture/StyleGestures/glow/generator.py", line 82, in generate_sample
self.data.save_animation(control_all[:,:(n_timesteps-n_lookahead),:], sampled_all, os.path.join(self.log_dir, f'sampled_{counter}_temp{str(int(eps_std*100))}_{str(step//1000)}k'))
File "/data/hehaiyang/project/code/CrossModal/Gesture/StyleGestures/motion/datasets/trinity.py", line 72, in save_animation
self.write_bvh(anim_clips, filename)
Do you know what the reason might be? Thanks!
@PKUFlyingPig I'm sorry to bother you so late. I'm a student at Fuzhou University. I want to run Example 3 in the README, but I don't know what I need to prepare before running it, or what inputs I need. Where can I find the input audio data?
Hello @PKUFlyingPig, a problem occurred while I was running step 2: "FileNotFoundError: [Errno 2] No such file or directory: '../data/GENEA/processed/features_20fps/joint_rot/data_pipe.sav'". How can I get this file? I hope you can see this message.
Hello, I ran the command python train_moglow.py 'hparams/preferred/style_gestures.json' trinity, but the result I got was a BVH file, not a 3D model like yours. Did I do something wrong?
the result I got was a BVH file, not a 3D model like yours
This is as expected. Our code generates BVH files, which essentially contain a series of pose specifications for a 3D avatar. (Very roughly speaking, it's something like "in this frame, hold your arm like this; in the next frame, like this"; and so on.) To create a video of what the motion described by the BVH file looks like on an avatar, you need to use 3D software, decide which avatar to use, decide where to put the camera relative to the avatar, etc. For the GENEA 2020 data, I strongly recommend that you use the official GENEA Challenge 2020 BVH visualiser at https://github.com/jonepatr/genea_visualizer .
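As a small illustration that a BVH file really is just a text description of a pose sequence, you can inspect one without any 3D software; the file name below is a placeholder:

# Print the frame count and frame time stored in a generated BVH file.
with open("sampled.bvh") as f:   # hypothetical file name
    for line in f:
        stripped = line.strip()
        if stripped.startswith("Frames:") or stripped.startswith("Frame Time:"):
            print(stripped)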
Thank you. I really appreciate your patient answers.
I found that when I use my pretrained model to synthesize new gestures following the guidance, there is only BVH output. How can I get the paired audio data?