google-research / mint

Multi-modal Content Creation Model Training Infrastructure including the FACT model (AI Choreographer) implementation.
Apache License 2.0
510 stars, 87 forks

Root translation wrong after 2 seconds #42

Status: Open. Opened by abeacco 2 years ago.

abeacco commented 2 years ago

Hi,

Congratulations on such great work.

We've been trying to convert the generated animations (the output .npy files) to FBX, and we are almost there. Since you normalize the root translation, we multiply it again by the scale of our character in order to use it. However, this only seems to work for the first 2 seconds of every generated motion; after that, the root translation becomes exaggerated and wrong.

Any idea why this could be happening?

Thanks,
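For reference, undoing the normalization only requires multiplying the translation columns by the same factor that was divided out (`smpl_trans /= smpl_scaling` in tools/preprocessing.py). A minimal sketch, assuming the output .npy keeps the training layout from that script: 3 root-translation columns followed by 24 flattened 3x3 SMPL joint rotation matrices. The `offset` argument is a hypothetical knob in case your output has extra leading columns; check your array's width before relying on it.

```python
import numpy as np

N_JOINTS = 24  # assumption: SMPL body with 24 joints, one 3x3 rotation matrix each

def split_motion(motion, offset=0):
    """Split a (T, D) motion array into root translation and joint rotations.

    Assumes columns [offset, offset + 3) hold the normalized root translation and
    the next N_JOINTS * 9 columns hold flattened 3x3 rotation matrices.
    `offset` is hypothetical, for outputs with extra leading columns.
    """
    trans = motion[:, offset:offset + 3]
    flat = motion[:, offset + 3:offset + 3 + N_JOINTS * 9]
    return trans, flat.reshape(len(motion), N_JOINTS, 3, 3)

def denormalize_root(trans, scale):
    """Undo `smpl_trans /= smpl_scaling` by multiplying the same factor back in.

    One scalar applies to every frame; there is no per-frame scale.
    """
    return trans * scale
```

For a real result you would call `split_motion(np.load(path))` and then `denormalize_root(trans, scale)`. Because the factor is constant over the whole sequence, a result that looks right for two seconds and wrong afterwards points at the predicted translation itself rather than at the scale.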

liruilong940607 commented 2 years ago

Hi, how "wrong" is the root translation you are talking about? It is definitely not stable sometimes, but it shouldn't be way off.

abeacco commented 2 years ago

Hi,

It is quite bad, to be honest, especially compared to the first two seconds, where it is perfect. It depends on the inferred motion, for sure, but you can get displacements that are not at all similar to the training motions, in wrong directions relative to the rotation, etc. Of course it is difficult to judge because it is a "dance" motion, where some foot-sliding artifacts are expected, but I don't think it should be so exaggerated.

My impression is that the scale I multiply the root translation by is correct for the first two seconds, but afterwards it feels like it should be a different one. That makes no sense, though, since you normalize the translation over the whole inferred motion, right?

abeacco commented 2 years ago

Here is one example: gBR_sBM_cAll_d04_mBR0_ch01_mBR0.zip

liruilong940607 commented 2 years ago

I'm not able to view this .fbx file in Blender or any online viewer. Might something be wrong with the file?

Yeah, the normalization is the same across the entire sequence, so you don't need to treat the other frames differently from the first two seconds. But the first two seconds are ground-truth motion, which serves as a seed to trigger the generation, so it is not surprising that the following motion has worse quality.

Some results may be less satisfying and some better. Have you tried other sequences?
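Since the first two seconds are the ground-truth seed, one way to quantify how far the generated root motion departs from it is to compare the mean root speed in the two windows. A minimal sketch, assuming 60 fps (so a 120-frame seed) and a (T, 3) root-translation track; `fps` and `seed_seconds` are parameters to check against your own config.

```python
import numpy as np

def root_speed(trans, fps):
    """Per-frame root speed in units per second for a (T, 3) translation track."""
    return np.linalg.norm(np.diff(trans, axis=0), axis=1) * fps

def seed_vs_generated(trans, fps=60, seed_seconds=2.0):
    """Mean root speed inside the seed window versus the generated frames.

    speed[i] measures the step from frame i to frame i + 1, so the seed window
    covers speed[:n_seed - 1] and the generated part covers speed[n_seed:].
    """
    n_seed = int(round(fps * seed_seconds))
    speed = root_speed(trans, fps)
    return speed[:n_seed - 1].mean(), speed[n_seed:].mean()
```

A generated mean speed far above the seed's is a quick numeric signal of the drift described in this issue; comparing against ground-truth sequences of the same genre would give a fairer baseline.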

abeacco commented 2 years ago

Hi,

The .fbx file is OK. Apparently you can't import it into Blender because it is an ASCII file and Blender only supports binary, so I am attaching it again in binary format.

The thing is that it only contains the skeleton and the animation, not the model (otherwise the file would be much larger): gBR_sBM_cAll_d04_mBR0_ch01_mBR0_BINARY.zip. You can also try opening it with Autodesk FBX Review and enabling the option to view "All" (not "only models").

We have tried most of the motions generated by the evaluation test, several times, and they all have wrong root motion after two seconds. That matches what you say and what I thought: the first two seconds are pure ground-truth motion.

I really think that your method is awesome and the results are amazing, but for them to be practical and usable, this root motion problem still has to be solved.

Maybe there is a specific technique dedicated to this? Any ideas? I'll try to look into it as well.

Otherwise, do you think that your method could be adapted to obtain better results for the root motion?

Thank you very much, I really appreciate your help and your big work :)

liruilong940607 commented 2 years ago

Hi for some reason I still can't view this fbx file, even in Autodesk.

Could you send me the raw output of the model, before the conversion to FBX? It would save me some effort deploying this repo and running inference. I'm currently a bit occupied, so if you can share the raw output with me I'm happy to take a look.

Nevertheless, it is true that root translation is harder to learn than the joint rotations.

[Screenshot, 2022-03-12 7:19:56 PM]
abeacco commented 2 years ago

Hi,

As I said, there is no model in the FBX file, only the skeleton with the animation, so in the viewer go to the settings and set the displayed object mode to "All".

You'll find the character some units above the ground.

Also, please find attached again the original output .npy file and the corresponding converted .fbx file: gBR_sBM_cAll_d04_mBR0_ch01_mBR0.zip

Thanks,

liruilong940607 commented 2 years ago

I see haha. Thanks for the teaching!!

Yeah the results are worse than I expected, not only the translation but also the joint rotations.

I suspect our released model may not be fully trained compared to the one we used for the paper (this is a refactored code base, so the model was retrained after the submission). Also, someone reported that retraining the model gives better performance.

Here are some random results from our raw model, for your reference: Archive.zip

Thanks for reporting this! I have my hands full recently, but I will look into this in the coming weeks.

Best, Ruilong

abeacco commented 2 years ago

Hi,

I just looked at your random results, and they also have too many root problems to be practical animations. I think you could do some post-processing to fix this (although it is not straightforward), or apply some physics (again, quite difficult). Maybe a good approach, which I've seen in many other works, would be to annotate the training data with footstep and stance-foot information, so that you can infer that too for the newly synthesized motion and use it to produce the exact root motion.
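The footstep/stance-foot idea can be approximated in post-processing even without annotations. A minimal sketch of a simple foot-locking heuristic (a swapped-in technique, not part of the FACT code base): detect frames where a foot is low and nearly still, then shift the root horizontally so that foot does not slide. It assumes you already have world-space foot joint positions per frame (for example from SMPL forward kinematics), a y-up axis, and a single stance foot; the thresholds are hypothetical starting values to tune for your units.

```python
import numpy as np

def detect_contacts(foot_pos, fps=60.0, height_thresh=0.05, speed_thresh=0.3):
    """Mark frames where a foot is low and nearly still as ground contacts.

    foot_pos: (T, 3) world-space positions of one foot joint, y-up (assumption).
    Thresholds are hypothetical defaults, not values from the paper or repo.
    """
    speed = np.zeros(len(foot_pos))
    speed[1:] = np.linalg.norm(np.diff(foot_pos, axis=0), axis=1) * fps
    return (foot_pos[:, 1] < height_thresh) & (speed < speed_thresh)

def lock_stance_foot(root_trans, foot_pos, contacts):
    """Shift the root horizontally so the foot does not slide while in contact.

    When consecutive frames are both in contact, the foot's horizontal
    displacement between them is subtracted from the root. The accumulated
    correction carries forward to later frames so the root path stays continuous.
    """
    root = np.array(root_trans, dtype=float)  # copy, never modify the input
    offset = np.zeros(3)
    for t in range(1, len(root)):
        if contacts[t] and contacts[t - 1]:
            slide = foot_pos[t] - foot_pos[t - 1]
            offset[[0, 2]] -= slide[[0, 2]]  # horizontal components only (y-up)
        root[t] += offset
    return root
```

With two feet you would pick the stance foot per frame (for example the lower one) and apply the same correction; predicting contacts with the model, as suggested above, would replace the threshold-based detector.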

Let me know if I can help,

Thanks,

Alejandro Beacco

liruilong940607 commented 2 years ago

Yeah, those are very good ideas. There are also many cool ideas in this line of work: https://github.com/sebastianstarke/AI4Animation

Also, because inference is auto-regressive, errors accumulate during inference. So another way to improve this is to train the model with an auto-regressive scheme as well, i.e., using the inferred motion to train the model.
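A toy illustration of why auto-regressive inference accumulates error, with made-up numbers (`TRUE_STEP` and `MODEL_BIAS` are hypothetical): a one-step predictor with a tiny constant bias is evaluated under teacher forcing (each step starts from ground truth, as in standard training) and free running (each step starts from the previous prediction, as at inference time).

```python
import numpy as np

TRUE_STEP = 0.01     # hypothetical ground-truth root displacement per frame
MODEL_BIAS = 0.0005  # hypothetical small per-step prediction error

def model_step(x):
    """Toy one-step predictor: correct dynamics plus a small constant bias."""
    return x + TRUE_STEP + MODEL_BIAS

def teacher_forced_errors(gt):
    """Each step starts from the ground-truth frame, so errors never compound."""
    return np.abs(model_step(gt[:-1]) - gt[1:])

def free_running_errors(gt):
    """Each step starts from the previous prediction, so errors accumulate."""
    preds = [gt[0]]
    for _ in range(len(gt) - 1):
        preds.append(model_step(preds[-1]))
    return np.abs(np.array(preds[1:]) - gt[1:])

gt = np.arange(1201) * TRUE_STEP  # 1 start frame + 1200 predicted frames
print(teacher_forced_errors(gt).max())  # roughly 0.0005, one step's error
print(free_running_errors(gt).max())    # roughly 0.6, i.e. 1200 * 0.0005
```

The per-step error is identical in both cases; only the free-running rollout lets it compound. Training on the model's own rollouts (scheduled sampling is one known variant) exposes the model to these drifted inputs during training.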

Unfortunately I don't have plans to push this project forward in the near future. But feel free to try those cool ideas and post here with any progress!

Noyii commented 2 years ago

@abeacco Hi, how do you convert the .npy file to an .fbx file? I've tried without success.

CuberFan commented 2 years ago

https://github.com/softcat477/SMPL-to-FBX

miibotree commented 2 years ago

I retrained the model from scratch using batch_size=64, keeping the other parameters the same. After 70K steps the loss seemed to have converged, so I stopped training and checked the results. However, the results do not look good.

[Screenshot, 2022-07-06 11:15:58 AM: bad_result]

liruilong940607 commented 2 years ago

We trained our model for ~3 days. The loss converged to the level of 1e-5 or 1e-6 (I can't recall exactly). It is important to train long enough for the model to converge to a very good state. Inference is auto-regressive, so errors accumulate, and this is amplified if the model is not well trained.

miibotree commented 2 years ago

I double-checked the tfrecord generation script in tools/preprocessing.py (lines 167-182). For testval, you also randomly generate un-paired data:

# If testval, also test on un-paired data
if FLAGS.split == "testval":
    logging.info("Also add un-paired motion-music data for testing.")
    for i, seq_name in enumerate(seq_names * 10):
        logging.info("processing %d / %d" % (i + 1, n_samples * 10))

        smpl_poses, smpl_scaling, smpl_trans = AISTDataset.load_motion(
            dataset.motion_dir, seq_name)
        smpl_trans /= smpl_scaling
        smpl_poses = R.from_rotvec(
            smpl_poses.reshape(-1, 3)).as_matrix().reshape(smpl_poses.shape[0], -1)
        smpl_motion = np.concatenate([smpl_trans, smpl_poses], axis=-1)
        audio, audio_name = load_cached_audio_features(random.choice(seq_names))

        tfexample = to_tfexample(smpl_motion, audio, seq_name, audio_name)
        write_tfexample(tfrecord_writers, tfexample)

This adds 10 * 40 (20 val + 20 test) = 400 un-paired samples to the val/test tfrecords. It seems that the model works well on paired data (a motion seed with its own music), but on un-paired data the motion tends to freeze. Maybe we should evaluate paired and un-paired data separately?
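To evaluate paired and un-paired samples separately, the results can be partitioned by whether the audio matches the seed sequence's music. A minimal sketch, assuming result names follow `{seq_name}_{audio_name}` as in gBR_sBM_cAll_d04_mBR0_ch01_mBR0 from this thread, and that the fifth underscore-separated field of an AIST++ sequence name (`mBR0` here) is its music ID.

```python
def music_id(seq_name):
    """Music field of an AIST++ sequence name, e.g. 'mBR0' in gBR_sBM_cAll_d04_mBR0_ch01.

    Assumption: the music ID is the fifth underscore-separated field.
    """
    return seq_name.split("_")[4]

def split_paired(result_names):
    """Partition '{seq_name}_{audio_name}' result names into paired and un-paired lists."""
    paired, unpaired = [], []
    for name in result_names:
        seq_name, audio_name = name.rsplit("_", 1)
        target = paired if music_id(seq_name) == audio_name else unpaired
        target.append(name)
    return paired, unpaired
```

Because the snippet draws the audio with `random.choice(seq_names)`, some samples from the "un-paired" loop can end up with their original music; this split counts those as paired, which is arguably the right bucket for them.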

liruilong940607 commented 2 years ago

That's reasonable. The results on the un-paired data are not as good as the results with a paired motion seed and music.

CuberFan commented 2 years ago

Regarding the loss level you mentioned (1e-5 or 1e-6): it seems to be 1e-4 ~ 1e-5, not 1e-5 ~ 1e-6.

lzyplayer commented 1 year ago

Hello, I would like to ask how you got the translation scale of the root node mentioned in the original post. Was it from the character height?
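The thread does not say how the character scale was chosen. One common retargeting heuristic (an assumption here, not something the repo provides) is the ratio of the character's rest-pose hip height to the SMPL rest-pose hip height, each measured in its own units. In the sketch below the joint indices are inputs you supply; for SMPL the pelvis is joint 0 and the feet are joints 10 and 11.

```python
import numpy as np

def hip_height(rest_joints, pelvis_idx, foot_idxs, up_axis=1):
    """Pelvis height above the lowest foot joint in a (J, 3) rest pose (y-up by default)."""
    feet = rest_joints[list(foot_idxs), up_axis]
    return rest_joints[pelvis_idx, up_axis] - feet.min()

def retarget_root(trans_norm, char_rest, char_ids, smpl_rest, smpl_ids):
    """Scale a normalized (T, 3) SMPL root track to a character's units.

    Heuristic assumption: root displacement scales with leg length, approximated
    by the hip-height ratio. `*_ids` are (pelvis_idx, foot_idxs) tuples.
    """
    scale = hip_height(char_rest, *char_ids) / hip_height(smpl_rest, *smpl_ids)
    return trans_norm * scale
```

This only sets a single global scale; it does not address the drift after the seed frames discussed earlier in the thread.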

lvZic commented 1 year ago

Hi, have you solved the scale problem?

Noyii commented 1 year ago

This work is totally messed up; I advise you not to follow this paper.