DNO: Optimizing Diffusion Noise Can Serve As Universal Motion Priors
The official PyTorch implementation of the paper "DNO: Optimizing Diffusion Noise Can Serve As Universal Motion Priors".
Visit our project page for more details.
If you find this code useful in your research, please cite:
@inproceedings{karunratanakul2023dno,
title = {Optimizing Diffusion Noise Can Serve As Universal Motion Priors},
author = {Karunratanakul, Korrawe and Preechakul, Konpat and Aksan, Emre and Beeler, Thabo and Suwajanakorn, Supasorn and Tang, Siyu},
booktitle = {arxiv:2312.11994},
year = {2023}
}
📢 13/June/24 - Full code release with demo.
📢 9/May/24 - Initial release with functional generation and evaluation code.
Important: DNO is model agnostic and can be used with any diffusion model. The main file is dno.py. The demo code for the different tasks is in sample/gen_dno.py.
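For reference, the core recipe is short: treat the initial diffusion noise as the optimization variable, decode it into a motion with the frozen diffusion sampler, score the motion with a task-specific loss, and backpropagate through the sampler into the noise. The sketch below illustrates this idea with a generic, assumed sample_fn and criterion; it is not the exact interface of dno.py.

```python
import torch

def optimize_noise(sample_fn, criterion, noise_shape, num_opt_steps=300, lr=5e-2, device="cuda"):
    """Minimal sketch of diffusion noise optimization (not the exact dno.py API).

    sample_fn: differentiable callable mapping a latent noise tensor to a motion
               via the ODE/DDIM sampler of any pretrained diffusion model.
    criterion: task-specific loss on the decoded motion (editing, in-betweening, ...).
    """
    z = torch.randn(noise_shape, device=device, requires_grad=True)  # the optimization variable
    optimizer = torch.optim.Adam([z], lr=lr)

    for _ in range(num_opt_steps):
        optimizer.zero_grad()
        motion = sample_fn(z)      # decode noise -> motion, keeping the computation graph
        loss = criterion(motion)   # e.g. distance to a target pose or trajectory
        loss.backward()            # gradients flow through the sampler into the noise
        optimizer.step()

    return z.detach()
```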
This demo shows results using a version of MDM with Exponential Moving Average (EMA) that we trained ourselves.
The environment setup is the same as GMD. If you already have a working environment, it should also work here.
This code was tested on Ubuntu 20.04 LTS and requires the setup steps below.
Install ffmpeg (if not already installed):
sudo apt update
sudo apt install ffmpeg
For Windows, use this instead.
DNO uses the same dependencies as GMD, so if you have already installed GMD, you can use the same environment here.
Setup conda env:
conda env create -f environment_gmd.yml
conda activate gmd
conda remove --force ffmpeg
python -m spacy download en_core_web_sm
pip install git+https://github.com/openai/CLIP.git
Download dependencies:
bash prepare/download_smpl_files.sh
bash prepare/download_glove.sh
bash prepare/download_t2m_evaluators.sh
There are two paths to get the data:
(a) Generation only with the pretrained text-to-motion model, without training or evaluating.
(b) Get the full data to train and evaluate the model.
HumanML3D - Clone HumanML3D, then copy the data dir to our repository:
cd ..
git clone https://github.com/EricGuo5513/HumanML3D.git
unzip ./HumanML3D/HumanML3D/texts.zip -d ./HumanML3D/HumanML3D/
cp -r HumanML3D/HumanML3D Diffusion-Noise-Optimization/dataset/HumanML3D
cd Diffusion-Noise-Optimization
HumanML3D - Follow the instructions in HumanML3D, then copy the resulting dataset to our repository:
cp -r ../HumanML3D/HumanML3D ./dataset/HumanML3D
Download our version of MDM, then unzip and place it in ./save/.
The model is trained on the HumanML3D dataset.
We provide demo code for the motion editing, in-filling (and in-betweening), refinement, and blending tasks in sample/gen_dno.py. The task can be selected by commenting or uncommenting the list on lines 54-58 of that file. The dense optimization task can be used for debugging and testing the optimization process.
Note: The only differences between these tasks are the reward/loss function and whether to start from DDIM inverted noise or random noise. The rest of the framework is the same.
The demo targets are currently hardcoded in sample/dno_helper.py and can be changed to your own targets (e.g. your own reward function or hardcoded target poses/locations). In all tasks, the target pose and the mask need to be specified.
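As an illustration, the sketch below builds a toy trajectory-editing objective: a desired pelvis position at one frame plus a mask selecting which entries the loss sees. The tensor layout and joint index are assumptions for illustration and do not mirror the exact code in dno_helper.py.

```python
import torch

# Assumed layout for illustration: motion of shape [frames, joints, 3] in xyz coordinates.
num_frames, num_joints = 120, 22
target = torch.zeros(num_frames, num_joints, 3)
mask = torch.zeros(num_frames, num_joints, 3, dtype=torch.bool)

# Ask the pelvis (assumed joint 0) to be at (2.0, 0.0, 1.0) at frame 90; everything else is unconstrained.
target[90, 0] = torch.tensor([2.0, 0.0, 1.0])
mask[90, 0] = True

def criterion(motion):
    # Mean squared error only over the masked (specified) entries of the target.
    return ((motion - target)[mask] ** 2).mean()
```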
python -m sample.gen_dno --model_path ./save/mdm_avg_dno/model000500000_avg.pt --text_prompt "a person is jumping"
We can specify the initial motion by adding --load_from to the command. The initial motion must be in the same format as the target motion.
python -m sample.gen_dno --model_path ./save/mdm_avg_dno/model000500000_avg.pt --text_prompt "a person is jumping" --load_from ./save/mdm_avg_dno/samples_000500000_avg_seed20_a_person_is_jumping/trajectory_editing_dno_ref
We can use --seed to specify the seed for the random noise and generation. For motion editing, there is a UI for trajectory editing that can be used with the USE_GUI flag, as follows:
Target location at 90th frame, motion length 120 frames:
New target at 40th frame, start from previous output motion:
Lower head location at 90th frame:
Note: For editing, we need inverted noise to start from. We use DDIM inversion on the input motion to obtain it; however, this inversion is only an approximation. If available, we can instead use the final noise from a previous optimization to avoid the approximation.
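For intuition, DDIM inversion runs the deterministic DDIM update in reverse: starting from the clean motion, it repeatedly re-noises the sample using the model's own noise prediction until the final timestep is reached. The sketch below is a generic version of that loop, assuming an eps-prediction model with signature model(x, t, cond) and an alphas_cumprod schedule; it is not the exact routine used in this repository.

```python
import torch

@torch.no_grad()
def ddim_invert(model, x0, alphas_cumprod, num_steps, cond=None):
    """Approximate the diffusion noise that regenerates x0 under deterministic DDIM."""
    timesteps = torch.linspace(0, len(alphas_cumprod) - 1, num_steps).long()
    x = x0
    for t_prev, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_prev, a_next = alphas_cumprod[t_prev], alphas_cumprod[t_next]
        eps = model(x, t_prev, cond)                                # predicted noise at the current step
        pred_x0 = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()   # implied clean sample
        x = a_next.sqrt() * pred_x0 + (1 - a_next).sqrt() * eps     # re-noise toward the next timestep
    return x  # approximate inverted noise x_T
```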
Starting from a noisy version of the above motion:
"a person is walking forward" and "a person is jumping sideway":
Original motion generated with "a person is walking slightly to the left":
The number of optimization steps can be adjusted with num_opt_steps in DNOOption, and the number of ODE solver steps with num_ode_steps in gen_dno.py.
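For example, a run with more optimization steps might be configured roughly as follows. The import path, constructor, and values are assumptions for illustration; check dno.py and sample/gen_dno.py for the actual definitions.

```python
# Sketch only: the names below mirror the options mentioned above; check dno.py
# and sample/gen_dno.py for the actual definitions.
from dno import DNOOption                 # assumed import path and class name

opt_conf = DNOOption(num_opt_steps=800)   # more optimization iterations per sample (illustrative value)
num_ode_steps = 10                        # ODE/DDIM steps used to decode the noise (illustrative value)
```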
Running the generation command will get you:
results.npy - a file with the text prompts and xyz positions of the generated animation
sample00_rep##.mp4 - a stick figure animation for each generated motion; rep00 is always the original motion.
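If you want to inspect the results programmatically, the file can be loaded with numpy. The sketch below assumes an MDM-style layout in which results.npy stores a pickled dictionary with keys such as 'motion' and 'text'; check your file if the keys differ.

```python
import numpy as np

# Assumption: results.npy stores an MDM-style pickled dict.
data = np.load("path/to/results.npy", allow_pickle=True).item()
print(data["text"])        # text prompt(s) used for generation
motion = data["motion"]    # xyz joint positions of the generated motions
print(motion.shape)        # e.g. [num_samples, num_joints, 3, num_frames]
```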
To create SMPL mesh per frame, run:
python -m visualize.render_mesh --input_path /path/to/mp4/stick/figure/file
This script outputs:
sample##_rep##_smpl_params.npy - SMPL parameters (thetas, root translations, vertices and faces)
sample##_rep##_obj - Mesh per frame in .obj format.
Notes:
The .obj files can be integrated into Blender/Maya/3DS-MAX and rendered using them.
The script also needs a GPU, which can be specified with the --device flag.
Do not change the original .mp4 path before running the script.
To animate the sequence, you can use the SMPL add-on with the theta parameters saved in sample##_rep##_smpl_params.npy (we always use beta=0 and the gender-neutral model).
The per-frame mesh data is also saved to the sample##_rep##_smpl_params.npy file for your convenience.
For automatic rendering with Blender, use the sample##_rep##_smpl_params.npy produced by the SMPL fitting command above to render the full-body animation.

The following script will evaluate the model on the HumanML3D dataset by adding noise to the ground truth motion. This will produce the DNO-MDM results in Table 2 of our paper.
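Conceptually, the inputs for this benchmark are perturbed ground-truth motions; the sketch below only illustrates that idea, and the actual noise level and feature layout used by eval_refinement may differ.

```python
import torch

# Illustration only: perturb a ground-truth motion before refinement.
gt_motion = torch.randn(1, 263, 1, 196)   # placeholder for a HumanML3D feature tensor [bs, feats, 1, frames]
noise_level = 0.05                        # assumed perturbation scale, not necessarily the value used in eval_refinement
noisy_motion = gt_motion + noise_level * torch.randn_like(gt_motion)
# DNO then optimizes the diffusion noise so that the decoded motion stays close
# to noisy_motion while remaining plausible under the motion prior.
```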
python -m eval.eval_refinement --model_path ./save/mdm_avg_dno/model000500000_avg.pt
The generation can be sped up by increasing the batch size in the evaluation() function, at the cost of GPU memory.
The following script will generate motions from the given text prompt and randomly change a location in a single frame to a new location. This will produce the DNO-MDM results in Table 1 of our paper.
python -m eval.eval_edit --model_path ./save/mdm_avg_dno/model000500000_avg.pt --text_prompt "a person is jumping" --seed 10
We used the following text prompts for the evaluation in our paper, mainly because they make it easy to judge whether the content is preserved: "a person is walking with raised hands", "a person is jumping", "a person is crawling", "a person is doing a long jump".
Our code is built upon many prior projects; we would like to thank the following contributors for the great foundation:
GMD, MDM, guided-diffusion, MotionCLIP, text-to-motion, actor, joints2smpl, MoDi.
This code is distributed under the MIT LICENSE.
Note that our code depends on other libraries, including CLIP, SMPL, SMPL-X, PyTorch3D, and uses datasets that each have their own respective licenses that must also be followed.