Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos (CVPR 2023)
Project Page
Paper Link
Code
The code is now available under the `video-physics-sound-diffusion` folder!
Pre-processed Data and Pre-trained Weights Links
For questions or help, please open an issue or contact suk4 at uw dot edu
Requirements
- python=3.9
- pytorch=1.12.0
- torchaudio=0.12.0
- librosa=0.9.2
- auraloss=0.2.2
- einops=0.4.1
- einops-exts=0.0.3
- numpy=1.22.3
- opencv-python=4.6.0.66
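If you want to double-check that your environment matches these pins, an optional Python snippet like the one below prints the installed versions next to them (the package list is copied from above; einops-exts is omitted because it may not expose a version string):

```python
# Optional sanity check: print installed versions next to the pinned versions above.
import sys
import importlib

expected = {
    "torch": "1.12.0",
    "torchaudio": "0.12.0",
    "librosa": "0.9.2",
    "auraloss": "0.2.2",
    "einops": "0.4.1",
    "numpy": "1.22.3",
    "cv2": "4.6.0",   # opencv-python 4.6.0.66 reports "4.6.0"
}

print("python", sys.version.split()[0])
for name, want in expected.items():
    module = importlib.import_module(name)
    have = getattr(module, "__version__", "unknown")
    note = "" if have.startswith(want) else "  <-- differs from the pinned version"
    print(f"{name:12s} {have}{note}")
```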
Prepare Data
- Download the Greatest Hits dataset videos and metadata (txt files).
- Use video_to_frames.py to extract RGB frames from the videos and save them as 224x224 images (a rough stand-in sketch appears after this list). The processed RGB frames are available on Google Drive (zip file name: rgb).
- Use video_to_wav.py to extract the impact sound segments from the videos. The extracted audio files are available on Google Drive (zip file name: audio_data).
- Use extract_physics_params.py to extract the physics parameters from the audio and save the frequency, power, decay rate, ground-truth audio, and reconstructed audio as a pickle file.
- Use train_test_split_process_video_data.py to segment the video frames and save the train/test meta files. The processed meta files are available on Google Drive (segmented_video_data_split).
- Note: we mainly use the subset annotated with the hit action and static reaction. This yields ten representative material classes (glass, wood, ceramic, metal, cloth, plastic, drywall, carpet, paper, and rock) and roughly 10k impact sounds in total. You could also try using all sounds available in the dataset; although the annotations are noisy, we find that the physics + residual combination can still reconstruct the audio reasonably well.
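If you only want to understand what the frame-extraction step produces, here is a minimal sketch. It is not the repo's video_to_frames.py; the output naming, frame-rate handling, and paths below are assumptions:

```python
import os
import cv2  # opencv-python

def dump_frames(video_path: str, out_dir: str, size: int = 224) -> int:
    """Extract every frame of video_path, resize to size x size, save as JPEGs.
    Rough stand-in for video_to_frames.py; the real script may differ in
    frame-rate handling and file naming."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()          # frame is BGR, H x W x 3
        if not ok:
            break
        frame = cv2.resize(frame, (size, size), interpolation=cv2.INTER_AREA)
        cv2.imwrite(os.path.join(out_dir, f"{idx:06d}.jpg"), frame)
        idx += 1
    cap.release()
    return idx

# Example with hypothetical paths:
# n = dump_frames("data/videos/example.mp4", "data/rgb/example")
```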
Training and Inference for Sound Physics and Residual Prediction
- Check `sound_residual.yaml` and change the data root or other settings if needed.
- Under the `video-physics-sound-diffusion` directory, run `CUDA_VISIBLE_DEVICES=0 python tools/sound_residual_train.py --configs configs/sound_residual.yaml`.
- Once training is done, change `resume_path` in `sound_residual.yaml` to your model path (or use the pre-trained model here), then run `CUDA_VISIBLE_DEVICES=0 python tools/sound_residual_infer.py --cfg configs/sound_residual.yaml` to save both the physics and predicted residual parameters as pickle files.
- [ ] TODO: Add a Jupyter notebook to demonstrate how to reconstruct the sound (a rough sketch follows below).
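Until that notebook is added, the core idea is that the physics parameters describe damped sinusoidal modes, so a rough reconstruction is a sum of exponentially decaying sinusoids (the predicted residual would be added on top and is omitted here). The pickle keys, sample rate, and duration below are assumptions, not the repo's exact format:

```python
import pickle

import numpy as np
import torch
import torchaudio

SR = 44100  # assumed sample rate; use whatever the pickles were created with

def render_modes(freqs, powers, decays, duration=0.5, sr=SR):
    """Sum of exponentially decaying sinusoids (the 'physics' part of the sound)."""
    t = np.arange(int(duration * sr)) / sr
    audio = np.zeros_like(t)
    for f, p, d in zip(freqs, powers, decays):
        audio += p * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    peak = np.abs(audio).max()
    return audio / peak if peak > 0 else audio

# Hypothetical pickle keys -- check the output of extract_physics_params.py /
# sound_residual_infer.py for the actual layout.
with open("example_physics_params.pkl", "rb") as fh:
    params = pickle.load(fh)

audio = render_modes(params["freq"], params["power"], params["decay_rate"])
torchaudio.save("physics_only_recon.wav",
                torch.from_numpy(audio).float().unsqueeze(0), SR)
```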
Training for Physics-Driven Video-to-Impact-Sound Diffusion
- You must obtain the audio physics and residual parameters before training the diffusion model.
- We use visual features extracted from a pre-trained ResNet-50 + TSM classifier. We provide two types of features: 1) the features before the classifier layer, available here, and 2) the simpler, lower-dimensional logits, available here. (A rough feature-extraction sketch appears after this list.)
- Check `great_hits_spec_diff.yaml` and change the data root or other settings if needed.
- Under the `video-physics-sound-diffusion` directory, run `CUDA_VISIBLE_DEVICES=0 python tools/train.py --cfg configs/great_hits_spec_diff.yaml`.
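For intuition only, per-frame features "before the classifier layer" can be extracted along these lines. This sketch uses a plain torchvision ResNet-50 averaged over nothing fancier than a frame batch, not the ResNet-50 + TSM classifier used in the paper, and it needs torchvision/Pillow, which are not in the requirements list above:

```python
import glob
import os

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Plain ImageNet ResNet-50 as a stand-in backbone (the paper uses ResNet-50 + TSM).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)  # torchvision >= 0.13
backbone.fc = torch.nn.Identity()   # drop the classifier head -> 2048-d features
backbone.eval()

preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frame_dir: str) -> torch.Tensor:
    """Return a (num_frames, 2048) feature matrix for a directory of 224x224 frames."""
    frames = sorted(glob.glob(os.path.join(frame_dir, "*.jpg")))
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in frames])
    return backbone(batch)

# feats = frame_features("data/rgb/example")   # hypothetical path
```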
Generating Samples
- Step 0: Change `resume_path` in `great_hits_spec_diff.yaml` to your model path.
- Step 1: Under the `video-physics-sound-diffusion` directory, run `CUDA_VISIBLE_DEVICES=0 python tools/extract_latents.py --cfg configs/great_hits_spec_diff.yaml` to extract the physics latents and save them as pickle files.
- Step 2: Under the `video-physics-sound-diffusion` directory, run `CUDA_VISIBLE_DEVICES=0 python tools/query_latents.py --cfg configs/great_hits_spec_diff.yaml`, which uses the test visual features to query the closest physics latents in the training set (see the retrieval sketch after this list).
- Step 3: Run `CUDA_VISIBLE_DEVICES=0 python tools/generate_samples.py --configs configs/great_hits_spec_diff.yaml` to generate the wave files.
- Using a Pre-trained Model: First download the processed data and place it under the data_root you use in the config file. Also download the model weights and place them under the logs folder. Then run Step 3 to generate samples.
- [ ] TODO: Add a Jupyter notebook for an easier demo.
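Step 2 is essentially a nearest-neighbour lookup: each test clip's visual feature is matched against the training clips' visual features, and the physics latent of the closest training clip is used to condition generation. Below is a minimal sketch of that retrieval; the pickle layout and the cosine-similarity metric are assumptions, and query_latents.py is the authoritative implementation:

```python
import pickle

import torch
import torch.nn.functional as F

# Hypothetical layout: {"video_id": {"visual_feat": Tensor(D,), "physics_latent": Tensor}, ...}
with open("train_latents.pkl", "rb") as fh:
    train = pickle.load(fh)
with open("test_features.pkl", "rb") as fh:
    test = pickle.load(fh)

train_ids = list(train.keys())
train_feats = F.normalize(
    torch.stack([train[k]["visual_feat"] for k in train_ids]), dim=-1)   # (N, D)

matches = {}
for vid, item in test.items():
    q = F.normalize(item["visual_feat"], dim=-1)        # (D,)
    sims = train_feats @ q                               # cosine similarity to every training clip
    best = train_ids[int(sims.argmax())]
    matches[vid] = train[best]["physics_latent"]         # latent used to condition the diffusion model
```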
Citation
If you find this repo useful for your research, please consider citing the paper:
@inproceedings{su2023physics,
title={Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos},
author={Su, Kun and Qian, Kaizhi and Shlizerman, Eli and Torralba, Antonio and Gan, Chuang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={9749--9759},
year={2023}
}
Acknowledgements
Part of the code is borrowed from the following repo, and we would like to thank the authors for their contributions.
We would like to thank the authors of the Greatest Hits dataset for making this dataset possible.
We would like to thank Vinayak Agarwal for his suggestions on physics mode parameters estimation from raw audio.
We would like to thank the authors of DiffImpact for inspiring us to use physics-based sound synthesis to design physics priors as a conditional signal that guides the deep generative model in synthesizing impact sounds from videos.