
Audio-Synchronized Visual Animation

Project page: https://lzhangbj.github.io/projects/asva/asva.html

Lin Zhang¹, Shentong Mo², Yijing Zhang¹, Pedro Morgado¹

¹University of Wisconsin-Madison
²Carnegie Mellon University

ECCV 2024
Oral Presentation

Checklist

1. Create environment

We use the video_reader backend of torchvision to load audio and videos, which requires building torchvision from source:

conda create -n asva python==3.10 -y
conda activate asva

pip install torch==2.1.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

# Build torchvision from source
mkdir -p submodules
cd submodules
git clone https://github.com/pytorch/vision.git
cd vision
git checkout tags/v0.16.0
conda install -c conda-forge 'ffmpeg<4.3' -y
python setup.py install
cd ../..

pip install -r requirements.txt

export PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/submodules/ImageBind
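
As a quick sanity check (our suggestion, not part of the original setup; torchvision.set_video_backend raises an error when the video_reader backend was not compiled in):

python -c "import torchvision; torchvision.set_video_backend('video_reader'); print(torchvision.__version__)"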

2. Download pretrained models

Download required features/models

Please download and structure them as follows:

- submodules/
    - ImageBind/
- pretrained/
    - i3d_torchscript.pt
    - stable-diffusion-v1-5/
    - openai-clip-l_null_text_encoding.pt
    - AVID-CMA_Audioset_InstX-N1024-PosW-N64-Top32_checkpoint.pth.tar

Download pretrained AVSyncD and AVSync Classifier checkpoints

Model               Dataset           Checkpoint    Config   Audio CFG   FVD      AlignSync
AVSyncD             AVSync15          GoogleDrive   Link     1.0         323.06   22.21
                                                             4.0         300.82   22.64
                                                             8.0         375.02   22.70
AVSyncD             Landscapes        GoogleDrive   Link     1.0         491.37   24.94
                                                             4.0         449.59   25.02
                                                             8.0         547.97   25.16
AVSyncD             TheGreatestHits   GoogleDrive   Link     1.0         305.41   22.56
                                                             4.0         255.49   22.89
                                                             8.0         279.12   23.14

Model               Dataset           Checkpoint    Config   A2V Sync Acc   V2A Sync Acc
AVSync Classifier   VGGSS             GoogleDrive   Link     40.76          40.86

Please download the checkpoints you need and structure them as follows:

- checkpoints/
    - audio-cond_animation/
        - avsync15_audio-cond_cfg/
        - landscapes_audio-cond_cfg/
        - thegreatesthits_audio-cond_cfg/
    - avsync/
        - vggss_sync_contrast/

3. Demo

Generate animation from audio / image / video

The program first tries to load the audio from --audio and the image from --image. If either is not specified, the program falls back to loading the audio or image from the input video.

python -W ignore scripts/animation_demo.py --dataset AVSync15 --category "lions roaring" --audio_guidance 4.0 \
    --audio ./assets/lions_roaring.wav --image ./assets/lion_and_gun.png --save_path ./assets/generation_lion_roaring.mp4

python -W ignore scripts/animation_demo.py --dataset AVSync15 --category "machine gun shooting" --audio_guidance 4.0 \
    --audio ./assets/machine_gun_shooting.wav --image ./assets/lion_and_gun.png --save_path ./assets/generation_lion_shooting_gun.mp4
(Example outputs: lion roaring; lion shooting the gun.)
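
If both streams come from a single reference video, the fallback described above can drive the demo directly; this sketch assumes the script exposes a --video argument for that fallback:

python -W ignore scripts/animation_demo.py --dataset AVSync15 --category "lions roaring" --audio_guidance 4.0 \
    --video {reference video path} --save_path ./assets/generation_from_video.mp4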

Compute sync metrics for audio-video pairs

We have 3 metrics:

AVSync score

The raw output value of the AVSync Classifier for an input (audio, video) pair. It takes values in (−∞, +∞).

python -W ignore scripts/avsync_metric.py --metric avsync_score --audio {audio path} --video {video path}
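
For example, to score the lion animation generated in the demo above (both asset paths are taken from the demo commands):

python -W ignore scripts/avsync_metric.py --metric avsync_score --audio ./assets/lions_roaring.wav --video ./assets/generation_lion_roaring.mp4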

RelSync

Measures the synchronization of an (audio, video) pair using a reference.

To measure the synchronization of generated audio, the reference is the ground-truth audio:

python -W ignore scripts/avsync_metric.py --metric relsync --audio {generated audio path} --video {video path} --ref_audio {groundtruth audio path}

To measure the synchronization of generated video, the reference is the ground-truth video:

python -W ignore scripts/avsync_metric.py --metric relsync --audio {audio path} --video {generated video path} --ref_video {groundtruth video path}

AlignSync

Measures the synchronization of an (audio, video) pair using a reference video. It is only used to measure synchronization for video generation.

python -W ignore scripts/avsync_metric.py --metric alignsync --audio {audio path} --video {generated video path} --ref_video {groundtruth video path}
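
Similarly, the demo generation above can be scored against a real clip; the reference path is left as a placeholder, and any ground-truth clip of the same event can serve as the reference:

python -W ignore scripts/avsync_metric.py --metric alignsync --audio ./assets/lions_roaring.wav --video ./assets/generation_lion_roaring.mp4 --ref_video {groundtruth video path}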

4. Download datasets

Each dataset has 3 files/folders: a videos/ folder and the train.txt / test.txt split files.

Optionally, we precompute two files for ease of computation: class_mapping.json and class_clip_text_encodings_stable-diffusion-v1-5.pt (precomputed CLIP text encodings of the class names).

Download these files from GoogleDrive, and place them under the datasets/ folder.

To download the videos, please refer to each dataset's download instructions.

Overall, the datasets/ folder has the following structure:

- datasets/
    - AVSync15/
        - videos/
            - baby_babbling_crying/
            - cap_gun_shooting/
            - ...
        - train.txt
        - test.txt
        - class_mapping.json
        - class_clip_text_encodings_stable-diffusion-v1-5.pt
    - Landscapes/
        - videos/
            - train/
                - explosion/
                - ...
            - test/
                - explosion/
                - ...
            - ...
        - train.txt
        - test.txt
        - class_mapping.json
        - class_clip_text_encodings_stable-diffusion-v1-5.pt
    - TheGreatestHits/
        - videos/
            - xxxx_denoised_thumb.mp4
            - ...
        - train.txt
        - test.txt
        - class_clip_text_encodings_stable-diffusion-v1-5.pt
    - VGGSS/
        - videos/
            - air_conditioning_noise/
            - air_horn/
            - ...
        - train.txt
        - test.txt

5. Train and evaluate AVSyncD

Train

Training is done on 8 RTX-A4500 GPUs (20 GB) for AVSync15/Landscapes or 4 A100 GPUs for TheGreatestHits, with a total batch size of 64, using accelerate for distributed training and wandb for logging. Rolling checkpoints are overwritten every checkpointing_steps iterations; in addition, the checkpoints at the checkpointing_milestones-th iteration and at the last iteration are both kept. Please adjust these two parameters in the .yaml config file so that important weights are not overwritten when you customize the training recipe.

PYTHONWARNINGS="ignore" accelerate launch scripts/animation_train.py --config_file configs/audio-cond_animation/{datasetname}_audio-cond_cfg.yaml
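
For instance, to train on AVSync15 across 8 GPUs (using accelerate's standard --num_processes flag; the config name mirrors the pretrained checkpoint folders):

PYTHONWARNINGS="ignore" accelerate launch --num_processes=8 scripts/animation_train.py --config_file configs/audio-cond_animation/avsync15_audio-cond_cfg.yaml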

Results are saved to exps/audio-cond_animation/{dataset}_audio-cond_cfg, with the same structure as pretrained checkpoints.

Evaluation

Evaluation is a two-step process:

  1. Generate 3 clips per test-set video using scripts/animation_gen.py
  2. Evaluate the generated clips against the ground-truth clips using scripts/animation_eval.py

Please refer to scripts/animation_test_{dataset}.sh for the steps. For example, to evaluate AVSyncD pretrained on AVSync15 with an audio guidance scale of 4.0:

bash scripts/animation_test_avsync15.sh checkpoints/audio-cond_animation/avsync15_audio-cond_cfg 37000 4.0

6. Train and evaluate AVSync Classifier

Train

The AVSync Classifier is trained on the VGGSS training split for 4 days on 8 RTX-A4500 GPUs with a batch size of 32.

PYTHONWARNINGS="ignore" accelerate launch scripts/avsync_train.py --config_file configs/avsync/vggss_sync_contrast.yaml

Evaluation

We follow VGGSoundSync to sample 31 clips from each video, with a 0.04 s gap between neighboring clips. Given the audio/video clip at the center, we predict the index of its synchronized video/audio clip. A tolerance of 5 indices is applied, since humans are tolerant to asynchrony within 0.2 s (5 × 0.04 s).

For example, to evaluate our pretrained AVSync Classifier on 8 GPUs, run:

PYTHONWARNINGS="ignore" accelerate launch --num_processes=8 scripts/avsync_eval.py --checkpoint checkpoints/avsync/vggss_sync_contrast/ckpts/checkpoint-40000/modules --mixed_precision fp16 

Citation

Please consider citing our paper if you find this repo useful:

@inproceedings{linz2024asva,
    title={Audio-Synchronized Visual Animation},
    author={Lin Zhang and Shentong Mo and Yijing Zhang and Pedro Morgado},
    booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
    year={2024}
}