
Can Language Models Learn to Listen?

This is the repo for the paper Can Language Models Learn to Listen?, appearing at ICCV 2023.

Setup Environment

Create a new Python 3 environment and install PyTorch 1.11.0 from https://pytorch.org/get-started/previous-versions/. Then install the requirements for this repo via pip install -r requirements.txt. Also, please clone the DECA (for visualization) and EMOCA (for emotion/valence evaluation) repositories, and set the following environment variables:

export PYTHONPATH=/PATH/TO/EMOCA/:$PYTHONPATH
export DECA_PATH=/PATH/TO/DECA/
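
If the two repositories are not already cloned, the setup might look like the following sketch (the URLs are placeholders for the official DECA and EMOCA repositories):

# Placeholder URLs; clone the official DECA and EMOCA repositories here.
git clone <DECA_REPO_URL> /PATH/TO/DECA/
git clone <EMOCA_REPO_URL> /PATH/TO/EMOCA/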

You will need to change the EMOCA emotion recognition code so that it does not require image input. In gdl/models/EmoDeca.py, add the following lines at the beginning of the forward method:

# If the batch already contains face codes (no raw image), use it directly;
# otherwise run the DECA encoder as before.
if 'image' not in batch:
    values = batch
else:
    values = self.deca.encode(batch, training=False)

You will also need to download the DECA and EMOCA models (there are instructions in those repos).

Data Preparation

Please download the data from the Google Drive folder: here. Place the data so that there are directories dataset/trevor, dataset/conan, dataset/stephen, and dataset/trevorconanstephen containing the corresponding segment files.

Note: If you want to use a cross-speaker VQ to train an LM Listener for a speaker (as we did for Conan and Stephen), copy the corresponding speaker's directory and then overwrite its mean.npy and std.npy files with the files from the trevorconanstephen directory. For instance, for Conan, copy dataset/conan to dataset/conanglobal and then copy dataset/trevorconanstephen/{mean,std}.npy to dataset/conanglobal/.
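
Concretely, the Conan example above corresponds to the following shell commands (a sketch, assuming the dataset/ layout described above; repeat analogously for Stephen):

# Copy the speaker directory, then overwrite its normalization statistics
# with the cross-speaker (trevorconanstephen) statistics.
cp -r dataset/conan dataset/conanglobal
cp dataset/trevorconanstephen/mean.npy dataset/trevorconanstephen/std.npy dataset/conanglobal/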

Pre-trained model

We provide a pre-trained VQ model and LM Listener for Trevor Noah here.

Training

The following command will train a VQ encoder-decoder:

python3 train_vq.py \
    --batch-size 256 \
    --lr 2e-4 \
    --total-iter 300000 \
    --lr-scheduler 200000 \
    --nb-code 256 \
    --down-t 3 \
    --depth 3 \
    --window-size 32 \
    --dilation-growth-rate 3 \
    --out-dir output \
    --dataname face_{trevor/trevorconanstephen} \
    --vq-act relu \
    --quantizer ema_reset \
    --loss-vel 0.5 \
    --recons-loss l1_smooth \
    --exp-name VQVAE_{trevor/trevorconanstephen}
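
After training with the settings above, the final VQ checkpoint should be the file that the LM Listener commands below reference via --resume-pth. For example (assuming --exp-name VQVAE_trevor and --out-dir output):

ls output/VQVAE_trevor/net_iter300000.pth   # final VQ checkpoint, passed to --resume-pth below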

The following command will train an LM Listener:

python train_t2m_trans.py \
    --exp-name listener_{trevor/conanglobal/stephenglobal} \
    --batch-size 8 \
    --nb-code 256 \
    --drop-out-rate 0.1 \
    --resume-pth output/VQVAE_{trevor/trevorconanstephen}/net_iter300000.pth \
    --vq-name VQVAE_{trevor/trevorconanstephen} \
    --out-dir output \
    --total-iter 100000 \
    --lr-scheduler 150000 \
    --lr 0.00005 \
    --dataname face_realtalkv2 \
    --down-t 2 \
    --depth 3 \
    --quantizer ema_reset \
    --eval-iter 2000 \
    --pkeep 0.50 \
    --dilation-growth-rate 3 \
    --vq-act relu \
    --max-motion-length 240 \
    --gpt2 gpt2-medium \
    --print_val_pred \
    --gradient_accumulation_steps 2 \
    --manual-bf16 \
    --delay-start-frames 96 \
    --max-time-before 3

Generation

The following command can be used to generate prediction files (in .npy format) from a trained LM Listener:

python train_t2m_trans.py \
    --exp-name listener_{trevor/conanglobal/stephenglobal} \
    --batch-size 8 \
    --nb-code 256 \
    --drop-out-rate 0.1 \
    --resume-pth output/VQVAE_{trevor/trevorconanstephen}/net_iter300000.pth \
    --vq-name VQVAE_{trevor/trevorconanstephen} \
    --out-dir output \
    --total-iter 0 \
    --lr-scheduler 150000 \
    --lr 0.00005 \
    --dataname face_trevor \
    --down-t 3 \
    --depth 3 \
    --quantizer ema_reset \
    --eval-iter 2000 \
    --pkeep 0.50 \
    --dilation-growth-rate 3 \
    --vq-act relu \
    --max-motion-length 240 \
    --gpt2 gpt2-medium \
    --print_val_pred \
    --gradient_accumulation_steps 2 \
    --manual-bf16 \
    --delay-start-frames 96 \
    --max-time-before 3 \
    --save-name subdir_where_predictions_will_be_saved \
    --seed 50 \
    --resume-trans /path/to/model/checkpoint.pth

Evaluation

The following command can be used to compute evaluation metrics for an LM Listener:

python evaluate_listener.py --output_dir output/{EXPERIMENT_NAME} --segments_path dataset/{trevor/conanglobal/stephenglobal}/segments_val.pth --mean_std_path dataset/{trevor/conanglobal/stephenglobal}/

Baselines

To produce a directory of predictions for the Random VQ, Random Train, and Nearest Neighbor baselines, use the following command templates:

python baselines.py --vq-dir dataset/{trevor/conanglobal/stephenglobal}/vqvae_{trevor/trevorconanstephen}_val/ --output-dir output/{trevor/conan/stephen}_random_vq --params-path path_to_vq_config.json --max-motion-length 240 --history-size 3 --mean-std-path dataset/{trevor/conanglobal/stephenglobal}/
python baselines.py --vq-dir dataset/{trevor/conanglobal/stephenglobal}/vqvae_{trevor/trevorconanstephen}_val/ --output-dir output/{trevor/conan/stephen}_nearest_neighbor --params-path path_to_vq_config.json --max-motion-length 240 --history-size 3 --mean-std-path dataset/{trevor/conanglobal/stephenglobal}/ --train-segments-path dataset/{trevor/conanglobal/stephenglobal}/segments_train.pth --val-segments-path dataset/{trevor/conanglobal/stephenglobal}/segments_val.pth --nearest-neighbor --embedding-model-name sentence-transformers/all-mpnet-base-v2 --batch-size 32 --normalize
python baselines.py --vq-dir dataset/{trevor/conanglobal/stephenglobal}/vqvae_{trevor/trevorconanstephen}_val/ --output-dir output/{trevor/conan/stephen}_random_train --params-path path_to_vq_config.json --max-motion-length 240 --history-size 3 --mean-std-path dataset/{trevor/conanglobal/stephenglobal}/ --train-segments-path dataset/{trevor/conanglobal/stephenglobal}/segments_train.pth --val-segments-path dataset/{trevor/conanglobal/stephenglobal}/segments_val.pth

The baseline predictions are saved in the same .npy format as those produced by the LM Listener.
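
To sanity-check a directory of predictions (from the LM Listener or any baseline), the .npy files can be loaded with NumPy, for instance as below (the segment file name here is hypothetical; the directory follows the --output-dir used above):

# Load one predicted segment and print its array shape and dtype.
python -c 'import numpy as np; a = np.load("output/trevor_random_vq/example_segment.npy"); print(a.shape, a.dtype)'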

Visualization

The following command can be used to generate visualizations for an LM Listener:

python visualize_listener.py --output_dir /path/to/output/dir/ --segments_path dataset/{trevor/conanglobal/stephenglobal}/segments_val.pth --default_code_path default_code_trevor_emoca2.pkl --params_path output/{EXPERIMENT_NAME}/config.json  --items output/{EXPERIMENT_NAME}/,vq,gt,video --mean_std_path dataset/{trevor/conanglobal/stephenglobal}/ --audio_root /path/to/raw/audios/ --video_root /path/to/raw/videos/ --fps 30

The --items parameter allows you to specify a comma-separated list of what to visualize. The options are: video (raw video), gt (the ground-truth EMOCA face reconstruction of the listener), vq (the VQ reconstruction of the listener), or a path to the output directory containing the predicted .npy files of an LM Listener.

Acknowledgements

Much of the code in this repo is taken from T2M-GPT.

Citation

@inproceedings{ng2023text2listen,
    title={Can Language Models Learn to Listen?},
    author={Ng, Evonne and Subramanian, Sanjay and Klein, Dan and Kanazawa, Angjoo and Darrell, Trevor and Ginosar, Shiry},
    booktitle={Proceedings of the International Conference on Computer Vision (ICCV)},
    year={2023}
}