Winning model of DCASE Challenge 2023 Task 6A, with the follow-up publication:
@inproceedings{wu2024improving,
  title={Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation},
  author={Wu, Shih-Lun and Chang, Xuankai and Wichern, Gordon and Jung, Jee-weon and Germain, Fran{\c{c}}ois and Le Roux, Jonathan and Watanabe, Shinji},
  booktitle={Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2024}
}
cd caption_evaluation_tools/coco_caption
bash get_stanford_models.sh
cd ../../
pip install -r requirements.txt
# if using conda
conda install bioconda::p7zip
# if installing to system
# sudo apt-get install p7zip-full
bash download_clotho.sh
Install Git-LFS
# if using conda
conda install conda-forge::git-lfs
git-lfs install
# if installing to system
# curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
# sudo apt-get install git-lfs
# git-lfs install
bash download_model.sh
bash run_sampling_reranking.sh
The evaluation metrics can then be found at:
./exp/inference_evaluation_nucleus_t0.5_p95/inference_metrics.json
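To inspect the resulting metrics programmatically, here is a minimal sketch. It assumes the file is a flat JSON mapping of metric names to scores; the actual key names depend on the evaluation script and are not guaranteed by this repository.

```python
import json

def load_metrics(path):
    """Load the metric-name -> score mapping from an inference_metrics.json file.

    Assumes a flat JSON object; adjust if the file nests results differently.
    """
    with open(path) as f:
        return json.load(f)

# Usage (path is the one produced by run_sampling_reranking.sh):
# scores = load_metrics("exp/inference_evaluation_nucleus_t0.5_p95/inference_metrics.json")
# for name, value in scores.items():
#     print(f"{name}: {value}")
```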
Our 50K mix-up caption augmentations generated by ChatGPT (see paper Section 2.3 for details) can be found at:
Our model/repository would not have been possible without the following great open-source works. Thank you so much!
coco-caption: https://github.com/tylin/coco-caption
caption-evaluation-tools: https://github.com/audio-captioning/caption-evaluation-tools
fense: https://github.com/felixgontier/dcase-2023-baseline/tree/main/fense