Winning model of DCASE Challenge 2023 Task 6A, with the follow-up publication:
@inproceedings{wu2024improving,
  title={Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation},
  author={Wu, Shih-Lun and Chang, Xuankai and Wichern, Gordon and Jung, Jee-weon and Germain, Fran{\c{c}}ois and Le Roux, Jonathan and Watanabe, Shinji},
  booktitle={Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2024}
}
cd caption_evaluation_tools/coco_caption
bash get_stanford_models.sh
cd ../../
pip install -r requirements.txt
# if using conda
conda install bioconda::p7zip
# if installing to system
# sudo apt-get install p7zip-full
bash download_clotho.sh
Install Git-LFS
# if using conda
conda install conda-forge::git-lfs
git-lfs install
# if installing to system
# curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
# sudo apt-get install git-lfs
# git-lfs install
bash download_model.sh
bash run_sampling_reranking.sh
The evaluation metrics can then be found at:
./exp/inference_evaluation_nucleus_t0.5_p95/inference_metrics.json
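To inspect the resulting metrics programmatically, here is a minimal sketch. It assumes the file is a flat JSON mapping of metric names to scores; the actual key names depend on the evaluation script and are not guaranteed by this repository.

```python
import json

def load_metrics(path):
    """Load the metric-name -> score mapping from an inference_metrics.json file.

    Assumes a flat JSON object; adjust if the file nests results differently.
    """
    with open(path) as f:
        return json.load(f)

# Usage (path is the one produced by run_sampling_reranking.sh):
# scores = load_metrics("exp/inference_evaluation_nucleus_t0.5_p95/inference_metrics.json")
# for name, value in scores.items():
#     print(f"{name}: {value}")
```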
Our 50K mix-up caption augmentations generated by ChatGPT (see paper Section 2.3 for details) can be found at:
Our model/repository would not have been possible without the following great open-source works. Thank you so much!
coco-caption: https://github.com/tylin/coco-caption
caption-evaluation-tools: https://github.com/audio-captioning/caption-evaluation-tools
fense: https://github.com/felixgontier/dcase-2023-baseline/tree/main/fense