guxm2021 / SVT_SpeechBrain

[TOMM 2024] Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing

Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing

This is the author's official PyTorch implementation for our TOMM paper:

Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing

Since this paper is a journal extension of our previous work on multimodal ALT (MM-ALT), this repo only includes the code for AMT. Please refer to the MM-ALT code repo for ALT.

Project Description

Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics, while automatic music transcription (AMT) refers to transcribing singing voices into note events, i.e., musical MIDI notes. Despite the significant practical potential of these two tasks, they are still in a nascent stage. This is because transcribing lyrics and note events solely from singing audio is notoriously difficult: noise contamination, e.g., musical accompaniment, degrades both the intelligibility of sung lyrics and the recognizability of sung notes. To address this challenge, we propose a general framework for building multimodal ALT and AMT systems. We also curate the first multimodal singing dataset, comprising N20EMv1 and N20EMv2, which contains audio recordings and videos of lip movements together with ground-truth lyrics and note events. For model construction, we adapt self-supervised learning models from the speech domain as acoustic and visual encoders to alleviate the scarcity of labeled data, and we introduce a residual cross-attention mechanism to effectively integrate features from the audio and video modalities. Through extensive experiments, we demonstrate that our single-modal systems achieve state-of-the-art performance on both ALT and AMT. Through these single-modal experiments, we also analyze the individual contribution of each modality to the multimodal system. Finally, we combine the two modalities and demonstrate the effectiveness of our proposed multimodal systems, particularly in terms of noise robustness.
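To picture the residual cross-attention fusion mentioned above, here is a minimal PyTorch sketch. It is not the repo's actual module; the dimensions, module name, and single-direction fusion (audio queries attending to video) are illustrative assumptions only.

import torch
import torch.nn as nn

class ResidualCrossAttention(nn.Module):
    # Illustrative sketch only: fuse acoustic and visual features with
    # cross-attention plus a residual connection (hypothetical layout,
    # not the paper's exact module).
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, video_feats):
        # audio_feats, video_feats: (batch, time, dim), assumed frame-aligned.
        # Audio frames query the visual stream ...
        fused, _ = self.attn(query=audio_feats, key=video_feats, value=video_feats)
        # ... and the result is added back to the acoustic features (residual path).
        return self.norm(audio_feats + fused)

# Toy usage with random features standing in for encoder outputs.
audio = torch.randn(2, 100, 768)   # e.g. acoustic encoder output
video = torch.randn(2, 100, 768)   # e.g. visual encoder output
out = ResidualCrossAttention()(audio, video)
print(out.shape)                   # torch.Size([2, 100, 768])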

Method Overview

The following figure illustrates the shared framework of our multimodal ALT and AMT systems.

Installation

Environment

Install Anaconda and create the environment with Python 3.8.12, PyTorch 1.9.1, and CUDA 11.1:

conda create -n amt python=3.8.12
conda activate amt
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
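Optionally, you can verify the installation from a Python shell. The expected version string assumes the command above; CUDA availability depends on your machine.

import torch

print(torch.__version__)           # expected to show 1.9.1+cu111 if the install above succeeded
print(torch.cuda.is_available())   # True on a machine with a CUDA 11.1-compatible driver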

SpeechBrain

We run our experiments on top of the SpeechBrain toolkit. For simplicity, we removed the original recipes. To install SpeechBrain, run the following commands:

cd SVT_SpeechBrain
pip install -r requirements.txt
pip install --editable .

The Transformers library and other packages are also required:

pip install -r dependencies.txt

AV-Hubert

We adapt AV-Hubert (Audio-Visual Hidden Unit BERT) in our experiments. To enable the usage of AV-Hubert, run the following commands:

cd ..
git clone https://github.com/facebookresearch/av_hubert.git
cd av_hubert
git submodule init
git submodule update

Fairseq and its dependencies are also required:

pip install -r requirements.txt
cd fairseq
pip install --editable ./
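With AV-Hubert and fairseq installed, a pretrained checkpoint can be loaded roughly as follows. This is a hedged sketch, not the repo's actual loading code: the user-dir path and checkpoint file name are assumptions, and the recipes may take a different route.

from argparse import Namespace
from fairseq import checkpoint_utils, utils

# Register the AV-HuBERT tasks/models from the cloned av_hubert repo
# (path is an assumption; adjust to where the repo was cloned).
utils.import_user_module(Namespace(user_dir='av_hubert/avhubert'))

# Load a pretrained AV-HuBERT checkpoint (hypothetical file path).
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ['/path/to/av_hubert_checkpoint.pt']
)
avhubert_model = models[0].eval()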

Datasets

MIR-ST500

The MIR-ST500 dataset is the largest manually annotated AMT dataset for singing. It contains 500 Chinese pop songs (about 30 hours), with 400 songs for training and 100 songs for evaluation. To download and prepare the dataset, we follow its GitHub repository: https://github.com/york135/singing_transcription_ICASSP2021.

TONAS

The TONAS dataset is an evaluation set for singing AMT. It contains 72 Flamenco songs (36 minutes in total). We download the dataset from https://www.upf.edu/web/mtg/tonas.

ISMIR2014

The ISMIR2014 dataset is another evaluation set for singing AMT. It contains 38 pop songs (19 minutes in total): 14 sung by children, 13 by male adults, and 11 by female adults.

N20EMv2

We curate the N20EMv2 dataset ourselves for the multimodal AMT task. The dataset is available at https://zenodo.org/records/10814703.


Training and Evaluation

We follow the internal logic of SpeechBrain; you can run experiments as follows:

cd <dataset>/<task>
python experiment.py params.yaml

You may need to create CSV files according to our guidance in <dataset>/<task>. The results will be saved in the output_folder specified in the YAML file; both detailed logs and experiment outputs are saved there. Less verbose logs are also printed to stdout.
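The exact CSV layout is defined by the guidance under each <dataset>/<task> folder; the snippet below is only an illustrative sketch of building such a manifest, assuming hypothetical ID, duration, and wav columns and made-up file paths.

import csv

# Hypothetical example rows: (utterance id, duration in seconds, path to audio segment).
rows = [
    ("song001_seg001", 5.12, "/data/MIR-ST500/train/song001_seg001.wav"),
    ("song001_seg002", 4.87, "/data/MIR-ST500/train/song001_seg002.wav"),
]

with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "duration", "wav"])  # assumed columns; check <dataset>/<task>
    for utt_id, duration, wav_path in rows:
        writer.writerow([utt_id, duration, wav_path])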

Citation

@article{gu2024automatic,
  title={Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing}, 
  author={Gu, Xiangming and Ou, Longshen and Zeng, Wei and Zhang, Jianan and Wong, Nicholas and Wang, Ye},
  journal={ACM Transactions on Multimedia Computing, Communications and Applications},
  publisher={ACM New York, NY},
  year={2024}
}
@inproceedings{gu2022mm,
  title={Mm-alt: A multimodal automatic lyric transcription system},
  author={Gu, Xiangming and Ou, Longshen and Ong, Danielle and Wang, Ye},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  pages={3328--3337},
  year={2022}
}

We borrow code from SpeechBrain; please also consider citing their work.

Also Check Our Relevant Work

MM-ALT: A Multimodal Automatic Lyric Transcription System
Xiangming Gu, Longshen Ou, Danielle Ong, Ye Wang
ACM International Conference on Multimedia (ACM MM), 2022, (Oral)
[paper][code]

Elucidate Gender Fairness in Singing Voice Transcription
Xiangming Gu, Wei Zeng, Ye Wang
ACM International Conference on Multimedia (ACM MM), 2023
[paper]

License

SVT_SpeechBrain is released under the Apache License, version 2.0.