
Recipe for LibriPhrase

:loudspeaker: [09/02/23] We have released the official test dataset used in our paper. For a fair comparison, please use this dataset.

About the dataset

"LibriPhrase" is an open-source dataset tailored for tasks such as user-defined keyword spotting (UDKWS), open-vocabulary keyword spotting (OV-KWS), and spoken-term detection (STD). It is constructed from the LibriSpeech corpus, and we adhere to the LibriSpeech license terms.

In this recipe, we take advantage of several excellent tools and libraries; see the References section below.

Examples of LibriPhrase

Anchor text | Easy negative text | Hard negative text
friend | guard, comfort, superior | frind, rend, trend
the river | every morning, town with, not occurred | the giver, the liver, the rigor
i mean to | and be made, be a banner, no less than | i seen to, i mean you, we mean to
at the right time | began the kissing and, rubbing two bits of, conseil and land spent | at the same time, at the one time, knew the right time
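
As the table suggests, hard negatives are phrases that sound close to the anchor, while easy negatives are unrelated phrases. As a rough illustration only (not the exact procedure used in this recipe), the sketch below scores phoneme-level similarity between an anchor and candidate phrases using grapheme-to-phoneme conversion [5] and Levenshtein distance [4]; the normalization is an assumption made here for readability.

# Rough sketch (not the recipe's exact procedure): phoneme-level similarity
# between an anchor and candidate phrases, using g2p [5] and edit distance [4].
from g2p_en import G2p

g2p = G2p()

def levenshtein(a, b):
    # Standard edit distance over two sequences (reference [4]).
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def phoneme_distance(anchor, candidate):
    # Normalized phoneme-level edit distance: 0.0 means identical pronunciation.
    p1 = [p for p in g2p(anchor) if p.strip()]
    p2 = [p for p in g2p(candidate) if p.strip()]
    return levenshtein(p1, p2) / max(len(p1), len(p2))

for cand in ["guard", "comfort", "frind", "trend"]:
    print(cand, round(phoneme_distance("friend", cand), 2))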

Data pipeline

./libriphrase.py
./utils.py
./run.sh
./requirements.txt
./data/
  ├── librispeech_clean_dev_all_utt.csv
  ├── ...
  ├── librispeech_other_train_500h_all_utt.csv
  ├── ...
  ├── libriphrase_diffspk_all_1word.csv
  ├── ...
  └── libriphrase_diffspk_all_4word.csv

/LibriSpeech_ASR_corpus/
  ├── dev-clean/
  ├── ...
  └── train-other-500/
       ├── BOOKS.TXT
       ├── CHAPTERS.TXT
       ├── ...
       └── train-other-500/
            ├── 1006/
            ├── ...
            └── 985/
                ├── 126224/       
                ├── ...
                └── 126228/
                    ├── 985-126228-0000.wav
                    ...
/LibriPhrase_diffspk_all/
  ├── dev-clean/
  ├── ...
  └── train-other-500/
       └── train-other-500/
            ├── 1006/
            ├── ...
            └── 985/
                ├── 126224/       
                ├── ...
                └── 126228/
                    ├── 985-126228-0004_1word_0.wav
                    ...
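
Both trees above end in per-utterance wav files. As an optional sanity check (a sketch, not part of the recipe), the snippet below counts the wav files under each root; the two root paths are the example locations from the trees and should be replaced with your own.

# Sketch: count wav files under the LibriSpeech source and LibriPhrase output roots.
import os

LIBRISPEECH_ROOT = "/LibriSpeech_ASR_corpus"   # downloaded LibriSpeech corpus (example path)
LIBRIPHRASE_ROOT = "/LibriPhrase_diffspk_all"  # output written by this recipe (example path)

for root in (LIBRISPEECH_ROOT, LIBRIPHRASE_ROOT):
    if not os.path.isdir(root):
        print(f"missing: {root}")
        continue
    n_wav = sum(f.endswith(".wav") for _, _, files in os.walk(root) for f in files)
    print(f"{root}: {n_wav} wav files")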

Getting started

Environment

This work was performed with the package versions listed in requirements.txt.

1. Preparation

Before you start, make sure to download the LibriSpeech ASR corpus.
Once the download is complete, clone this repository and install the package dependencies:

git clone https://github.com/gusrud1103/LibriPhrase.git
cd LibriPhrase
pip install -r requirements.txt

Afterwards, download the alignment CSV files from the provided Google Drive link and place them in the data folder.

mkdir data
cd data      # place the downloaded CSV files in this folder
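
Before generating data, it can help to confirm that the alignment CSVs are where the scripts expect them. The sketch below only checks for two of the filenames listed in the Data pipeline section; the full set is larger.

# Sketch: confirm the alignment CSVs are in place (run from the repository root).
import glob

csvs = glob.glob("data/*.csv")
print(f"found {len(csvs)} CSV files in data/")
for expected in ("librispeech_other_train_500h_all_utt.csv",
                 "libriphrase_diffspk_all_1word.csv"):
    found = any(path.endswith(expected) for path in csvs)
    print(f"{expected}: {'ok' if found else 'MISSING'}")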

2. Generating LibriPhrase

This recipe extracts short phrases (1–4 words) from the LibriSpeech dataset and organizes them into three categories: anchor, positive, and negative.
By controlling the mode argument of the script, you can customize the dataset's difficulty (easy/hard) and speaker type (same/different).
The script creates a hierarchical folder structure based on the LibriSpeech dataset and the supplied arguments (please refer to the Data pipeline section above).

You can use this recipe to generate the test dataset. (We will also release the recipe for the training set soon.)
To ensure a fair comparison with our paper, we offer the official test dataset, which was generated from the "train-other-500" split of LibriSpeech.

./run.sh

or

python3 libriphrase.py \
    --libripath 'your path(librispeech wav files)' \
    --newpath 'new path(libriphrase wav files)' \
    --wordalign './data/librispeech_other_train_500h_all_utt.csv' \
    --output './data/testset_librispeech_other_train_500h_short_phrase.csv' \
    --numpair 3 \
    --maxspk 1611 \
    --maxword 4 \
    --mode 'diffspk_all'

Arguments

See libriphrase.py for the full set of arguments; the example command above shows the values used to generate the official test set.

3. Results

The script writes the generated phrase wav files under the --newpath directory and a metadata CSV to the --output path; see libriphrase.py for the definition of each column.
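
To inspect the generated metadata, the CSV written to --output can be loaded directly. A minimal sketch, assuming pandas is available; the column names come from libriphrase.py and are printed rather than assumed here.

# Minimal sketch for inspecting the metadata CSV written to --output.
import pandas as pd

df = pd.read_csv("./data/testset_librispeech_other_train_500h_short_phrase.csv")
print(df.columns.tolist())  # the columns produced by the recipe
print(df.head())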

Reference

[1] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015.
[2] Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio, "Speech Model Pre-training for End-to-End Spoken Language Understanding," in INTERSPEECH, 2019.
[3] Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger, "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," in INTERSPEECH, 2017.
[4] Vladimir I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," Soviet Physics Doklady, vol. 10, pp. 707–710, 1966.
[5] Jongseok Park and Kyubyong Kim, "g2pE," https://github.com/Kyubyong/g2p, 2019.

License

Distributed under the MIT License. See LICENSE for more information.

Citation

If you use this code, please cite:

@inproceedings{Shin22LibriPhrase,
  author={Hyeon-Kyeong Shin and Hyewon Han and Doyeon Kim and Soo-Whan Chung and Hong-Goo Kang},
  title={Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting},
  year=2022,
  booktitle={INTERSPEECH},
  pages={1871--1875},
  doi={10.21437/Interspeech.2022-580}
}