harvard-edge / multilingual_kws

Few-shot Keyword Spotting in Any Language and Multilingual Spoken Word Corpus

Where can I find the used keywords (total 760) and splits from the paper? #41

Closed V0XNIHILI closed 1 year ago

V0XNIHILI commented 1 year ago

First of all, thanks a lot for this work! It is incredibly useful for training strong models for keyword spotting.

I would like to train with the same data as mentioned in the paper. Where can I download this data, or a list of the files used for training/validation/testing (see the image below for the dataset I am looking for)?

image

For example, the newest version of the dataset on mlcommons.org now has more than 340k keywords:

image

So is there such an overview somewhere? I couldn't find it on the MLCommons website or in this repo, but maybe I missed it.

V0XNIHILI commented 1 year ago

@mmaz could you help me out with this? Many thanks in advance!

mmaz commented 1 year ago

Hi, the list of samples used in the Interspeech paper is not publicly archived; only the model weights and code are released. However, a similar set of samples can be selected from MSWC by following the per-language sample counts provided in the paper. The MSWC alignments can also be used to extract samples from the original Common Voice data that are padded with contextual audio instead of silence.
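
For anyone attempting this, here is a minimal sketch of the two steps described above. It is not the procedure used for the paper. The function names, the in-memory dictionaries, and the one-second window length are all assumptions made for illustration. Loading the MSWC file lists and alignment timestamps is left out, since that depends on the release you download.

```python
import random
from typing import Dict, List, Tuple


def select_samples(
    samples_by_language: Dict[str, List[str]],
    target_counts: Dict[str, int],
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Draw up to target_counts[lang] samples per language, reproducibly.

    Both arguments are hypothetical inputs: sample IDs grouped by language,
    and the per-language counts copied by hand from the paper.
    """
    rng = random.Random(seed)
    selected = {}
    for lang, target in target_counts.items():
        # Sort first so the draw does not depend on input ordering.
        pool = sorted(samples_by_language.get(lang, []))
        k = min(target, len(pool))  # never ask for more than is available
        selected[lang] = rng.sample(pool, k)
    return selected


def context_window(
    word_start: float,
    word_end: float,
    clip_duration: float,
    window: float = 1.0,  # assumed target length in seconds
) -> Tuple[float, float]:
    """Return a (start, end) window centered on the word.

    The window is shifted to stay inside the clip, so the extracted audio is
    padded with neighboring speech rather than silence.
    """
    if clip_duration <= window:
        # The clip is shorter than the window: take all of it.
        return 0.0, clip_duration
    center = (word_start + word_end) / 2.0
    start = center - window / 2.0
    start = max(0.0, min(start, clip_duration - window))
    return start, start + window
```

The returned window can then be used to slice the original Common Voice recording with whatever audio library you prefer. Fixing the seed keeps the selection repeatable, although it will still differ from the unpublished list used in the paper.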

V0XNIHILI commented 1 year ago

Thanks for the clarification!