harvard-edge / multilingual_kws

Few-shot Keyword Spotting in Any Language and Multilingual Spoken Word Corpus
163 stars 37 forks source link

Reproducing paper results #39

Open sathibault opened 2 years ago

sathibault commented 2 years ago

I'm unable to train a working monolingual embedding model. Using the provided script (train_monolingual_embedding.py) with the top 165 English words yields the following results at the end of training: loss: 0.7145 - accuracy: 0.7711 - val_loss: 7.6774 - val_accuracy: 0.0586

Based on the paper, I was expecting something in the range of 70's for validation accuracy. Is it dependent on choosing the "right" words?

Could you please post a tutorial or maybe some of the missing files (e.g. train_files.txt, val_files.txt, test_files.txt, commands.txt) for reproducing the embedding?

I also notice that the file references seem to be to common voice rather than MSW. I'm using the English clips download from MSW which I'm assuming are the same. I've converted these to 16KHz, 16bit wav files using pydub which I guess is ffmpeg under the hood.