Audio Task(s) and Dataset(s)

twosixlabs / armory

ARMORY Adversarial Robustness Evaluation Test Bed

MIT License

Audio Task(s) and Dataset(s) #31

Closed davidslater closed 4 years ago

davidslater commented 4 years ago

It is unclear what the best datasets and tasks are for audio examples.

davidslater commented 4 years ago

(Carlini, 2018) use the Mozilla Common Voice dataset (speech-to-text): https://voice.mozilla.org/en/datasets

(Qin, 2019) use the LibriSpeech dataset (speech-to-text): https://www.openslr.org/12

The Mozilla dataset is definitely nicer to work with.

TIMIT is a simpler dataset, but requires becoming a member, which is annoying and costs $$$. https://catalog.ldc.upenn.edu/LDC93S1

The Free Spoken Digit Dataset is basically MNIST for audio, and could be a good starting place for classification: https://github.com/Jakobovski/free-spoken-digit-dataset

VoxCeleb would be a good place for a real-world speaker ID dataset for audio. http://www.robots.ox.ac.uk/~vgg/data/voxceleb/

davidslater commented 4 years ago

I am downloading the Mozilla Common Voice English dataset (30 GB) - it was really easy to get.

I think that if we can build a simple dataset out of it - call it Minizilla or something (similar to Imagenette) - that would be a great starting point. I'm going to see if it has sufficient label information to break it down into a classification task, but I'm guessing that it just has English transcripts.

davidslater commented 4 years ago

I am wary of starting with a speaker recognition scenario, as that can be a pretty challenging problem as you increase the number of speakers.

Phoneme classification may not be the easiest to turn into a specific task. I found links to Arabic and Persian phoneme material here, but the only English one was TIMIT: https://towardsdatascience.com/a-data-lakes-worth-of-audio-datasets-b45b88cd4ad

Free spoken digit might be the best starting point for a simple classification task.
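As a rough sketch of how little is needed to get class labels out of the free spoken digit recordings: the snippet below assumes the files are named `{digit}_{speaker}_{index}.wav` (for example `7_jackson_32.wav`), which should be checked against the repository README. `DigitClip` and `parse_fsdd_name` are hypothetical helper names, not part of any library.

```python
from pathlib import Path
from typing import NamedTuple


class DigitClip(NamedTuple):
    label: int    # spoken digit, 0-9 (the classification target)
    speaker: str  # speaker name taken from the filename
    index: int    # per-speaker recording index


def parse_fsdd_name(path: str) -> DigitClip:
    """Parse a filename such as '7_jackson_32.wav' (assumed naming scheme)."""
    stem = Path(path).stem
    # Split the digit off the front and the index off the back, so a
    # speaker name containing underscores would still parse.
    digit, rest = stem.split("_", 1)
    speaker, index = rest.rsplit("_", 1)
    return DigitClip(int(digit), speaker, int(index))


clip = parse_fsdd_name("recordings/7_jackson_32.wav")
```

Keeping the speaker name around also makes it possible to split train and test by speaker rather than by clip, if that turns out to matter.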

davidslater commented 4 years ago

Free spoken digits is super easy, and probably a good starting point for audio. Next, I'm working on a simple example with spectrograms or some basic 1D conv net. Thoughts? It would introduce the performers to some of the challenges of working with audio. (For starters, the audio inputs are of different lengths, even though each one has only a single output label.)
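One possible way to handle the variable-length issue before a spectrogram or 1D conv model, sketched with NumPy only: zero-pad or truncate every clip to a fixed number of samples, then take a magnitude short-time Fourier transform. The function names, the target length of 8000 samples, and the frame and hop sizes are illustrative assumptions, not values taken from this thread.

```python
import numpy as np


def pad_or_trim(wave, length):
    """Zero-pad or truncate a 1D waveform to exactly `length` samples."""
    wave = np.asarray(wave, dtype=np.float32)[:length]
    return np.pad(wave, (0, length - wave.shape[0]))


def magnitude_spectrogram(wave, frame_len=256, hop=128):
    """Magnitude STFT with a Hann window.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    Assumes len(wave) >= frame_len.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames * window, axis=1))


# Two clips of different lengths end up with identical shapes.
short_clip = pad_or_trim(np.ones(100), 8000)
long_clip = pad_or_trim(np.ones(10000), 8000)
spec = magnitude_spectrogram(short_clip)
```

Padding to a fixed length is the simplest option; masking or global pooling over time would avoid the wasted computation on silence, at the cost of a slightly more involved model.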

Mozilla Common Voice English dataset is easy to get (no $$$ or things you need to sign), and pretty basic - you get English transcripts, mp3 audio, and prescribed train/test splits. We couldn't use this for a simpler classification problem (it's basically only useful for end-to-end speech-to-text), but we could subset it pretty easily into a smaller dataset.
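A minimal sketch of the subsetting idea, assuming each prescribed split ships as a tab-separated file with one row per clip. The column names `path` and `sentence` and the helper `subset_rows` are assumptions for illustration; the real files should be inspected first.

```python
import csv
import io
import random


def subset_rows(tsv_file, n, seed=0):
    """Return up to `n` randomly chosen rows from a tab-separated split file."""
    # QUOTE_NONE keeps quotation marks inside transcripts from being
    # interpreted as CSV quoting.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(rows, min(n, len(rows)))


# Tiny in-memory stand-in for a split file; real column names may differ.
demo = io.StringIO(
    "path\tsentence\n"
    "a.mp3\thello world\n"
    "b.mp3\tgood morning\n"
    "c.mp3\tthank you\n"
)
small = subset_rows(demo, 2)
```

Running the same function separately on the train and test files keeps the prescribed splits intact in the smaller dataset.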

davidslater commented 4 years ago

LibriSpeech may be the way to go for a larger audio dataset.

It is not yet available in the TensorFlow Datasets catalog (https://www.tensorflow.org/datasets/catalog/overview), but it is under active development there: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/audio/librispeech.py

davidslater commented 4 years ago

LibriSpeech merged in.