kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Commonvoice results misleading, complete overlap of train/dev/test sentences #2141

Closed bmilde closed 6 years ago

bmilde commented 6 years ago

I was quite surprised to see how low the WERs are for the new Common Voice corpus: https://github.com/kaldi-asr/kaldi/blob/master/egs/commonvoice/s5/RESULTS (4ish% TDNN)

Unfortunately, these results seem to be bogus because there is a near complete overlap of train/dev/test sentences and the LM is only trained on the corpus train sentences (https://github.com/kaldi-asr/kaldi/blob/master/egs/commonvoice/s5/local/prepare_lm.sh). To make matters worse, there aren't really that many unique sentences in the corpus:

unique sentences in train: 6994
unique sentences in dev: 2410
unique sentences in test: 2362
common sentences train/dev (overlap) = 2401
common sentences train/test (overlap) = 2355
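For reference, these counts can be reproduced with something along these lines (a rough sketch: it assumes the transcript is the second comma-separated field, contains no embedded commas, and that each CSV starts with a header row):

    for split in train dev test; do
      # skip the header, keep the transcript column, deduplicate
      tail -n +2 cv-valid-${split}.csv | cut -d, -f2 | sort -u > ${split}.sents
    done
    wc -l train.sents dev.sents test.sents     # unique sentences per split
    comm -12 train.sents dev.sents | wc -l     # sentences shared by train and dev
    comm -12 train.sents test.sents | wc -l    # sentences shared by train and test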

This can also be easily verified by, e.g., grepping for "sadly my dream of becoming a squirrel whisperer may never happen" in the original corpus CSVs:

cv-valid-dev.csv:cv-valid-dev/sample-000070.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,seventies,male,us,
cv-valid-dev.csv:cv-valid-dev/sample-000299.mp3,sadly my dream of becoming a squirrel whisperer may never happen,5,2,twenties,female,canada,
cv-valid-dev.csv:cv-valid-dev/sample-002458.mp3,sadly my dream of becoming a squirrel whisperer may never happen,9,1,,,,
cv-valid-dev.csv:cv-valid-dev/sample-003264.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-dev.csv:cv-valid-dev/sample-003656.mp3,sadly my dream of becoming a squirrel whisperer may never happen,2,1,,,,
grep: cv-valid-test: Is a directory
cv-valid-test.csv:cv-valid-test/sample-000221.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,thirties,male,canada,
cv-valid-test.csv:cv-valid-test/sample-001576.mp3,sadly my dream of becoming a squirrel whisperer may never happen,2,1,,,,
cv-valid-test.csv:cv-valid-test/sample-002831.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-test.csv:cv-valid-test/sample-003705.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-test.csv:cv-valid-test/sample-003789.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
grep: cv-valid-train: Is a directory
cv-valid-train.csv:cv-valid-train/sample-000324.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,2,,,,
cv-valid-train.csv:cv-valid-train/sample-000373.mp3,sadly my dream of becoming a squirrel whisperer may never happen,5,1,,,,
cv-valid-train.csv:cv-valid-train/sample-000382.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-train.csv:cv-valid-train/sample-001026.mp3,sadly my dream of becoming a squirrel whisperer may never happen,4,0,,,,
cv-valid-train.csv:cv-valid-train/sample-003106.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,fourties,female,england,
cv-valid-train.csv:cv-valid-train/sample-004591.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-train.csv:cv-valid-train/sample-005048.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
cv-valid-train.csv:cv-valid-train/sample-007144.mp3,sadly my dream of becoming a squirrel whisperer may never happen,3,0,,,,
+ 100s more...

Now, this is pretty much a terrible design for a speech corpus, but I suggest excluding the train sentences from the LM completely, to get somewhat more realistic results. I'm currently rerunning the scripts with a Cantab LM that excludes the train sentences and will report back when I have the results.
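A minimal sketch of the LM text preparation I have in mind (file names are illustrative, and the external text would need the same normalization as the prompts before filtering):

    # collect the Common Voice train prompts
    tail -n +2 cv-valid-train.csv | cut -d, -f2 | sort -u > train_prompts.txt
    # drop every line of the external (Cantab) LM text that is exactly one of those prompts
    grep -Fxv -f train_prompts.txt cantab_lm_text.txt > lm_train_text.txt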

entn-at commented 6 years ago

Well, that's the way the corpus was designed by the people at Mozilla (including the overlap in spoken content in the predefined train/dev/test splits). You should address your concerns to the people running this data collection effort (https://voice.mozilla.org/data, https://github.com/mozilla/voice-web). I believe Voxforge is similar in that there is an overlap in prompts.

This recipe is just using the corpus in its intended way (using the train/dev/test splits as provided). The results are not misleading, and to be sure, nobody is claiming that test results on this corpus somehow generalize to any other datasets/conditions.

EDIT: This isn't really a Kaldi problem; I suggest moving the discussion to kaldi-help: https://groups.google.com/forum/#!forum/kaldi-help

bmilde commented 6 years ago

Thanks for your fast reply! Also, I see that you (@entn-at) wrote the scripts for this corpus. Thank you very much for adapting them so fast!

I wrote similar scripts for Eesen (RNN-CTC) and was puzzled by the WER difference (~16% vs. ~4%). But I used kaldi_lm, which doesn't let you finish training if it discovers the training sentences in the dev data, so the difference was that I had omitted them from LM training.

Despite this not being a Kaldi problem, I'd still suggest the following enhancements to the Kaldi Common Voice recipe:

Ultimately, I agree that this is a corpus problem. I will address my concerns on the mozilla git and link to this Issue (would be nice to keep the discussion here open as well) - this is surely not what they had in mind.

jtrmal commented 6 years ago

I would probably suggest doing two different decoding passes. If the corpus was designed a specific way, then for comparability there should be an easy way to get the reference numbers. I have certain doubts about the utility or usefulness of the Mozilla corpora anyway. Y.
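Something like this, roughly (directory and lang names purely illustrative):

    # graph 1: LM built the way the corpus was designed, for reference numbers
    utils/mkgraph.sh data/lang_test_corpus exp/tri3 exp/tri3/graph_corpus
    # graph 2: LM built without the corpus prompts, for more realistic numbers
    utils/mkgraph.sh data/lang_test_noprompt exp/tri3 exp/tri3/graph_noprompt
    for lm in corpus noprompt; do
      steps/decode.sh --nj 20 --cmd "$decode_cmd" \
        exp/tri3/graph_$lm data/test exp/tri3/decode_test_$lm
    done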


entn-at commented 6 years ago

The recipe in its current form was more of a research exercise and not intended to produce a set of models fit for more general use/distribution (like the models published at http://www.kaldi-asr.org/models.html). I fully agree that your suggestions would make this recipe more valuable to other users and would be more in line with what the people at Mozilla had intended. With @danpovey's and @jtrmal's approval, we could create a second version (s5b?) of the recipe that includes your proposed changes. In that case, I would like to encourage you to make PRs with your changes using the Cantab LM.

Ultimately, though, I somewhat share @jtrmal's doubts regarding the corpus (with its current design). The people working on Mozilla's DeepSpeech implementation seem to be using various corpora for model training (including non-free corpora like Fisher, SWBD, see their import scripts). Perhaps a subset of the CV corpus could be added to egs/multi_en?

danpovey commented 6 years ago

I don't have super strong opinions about the structure... s5b would be a reasonable approach, I guess.

kaldi_lm may crash if, in the metaparameter optimization, it detects that your dev data was likely taken from the training data. The way this check works, it likely wouldn't crash if even a small proportion of the dev data were distinct from the training data.

So this likely means there is near-100% overlap.


bmilde commented 6 years ago

Exactly, the overlap is 99.6%. kaldi_lm rightfully refuses to work on that; I think it even showed a graceful error message.

I've also posted the problem here on the mozilla discourse platform: https://discourse.mozilla.org/t/common-voice-v1-corpus-design-problems-overlapping-train-test-dev-sentences/24288

Chances are they aren't really aware of it.

jtrmal commented 6 years ago

My preference would be just adding another decoding script to s5 (and documenting that in the README) to reduce duplication. But it's not worth my time to argue about this, so Ewald, your call. y.


jtrmal commented 6 years ago

I'm not going to discuss whether kaldi_lm is rightful or not, but the usual discounting methods (KN, GT) have issues with artificial-looking data, often artificially generated or grammar-induced data. Witten-Bell discounting should work in that case (it's probably not implemented in kaldi_lm). y.
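For anyone who wants to try it: with SRILM, a Witten-Bell-discounted LM can be built roughly like this (file names illustrative):

    ngram-count -order 3 -text lm_text.txt -wbdiscount -lm wb_3gram.arpa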


mikehenrty commented 6 years ago

Hi Kaldi folks, I'm the maintainer of the Common Voice project, and as such am responsible for both our small text corpus and improper split.

First of all, let me say that it was great to see Common Voice data integrated into the Kaldi project so quickly. Also, big thanks to @bmilde for finding and reporting this bug, and for suggesting a way to fix this on our repo. For a discussion of our plan for addressing this, you can check out that bug.

One thing I noticed in this thread:

I have certain doubts about the utility or usefulness of the Mozilla corpora anyway.

Both @jtrmal and @entn-at seemed to have this sentiment (perhaps others?). Are there problems other than our small corpus size and train/dev/test split that you are concerned about? We are in the process of updating Common Voice now, and it would be super helpful to get your feedback so we can make our data more useful to your project (and other STT engines).

On the topic of text corpora, we are working with the University of Illinois to get them to release the Flickr30K text corpus under CC0, which would allow us to use its 100K-some sentences. For an example of what these look like, see: https://raw.githubusercontent.com/mozilla/voice-web/master/server/data/not-used/flickr30k.txt

If it's not too much to ask, we would love expert feedback on whether the above corpus would be helpful for speech engines. You'll notice the sentences are all descriptions of images, so the utterances have a lot of repeated words. Unfortunately, I am not a speech technologist, so I don't have any intuition as to whether this is the right kind of data for utterances. Again, any info you could provide would be very helpful.
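For instance, would a quick word-frequency profile of the prompt set tell you anything useful? A rough sketch (tokenization here is just lowercasing and splitting on whitespace):

    tr '[:upper:]' '[:lower:]' < flickr30k.txt \
      | tr -s '[:space:]' '\n' \
      | sort | uniq -c | sort -rn > word_counts.txt
    wc -l < word_counts.txt   # vocabulary size
    head word_counts.txt      # most frequent words and their counts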

Thanks again, and thanks for maintaining Kaldi!

ognjentodic commented 6 years ago

Here are some suggestions:

  • use wav files (16kHz, 16bits), not mp3
  • if possible, disable OS audio feedback in the app that collects the data; in a number of recordings I've noticed that there were different types of beeps in the beginning; later on I realized this may have been due to OS audio feedback, e.g. when tapping a button, on Android
  • keep speaker information (some sort of hash) across different recordings
  • less focus on reading, more on spontaneous speech (this will end up being more costly, since speech data will need to be transcribed); you can ask people to talk about a number of different topics, have two people talk to each other, or have a person talk to a bot; you can also take a look at the LDC website and get some inspiration from datasets collected in the past (I would not necessarily blindly follow their approaches though)
  • additional metadata that could be useful: tags/labels for extraneous speech/noise, etc.; device info; user demographics; headset/bluetooth
  • capture data in different acoustic environments (and when possible, capture metadata about the environment as well)

sikoried commented 6 years ago

As Ognjen said, unless you're looking for models to be used in a command/control scenario (like Echo or Home) rather than a more general model, you'd be looking for spontaneous speech, ideally among people (of balanced age and gender). The hard part is transcribing it, since badly (or inconsistently) transcribed data is not of much use (which is also the reason why well-transcribed data is so expensive). Here's a comparable real-life analogy: say you're learning a new language; you may understand the news on TV (read out by professional speakers), but have no clue when somebody talks to you on the street (spontaneous, slang, accent, background noise, ...).

Korbinian.


vdp commented 6 years ago

I completely agree with Ognjen's and Korbinian's suggestions, especially the one about conversational speech. This is indeed the major missing piece when it comes to open speech corpora. At this point, collecting more scripted English is not going to be very useful IMO, except maybe for distant speech or speech in noisy environments; I'm not familiar with those domains, so perhaps someone who is researching robust recognition will chime in.

I'd say for clean, read English speech the data problem is pretty much solved. For example, LibriSpeech's results for non-accented English, i.e. {dev,test}-clean, are already fairly close to the theoretical minimum; keep in mind that some of those errors are actually due to different spellings of some names, etc.

As was already pointed out, a conversational corpus is going to be much more costly in terms of corpus design, data collection effort, and transcription fees, but in contrast to read speech it's going to be very useful. Aside from the cost, conversational speech is hard to come by, as it's rarely released in public because it tends to be private in nature.

If English conversational speech is not an option, then I would suggest at least concentrating on languages other than English. The coverage there is more than spotty, so even read speech is going to be useful.

entn-at commented 6 years ago

The prompted nature of the collected speech was my main concern as well. Perhaps you can collect transcribed conversational speech via crowd sourcing as follows (I'm posting some ideas here and explicitly want to invite constructive criticism/comments from everybody; I know this has nothing to do with Kaldi per se, but people here have a lot of experience with data collected for ASR and can give valuable advice to @mikehenrty):

  • To collect conversational speech, let users call other users via WebRTC. Conversation will be easier if users already know each other. There are ways of recording a WebRTC call, for example using RTCMultiConnection and RecordRTC. AFAIK, WebRTC gateways like the open-source Janus project have plugins for recording calls. Each caller needs to be informed that the call is being recorded (and for what purpose), of course. There are obvious privacy issues in releasing the data and crowd-sourcing the transcription effort, and reminders not to disclose private information only go so far.
  • For word-level transcription, use existing LVCSR systems (e.g. based on Kaldi, Mozilla DeepSpeech, or cloud speech APIs) to segment calls (i.e., create relatively short chunks) and transcribe them. Human listeners verifying these automatically transcribed segments could listen to them and provide feedback in multiple ways:
    1. Transcription correct: yes/no. This is the least-effort feedback option.
    2. Add buttons (or click-/tap-able areas) for each word and between words. Users can click/tap on words to indicate insertions/substitutions, or between words to indicate deletions. More effort is required of the listener.
    3. In addition to (2), let users correct incorrectly transcribed segments / enter a fully manual transcription. Higher effort required.

Keeping the required effort low is important for crowd sourcing, but as has been pointed out, badly or inconsistently transcribed speech isn't that useful. There is no question that a professionally designed and transcribed speech corpus is preferable in every way, but I'd be interested in feedback/suggestions (also and especially in the form of "This won't work because...").

Disclaimer: I'm not involved in this data collection effort, I'm just trying to be helpful.

mikehenrty commented 6 years ago

Wow, great feedback and ideas here everyone. Thank you for lending us your brains. I agree with @entn-at that perhaps the Kaldi GitHub is not the best place to discuss making Common Voice better (for instance, I would rather see this on our Discourse channel 🤓 ). That said, one of the goals of Common Voice is to make open source speech technology better, so it's useful for us to come to where the Kaldi folks are. I apologize if it feels like we are hijacking this thread.

Ok, on to the suggestions. I would like to try to comment on all the thoughts and ideas I see here. If I miss anything, I apologize. Also, if anyone thinks of anything else to add, by all means keep the ideas coming!

use wav files (16kHz, 16bits), not mp3

We record the audio from a variety of browsers, OSes, and devices. Sadly this gives us audio in many formats and bit rates. In addition to this, we must support audio playback on these devices (for human audio validation). MP3 gave us a good trade-off between browser/device support (so we didn't have to transcode on the fly every time a user wanted to listen to/validate a clip), file size (for downloading data), and quality. We spoke with both our internal DeepSpeech team and a speech researcher at SNIPS.ai (a speech start-up), and neither seemed concerned about the file format (artifacts and all) or bitrate. I would love to hear some thoughts about how important this is for Kaldi (or any other speech projects for that matter).

if possible, disable OS audio feedback in the app that collects the data; in a number of recordings I've noticed that there were different types of beeps in the beginning; later on I realized this may have been due to OS audio feedback, eg. when tapping a button, on Android

Our deepspeech team specifically did not want us to remove this from the data, the argument being it would make the resultant engine more resilient to this kind of thing.

keep speaker information (some sort of hash), across different recordings

We do indeed have this information, but have opted not to release it just yet as we are still trying to understand the privacy implications. Will these speaker IDs be useful for speech-to-text? We realize they are useful for things like speaker identification and/or speech synthesis, but that is not the focus of Common Voice at this time.

less focus on reading, more on spontaneous speech...

Good suggestion on taking inspiration from the LDC. Indeed, we want to create something like the Fisher Corpus using our website, but that requires a rethink of our current (admittedly simplistic) interaction model. Big thanks to @entn-at for the thoughtful comments on how we could make this work. I completely agree that level of effort is something we need to pay close attention to. And if we can make this somehow fun, or useful in a way besides providing data (like talking with a friend), then we are on the right track.

To this end, we are currently in the design process for "Collecting Organic Speech." We started with many big and sometimes crazy ideas (accent trainers, karaoke dating apps, a necklace with a button that can submit last 15 seconds of audio), and narrowed in on a few ideas we want to explore. Our current thinking is that we want to connect individuals who use the site and have them speak to each other somehow. We also want this to be fun, so we will have prompts and perhaps games (e.g. "Draw Something," but with audio).

That said, the time horizon would be late 2018 at the earliest. Our current engineering focus is on making Common Voice multi-language, and also increasing engagement on the site.

additional metadata that could be useful: tag/labels for extraneous speech/noise, etc; device info; user demographics; headset/bluetooth

Good idea! We have a bug for this: https://github.com/mozilla/voice-web/issues/814

capture data in different acoustic environments (and when possible, capture metadata about the environment as well)

Right now we know browser, mobile vs. desktop, and sometimes OS. Is there any other metadata you'd like to see?

jtrmal commented 6 years ago

I think the main problem (at least IMO) is that you got the whole thing kind of backwards -- IMO you should start with a solid use case and then drive the corpus acquisition w.r.t. that use case. After that, you can start thinking of expanding -- language-wise, use-case-wise... For example, in what way weren't the LibriSpeech corpus and/or models sufficient, if you actually tested them? Or other English models/corpora? Why did you decide to go for English -- do you know there are a couple of solid AMs freely available, and scripts (and source corpora in some cases) for training them?

The use cases are very important, as they will drive the way you gather the corpus. Doing it the other way -- recording speech and hoping that some machine-learning magic will make it useful (which is/was my impression of the way you did it) -- can end in bitter disappointment. Lacking an evident use case and a certain design naivety was the reason the corpus does not look very useful.

For example, consider that metadata can be very useful, and it would be worth considering what metadata to record right in the early stages of corpus design. If this is done right, you can fairly cheaply (storage-wise, computation-wise...) provide specific (adapted) models for a given platform (mobile/desktop) or some other "slice" of the hw/sw ecosystem, or speaker-adapted models in the longer term. Yes, the metadata are useful.

Compared to this, whether it's recorded in mp3 or whether you can hear noises in the background is not something I would care too much about. It is better to record lossless, but I don't think it should be a pivot. (My personal opinion.)
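For what it's worth, Kaldi can read mp3 on the fly through a command pipe in wav.scp, so mp3 is workable in a recipe anyway. A sketch of such an entry (made-up utterance id, assuming sox was built with mp3 support):

    sample-000070 sox -t mp3 cv-valid-dev/sample-000070.mp3 -t wav -r 16000 -b 16 -c 1 - |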

y.


jtrmal commented 6 years ago

"Lacking an evident use case and a certain design naivety was the reason the corpus does not look very useful."

I should say: lacking an evident use case and a certain design naivety was the reason why I said the corpus does not look very useful. y.


galv commented 6 years ago

I'm not actively participating in this conversation, but I do want to comment that "shorten" is probably the best lossless audio codec I know of for speech, if you're concerned about the size of wav files.

By the way, @jtrmal, from my time at NIPS, there seems to be a trend that having more data is important for future directions of research (this work comes to mind: http://research.baidu.com/deep-learning-scaling-predictable-empirically/, i.e., loss appears to go down predictably with data size). A lot of people do think that larger datasets are important. It would be interesting to understand where Baidu's 10,000-hour dataset comes from. Last I heard (two years ago, mind!), they got a lot of it via Mechanical Turk jobs where people spoke sentences shown to them. If the large datasets are built from very similar data, this research direction might be less interesting than people are assuming.

Also, I remember that when I got started in speech recognition, there were no serious open datasets (maybe AMI or TEDLIUM?), HTK didn't release any recipes, and Kaldi was still hosted in an svn repo. So I do think there is educational value in an open dataset, though of course LibriSpeech already serves that purpose quite well.

Finally, for some reason many decision makers I've met seem to be irrationally opposed to buying corpora from the LDC. I agree the prices are not feasible for individuals, but they ($7k for 2000 hours for SWBD?) are fairly reasonable when contrasted with an institution's costs for engineers, scientists, and computer equipment. Of course, I'm not sure it's fair that Mozilla should be footing the bill by doing a lot of this work for free, either... (I'm watching their DeepSpeech repo; it is quite popular.)

I've said far more than I expected. Oh well.


johnjosephmorgan commented 6 years ago

In order to use the stages in the run.sh file, the utils/parse_options.sh file needs to be sourced.
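For example, the usual pattern at the top of a run.sh looks something like this (a sketch; everything except --stage and the standard cmd.sh/path.sh/parse_options.sh sourcing is illustrative):

    #!/usr/bin/env bash
    stage=0        # default; override with e.g. ./run.sh --stage 2
    nj=4

    . ./cmd.sh
    . ./path.sh
    . utils/parse_options.sh   # must be sourced after the defaults are set

    if [ $stage -le 1 ]; then
      echo "stage 1: data preparation"
    fi
    if [ $stage -le 2 ]; then
      echo "stage 2: feature extraction"
    fi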
