mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Feature Request: Free captcha service to improve verification of machine learning content #2089

Open sandreas opened 5 years ago

sandreas commented 5 years ago

There is already an excellent web application for collecting new language data. Would it be possible to improve the quality of verification tasks by implementing a free captcha service like reCAPTCHA? In my opinion, Google relies heavily on the "user feedback" from reCAPTCHA to improve its machine learning tasks.

I would prefer using an open source product instead of reCAPTCHA on my web pages. The whole app should be localized, so that every language could be improved.

I thought about the following possibilities:

Problems:

What do you think? Is it worth the effort?

lissyx commented 5 years ago

Well, this is an idea a few of us already had years ago now, but, personally, I'm asking myself:

dabinat commented 5 years ago

If the audio hasn’t yet been validated, how would it know the transcription the end-user entered was correct? It has the original sentence but it would have to assume the speaker uttered that sentence perfectly.

sandreas commented 5 years ago

@lissyx

I'm a bit confused about the pictures one, I don't get the idea

Well, this would be a kind of dictionary verification / enlargement. Since a dictionary contains lots of words and many of them are language-specific, it could be useful to build dictionaries of words, render them as pictures, and let users select the words that are common in their language in order to verify the dictionaries.

@dabinat

If the audio hasn’t yet been validated, how would it know the transcription the end-user entered was correct?

This is a good question... it is not possible. But since reCAPTCHA also uses a "NEXT" and a "VERIFY" button, it would be possible to have the user qualify 1 or 2 samples that are not validated yet and then finish with a validated one from the database.
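The flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from Common Voice or reCAPTCHA: the `Clip` type and the `build_challenge` and `grade` functions are hypothetical names, and the user's answer is assumed to be a yes/no judgement on whether a clip matches its sentence.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clip:
    clip_id: str
    sentence: str
    # True/False once validated, None while still unvalidated (assumption).
    matches_sentence: Optional[bool] = None


def build_challenge(unvalidated, validated, n_unknown=2, rng=random):
    """Pick n_unknown unvalidated clips, followed by one validated control clip."""
    unknown = rng.sample(unvalidated, n_unknown)
    control = rng.choice(validated)
    return unknown + [control]


def grade(challenge, answers):
    """Return (passed, votes).

    Only the final, validated clip decides pass/fail. Answers for the
    unvalidated clips are returned as votes to be stored, not graded.
    """
    control = challenge[-1]
    passed = answers[control.clip_id] == control.matches_sentence
    votes = {c.clip_id: answers[c.clip_id] for c in challenge[:-1]}
    return passed, votes
```

The control clip is kept last here only to mirror the description; a real service would probably shuffle its position so that it cannot be guessed.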

Well, the idea was new to me... but if it has already been discussed, it might not be worth the effort. Thank you for your feedback!

lissyx commented 5 years ago

Well, the idea was new to me... but if it has already been discussed, it might not be worth the effort. Thank you for your feedback!

Well, I remember throwing out the idea, but it was not "discussed" in the sense that we made a decision. And you're not the first one to suggest it, so with more polish it's likely not a bad idea!

sandreas commented 5 years ago

Since there is no further feedback and no final solution, would you like me to close this ticket? Or should it stay open for further investigation?

nukeador commented 5 years ago

We can continue the discussion on Common Voice discourse, since we are talking about data collection:

https://discourse.mozilla.org/t/making-an-open-source-captcha-from-common-voice/42437

lissyx commented 5 years ago

We can continue the discussion on Common Voice discourse, since we are talking about data collection:

https://discourse.mozilla.org/t/making-an-open-source-captcha-from-common-voice/42437

That's right, let's keep the discussion on Discourse and keep the bug open, since there are indeed several people interested.

ssokolow commented 4 years ago

From what I remember, the original reCAPTCHA would:

  1. Present you with a known and an unknown word to determine whether you got it right
  2. Farm the same word out to a bunch of people to up the confidence that the result was correct

The current reCAPTCHA would do something similar to get sufficient confidence in the correctness of the input.
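The second step, farming the same item out to several people, could be sketched as a simple agreement check. This is a minimal sketch from memory of the general idea, not reCAPTCHA's actual algorithm; the `consensus` function and its `min_votes` and `min_agreement` thresholds are hypothetical, and it assumes answers are only collected from users who got the known word right.

```python
from collections import Counter


def consensus(answers, min_votes=3, min_agreement=0.8):
    """Return the agreed transcription, or None if confidence is too low.

    answers: transcriptions of the same unknown word, collected only from
    users who got the known (control) word right.
    """
    if len(answers) < min_votes:
        return None
    # Ignore case and surrounding whitespace when comparing answers.
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    if count / len(normalized) >= min_agreement:
        return top
    return None
```

Until `consensus` returns a value, the word would simply keep being shown to more users.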