mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Inferencing in real-time #847

Closed · abuvaneswari closed this issue 6 years ago

abuvaneswari commented 7 years ago

Is there streaming server & client code that does the following?

(a) On the client side, continuously generate PCM samples from the mic connected to the PC, send the samples to the server every, say, 100 ms, and print out the transcripts from the server as they arrive (I am thinking of a client similar to the Google Speech API streaming Python client example: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/cloud-client/transcribe_streaming_mic.py )

(b) On the server side, respond to the client's samples by waiting for enough samples to build up, invoke deepspeech, send the transcript back to the client, and do this continuously as well

thanks, Buvana
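
For readers looking for a starting point, below is a minimal sketch of part (a) only. It is not part of DeepSpeech; the server host, port, and raw-socket framing are assumptions, and a real setup would also need a matching server that buffers the chunks, runs inference, and streams transcripts back.

```python
# Hypothetical streaming client: capture 100 ms PCM chunks from the microphone
# and push them to an assumed transcription server over a plain TCP socket.
# Assumes 16 kHz, 16-bit mono audio; HOST/PORT are placeholders.
import socket

import pyaudio

RATE = 16000                    # samples per second expected by the model
CHUNK = RATE // 10              # 100 ms of audio per read
HOST, PORT = "localhost", 9000  # hypothetical streaming server

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)

with socket.create_connection((HOST, PORT)) as sock:
    try:
        while True:
            pcm = stream.read(CHUNK, exception_on_overflow=False)
            sock.sendall(pcm)  # the server is assumed to buffer chunks and reply with transcripts
    except KeyboardInterrupt:
        pass
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()
```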

lissyx commented 7 years ago

Not yet, but this is something I have in mind for demo purposes.

elpimous commented 7 years ago

Perhaps you could have a look at ReSpeaker on GitHub for the recording part. They work on a specific mic array and on different recording types.

LearnedVector commented 7 years ago

@lissyx would there have to be any neural network architectural changes to support real-time inferencing, or would it just be a client/server engineering challenge?

lissyx commented 7 years ago

It all depends on what exactly you want to achieve. "Real" streaming would require changes to the network for sure; the bidirectional recurrent layers force us to send "complete" data.

LearnedVector commented 7 years ago

@lissyx so with the current network, does sending chunks of audio data at a time make sense? From intuition I can see that it might inflate the WER a little, due to the language model not having the complete sequence of terms.

reuben commented 7 years ago

The WER increase will not just come from the language model, but also from the network itself, since it depends on having the entire utterance available. Performance will depend on your training data: if you only train with long utterances (several seconds) and then try to do inference with chunks of one second each, it'll probably perform very poorly.

LearnedVector commented 7 years ago

@reuben that makes sense. Thanks for the clarification.

abuvaneswari commented 7 years ago

Does DeepSpeech (and it's a feature of CTC, I suppose) require that the incoming features be fed at word boundaries? What if I construct an online moving-window MFCC calculator and feed in the features without regard to word boundaries? Let us say that my window length is long enough to accommodate 5-6 grams; the first and last gram may be partial because the segmentation is done without regard to word boundaries. Can such a design still infer words?
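
For illustration, here is a rough sketch of the moving-window feature extraction described above, using python_speech_features (MFCCs of the kind DeepSpeech consumed at the time). The window and hop sizes are arbitrary assumptions, not the engine's configuration, and, as noted in the next comment, the model itself still expects word-aligned input.

```python
# Sketch of an "online" moving-window MFCC calculator over raw PCM.
# Window/hop lengths are illustrative assumptions only.
import numpy as np
from python_speech_features import mfcc


def window_features(pcm_int16, samplerate=16000, window_s=2.0, hop_s=0.5):
    """Yield MFCC features for overlapping windows of a raw int16 PCM signal."""
    win = int(window_s * samplerate)
    hop = int(hop_s * samplerate)
    for start in range(0, max(1, len(pcm_int16) - win + 1), hop):
        segment = pcm_int16[start:start + win]
        # 26 cepstral coefficients, matching the feature count DeepSpeech used then
        yield mfcc(segment, samplerate=samplerate, numcep=26, nfilt=26)
```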

kdavis-mozilla commented 7 years ago

We assume incoming features are fed at word boundaries. Performance is further improved if they are at sentence boundaries, due to the language model being trained on sentences.

alanbekker commented 6 years ago

So what would be the recommended steps (in terms of training data and network topology) in order to build a speech recognition streaming service?

MainRo commented 6 years ago

I started to write a server to test inference on a generated model. It is available here: https://github.com/MainRo/deepspeech-server

This is a very first implementation that listens for HTTP POST requests. I plan to add support for websockets, provide a sample web app to try it from a browser, and package it in a Docker container.

@lissyx : as already discussed, tell me if you are interested in such a project.

MainRo commented 6 years ago

Update: I just published a Dockerfile to easily start the server. I tested it with deepspeech-0.1.0 and the published pre-trained model. See here: https://github.com/MainRo/docker-deepspeech-server
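
As a rough usage sketch (not an authoritative reference for the project): posting a 16 kHz mono WAV to the running container might look like the following. The /stt route and port 8000 are assumptions about the project's defaults at the time; check the linked repositories for the actual endpoint.

```python
# Assumed endpoint: POST raw WAV bytes to http://localhost:8000/stt
# and read the transcript from the response body.
import requests

with open("test.wav", "rb") as f:
    audio = f.read()

resp = requests.post(
    "http://localhost:8000/stt",  # assumption, verify against the server's README
    data=audio,
    headers={"Content-Type": "application/octet-stream"},
)
print(resp.text)  # transcript returned by the server
```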

ashwan1 commented 6 years ago

I wrote a Django-based web app (https://github.com/sci472bmt/django-deepspeech-server) that can record sound from the browser and return its transcription. It's at a very early stage. I am planning to make it websocket-based and compatible with the Google Speech API, so that I don't have to change much in my other projects apart from changing the socket URL. I'll try to take it to real-time transcription as soon as possible.

alanbekker commented 6 years ago

Can you please elaborate a bit more on how we can do speech-to-text in real time using a bidirectional RNN? As far as I see it, we need to wait for the end of the speech in order to begin the decoding... maybe I misunderstand something; I'll be happy to be corrected.

ashwan1 commented 6 years ago

@alanbekker I have 2 plans. I am currently doing research as to what will work best:

  1. This is more like pseudo real time (a rough sketch follows below). I'll train my model with small utterances and stream wav files to the server. Each wav file will contain voice from starting time t1. As the speaker continues speaking, the wav file size streamed to the server will keep increasing, and I think a well-trained deepspeech server can return approximately correct transcriptions. At the end the server will receive the full sentence as audio, which it can transcribe. This should work at least for short audio (say, 5 sec). Moreover, we can couple this with silence detection and mark non-changing transcriptions as final to optimize performance.
  2. Another idea is to use one-pass decoding with an RNNLM, as mentioned in this paper: https://pdfs.semanticscholar.org/8ad4/4f5161ad04c71fe052582168bd7a45217d36.pdf
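
A minimal sketch of plan 1's loop, assuming a generic `transcribe(pcm_bytes)` callable standing in for whatever DeepSpeech call the server wraps (not a real DeepSpeech API name) and a `read_chunk()` source of raw 16-bit PCM:

```python
# Pseudo real-time loop: keep appending audio to a buffer, re-run recognition
# on the whole buffer periodically, and "finalize" once the transcript stops
# changing (a crude stand-in for silence detection).
import time


def pseudo_streaming(read_chunk, transcribe, interval=0.5, stable_rounds=3):
    buffer = bytearray()
    last, unchanged = "", 0
    while unchanged < stable_rounds:
        buffer.extend(read_chunk())           # e.g. 100 ms of 16-bit PCM
        transcript = transcribe(bytes(buffer))
        if transcript == last:
            unchanged += 1                    # transcript is stabilizing
        else:
            last, unchanged = transcript, 0
        print("partial:", transcript)
        time.sleep(interval)
    return last                               # treat as the final transcript
```
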
alanbekker commented 6 years ago

With respect to (1), you assume you can train your model on small utterances... but in order to do so you will need to align the small utterances with the corresponding transcriptions (more labeled data is needed). Am I wrong?

Could you please explain how (2) is another alternative to (1)?

Thanks!

ashwan1 commented 6 years ago

You are correct that more labeled data is needed for plan 1. Thus I am looking into the possibility of using subtitles. But I would also like to see the distributed model's performance in such a scenario. 2 is not exactly an alternative to 1; it will require changing the network structure, so it's just another thing I will be trying. But the research for better alternatives continues... Ultimately, we need to make this real time (more or less like the Google Speech API). Anything that works efficiently will do.

AMairesse commented 6 years ago

@ashwan1 for 2 you could also try https://github.com/inikdom/rnn-speech. Performance is not that good because it still lacks a language model, but inference is mono-directional, so you could easily build a stateful real-time transcription layer in order to "see" the transcription evolve while receiving the audio.

ashwan1 commented 6 years ago

Thanks a lot for the suggestion :) I will definitely try that.

AMairesse commented 6 years ago

@ashwan1 you may also check https://github.com/SeanNaren/deepspeech.pytorch. Performance is way better, even without a language model. The pre-trained networks are bidirectional, but there's support for a unidirectional mode as in the DeepSpeech 2 paper.

jenniferzhu commented 6 years ago

@lissyx A follow-up question: how can we use this model to transcribe long videos? A 5-second clip is too short...

lissyx commented 6 years ago

@jenniferzhu How long? The best way until #1275 is done would be to cut the audio on silences.
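
For example, a minimal sketch of "cut the audio on silences" using pydub (not part of DeepSpeech); the thresholds are assumptions to tune, and each resulting chunk can then be fed to the model separately:

```python
# Split a long recording on silence and export model-ready 16 kHz mono chunks.
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("long_recording.wav")
chunks = split_on_silence(
    audio,
    min_silence_len=700,              # silence must last at least 700 ms to split
    silence_thresh=audio.dBFS - 16,   # anything 16 dB below the average loudness counts as silence
    keep_silence=200,                 # keep a little padding around each chunk
)
for i, chunk in enumerate(chunks):
    chunk.set_frame_rate(16000).set_channels(1).export(f"chunk_{i:03d}.wav", format="wav")
```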

kausthub commented 6 years ago

@lissyx is it possible for me to test #1275 on my local machine? I really want to build an app which does speech-to-text in "real" time. Can you please also suggest the steps I need to follow to set up this branch and run it?

Note: I have already set up DeepSpeech and understand the main components involved in training, testing and running it. Interested in making this "real-time".

Thanks in advance

lissyx commented 6 years ago

@kausthub Just checkout the streaming-inference branch and build with it.

kausthub commented 6 years ago

I couldn't find the steps to build. Sorry if this is a beginner-level question.

Thanks in advance

lissyx commented 6 years ago

@kausthub It's in native_client/README.md: https://github.com/mozilla/DeepSpeech/blob/streaming-inference/native_client/README.md

kausthub commented 6 years ago

Thanks, will check that out.

lissyx commented 6 years ago

Nothing more to do here.

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.