mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

STT not working correctly for the word at the end of the sentence #1645

Closed rajpuneet closed 5 years ago

rajpuneet commented 5 years ago

I am using version 0.2.0 of DeepSpeech. I have tried two ways of testing it:

  1. A person with an American accent spoke into the microphone
  2. Fed a .wav file as input (mono channel, 16 kHz)

What I have observed is that in both cases, the last word of the audio input is interpreted incorrectly by DeepSpeech. Some examples (expected → transcribed): jungle → juml, candy → candi, chocolate → chocolatt, guy → gody, specialty → specialti, apple → aple, news → new, dinner → vinner, near me → nearme, name → nme, and many more.

Next I gave a .wav file as input that contained one of these words in the middle of a sentence, repeating this for each word in a separate file. In each case the word in the middle was recognized properly, but the word at the end of the sentence again was not.

Does anybody have an idea what the problem could be and how it can be fixed?
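For anyone trying to reproduce this, here is a minimal sketch using the Python bindings of that era. The constructor constants are the documented defaults from the 0.x client examples, and the file paths are placeholders; the Model constructor arguments changed in later releases.

```python
# Minimal reproduction sketch, assuming the DeepSpeech 0.x Python API.
import scipy.io.wavfile as wav
from deepspeech import Model

N_FEATURES = 26   # MFCC features per frame (0.x client default)
N_CONTEXT = 9     # context frames on each side (0.x client default)
BEAM_WIDTH = 500  # decoder beam width (0.x client default)

ds = Model('models/output_graph.pbmm', N_FEATURES, N_CONTEXT,
           'models/alphabet.txt', BEAM_WIDTH)

fs, audio = wav.read('sentence.wav')  # must be mono, 16 kHz, 16-bit PCM
print(ds.stt(audio, fs))              # check whether the final word is mangled
```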

lissyx commented 5 years ago

@rajpuneet We had a bug in 0.2.0, can you retry with 0.3.0-alpha.1? Please make sure to use the data/lm/trie file and not the one from the 0.2.0 release; they are not compatible.

kdavis-mozilla commented 5 years ago

@rajpuneet Any results on this? If not, I have 0.3.0-alpha.1 set up and could test for you.

rajpuneet commented 5 years ago

@kdavis-mozilla, I haven't tried it yet. But yes, if you have it set up and could test it, that would be great. Let me know how it goes. Thanks.

kdavis-mozilla commented 5 years ago

@rajpuneet Could you send a link to the wav files you tested with? Then we can have a fair comparison.

rajpuneet commented 5 years ago

@kdavis-mozilla here are the files: Audio test files.zip

kdavis-mozilla commented 5 years ago

@rajpuneet How were the audio files generated?

rajpuneet commented 5 years ago

DeepSpeech 0.3.0 alpha.1 results.docx

These are my results with 0.3.0-alpha.1. The same problem is present. I have highlighted the errors in the file, and the correct word is written in parentheses next to each error.

kdavis-mozilla commented 5 years ago

@rajpuneet I can confirm I'm seeing the same problem too.

However, could you tell us how the audio was generated?

rajpuneet commented 5 years ago

Most of the files were generated using TTSApp, an open source application by eSpeak. But 2-3 files were found online. This last-word issue was not present in release 0.1.1 when we used the same files, but it is there in the newer releases.

rajpuneet commented 5 years ago

Other than this, I have also tested with an actual person with an American accent speaking into the microphone, and the same problem was there.

kdavis-mozilla commented 5 years ago

@rajpuneet Thanks! We're looking into the problem.

lissyx commented 5 years ago

@rajpuneet @kdavis-mozilla we should bisect from alpha builds that we have :-)

madhavajay commented 5 years ago

Hrm, I think I see this issue in v0.2.0 as well. Is it fixed in the v0.2.1 alphas? I was wondering why the quality of output was lower than 0.1.1 on obvious things like "yes" with the language model included. In v0.1.1 I get "yes", and with v0.2.0 I get output like "yesy".

kdavis-mozilla commented 5 years ago

@lissyx It wouldn't have been fixed in any v0.2.1 alphas.

@madhavajay 0.2.0 and 0.1.X are different enough that a direct comparison is a bit "apples and oranges".

Reuben and I talked a bit about this yesterday, and we think we have some leads on where the 0.2.X and 0.3.X-alphaY bug may lie.

lissyx commented 5 years ago

> @lissyx It wouldn't have been fixed in any v0.2.1 alphas.

No, but if we know it was good in 0.1.1, then we can still bisect from there, although it's likely coming from the model itself and not the inference code?

kdavis-mozilla commented 5 years ago

@lissyx We think it's from the BRNN to RNN switch along with the stride. Reuben started runs last night to test this hypothesis.
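To see why that switch would hurt the last word specifically: a bidirectional RNN gives every frame access to future (right) context, while a forward-only RNN gives the final frames none, and a stride greater than 1 further thins the frames near the end of the utterance. A toy illustration (the function below is purely hypothetical, not DeepSpeech code):

```python
# Toy illustration: future context available per (strided) timestep.
# In a forward-only RNN the last timesteps, which carry the last word,
# have little or no right context, unlike in a BRNN.
def right_context(num_frames: int, stride: int = 2) -> dict:
    """Map each strided timestep to the number of future frames it can see."""
    return {t: num_frames - 1 - t for t in range(0, num_frames, stride)}

print(right_context(10))
# {0: 9, 2: 7, 4: 5, 6: 3, 8: 1}  <- the tail of the utterance gets starved
```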

saikishor commented 5 years ago

These are my results with 0.3.0-alpha.1 (deepspeech_0.3.0_alpha_1_test_results.odt), using these audio_files.

The audio files have already been tested on DeepSpeech 0.1.1 and they all worked.

kdavis-mozilla commented 5 years ago

@saikishor From your tests I don't get the impression that the failures are correlated with the last word. Quite the contrary, the failures seem random.

saikishor commented 5 years ago

@kdavis-mozilla I agree with you; the last-word problem from 0.2.0 is not seen here.

rajpuneet commented 5 years ago

I tested the latest release, i.e. 0.3.0, with one of the audio files that were shared earlier, and I have attached it here again. If the lm and trie are not used, then the last word is interpreted correctly, as shown below (please ignore the options I set for lm and trie, as they were incorrect):

```
(venv) sranjeet@sranjeet-Precision-M4800:~/GM/DeepSpeech_0.3$ deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt --lm models/trie --audio ~/GM/deepspeech/it_was_his_heart_that_would_tell_him_where_his_treasure_was_hidden.wav
Loading model from file models/output_graph.pbmm
TensorFlow: v1.11.0-9-g97d851f
DeepSpeech: v0.3.0-0-gef6b5bd
2018-10-23 14:22:39.312461: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.0116s.
Running inference.
it was his heart that would tell him where his treasure was hidden
Inference took 2.578s for 3.889s audio file.
```

On using the lm and trie, the same issue was seen, as shown below:

```
(venv) sranjeet@sranjeet-Precision-M4800:~/GM/DeepSpeech_0.3$ deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt --lm models/lm.binary --trie models/trie --audio ~/GM/deepspeech/it_was_his_heart_that_would_tell_him_where_his_treasure_was_hidden.wav
Loading model from file models/output_graph.pbmm
TensorFlow: v1.11.0-9-g97d851f
DeepSpeech: v0.3.0-0-gef6b5bd
2018-10-23 14:24:11.550713: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.00925s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 17.9s.
Running inference.
it was his heart that would tell him where his treasure was hidded
Inference took 2.673s for 3.889s audio file.
```

it_was_his_heart_that_would_tell_him_where_his_treasure_was_hidden.wav.zip
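As an aside, the equivalent of the --lm/--trie flags in the Python bindings of that era was enableDecoderWithLM(). A hedged sketch, assuming the 0.2/0.3-era signature; the weight arguments and their count varied between releases, so treat the constants below as placeholders rather than the definitive API:

```python
# Hedged sketch: enabling the external KenLM scorer via the 0.x Python
# bindings. Paths are placeholders; adjust to your model directory.
from deepspeech import Model

LM_WEIGHT = 1.75               # language model weight (0.x client default)
VALID_WORD_COUNT_WEIGHT = 1.0  # word insertion bonus (0.x client default)

ds = Model('models/output_graph.pbmm', 26, 9, 'models/alphabet.txt', 500)
ds.enableDecoderWithLM('models/alphabet.txt', 'models/lm.binary',
                       'models/trie', LM_WEIGHT, VALID_WORD_COUNT_WEIGHT)
```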

reuben commented 5 years ago

@rajpuneet we're currently in the process of integrating a different CTC decoder implementation into our native clients since it offers a bunch of features we're interested in (confidence scores, character-based language models) and also doesn't exhibit these weird behaviors we've been seeing (words glued together, errors at end of sentence). The work is somewhat organized here: https://github.com/mozilla/DeepSpeech/projects/7

Target is late November/early December.
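For context on what the decoder does: CTC decoding collapses per-frame character probabilities into text by merging repeated symbols and dropping blanks; the LM-aware beam search in the new decoder builds on this rule. A minimal greedy sketch of just the collapse step (illustrative only, not the implementation tracked in the project board above):

```python
# Greedy CTC decode: take the argmax character per frame, merge repeats,
# drop blanks. The real decoder beam-searches with an external LM instead.
import numpy as np

ALPHABET = list(" abcdefghijklmnopqrstuvwxyz'")
BLANK = len(ALPHABET)  # the blank symbol conventionally gets the last index

def greedy_ctc_decode(logits: np.ndarray) -> str:
    """logits: (time, num_classes) per-frame scores, num_classes = BLANK + 1."""
    out, prev = [], BLANK
    for idx in np.argmax(logits, axis=1):
        if idx != prev and idx != BLANK:
            out.append(ALPHABET[idx])
        prev = idx
    return "".join(out)
```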

rajpuneet commented 5 years ago

@lissyx @reuben @kdavis-mozilla We are working on a project that needs this CTC decoder issue fixed earlier than that, so we were thinking of fixing it ourselves. But this raised another concern for us. I tried to fine-tune the 0.3.0 release model, and one of the parameters that had to be specified was '--decoder_library_path', where we specify the path to 'libctc_decoder_with_kenlm.so'. So, does that mean that if we come up with a different CTC decoder implementation ourselves, we will have to train the acoustic model from scratch?

reuben commented 5 years ago

@rajpuneet no, the decoder is only used for test epochs where we generate WER reports.
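For readers unfamiliar with those reports: word error rate is the word-level edit distance between reference and hypothesis, normalized by reference length. A minimal sketch:

```python
# Minimal word error rate (WER) sketch, as used in test-epoch reports:
# WER = (substitutions + insertions + deletions) / reference word count,
# computed via word-level Levenshtein distance.
def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("it was his heart", "it was his heartt"))  # 0.25
```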

rajpuneet commented 5 years ago

Ok, thanks @reuben!

reuben commented 5 years ago

@rajpuneet and FWIW, I was a bit too cautious with my initial estimate. You can already try out the new decoder by using the code in https://github.com/mozilla/DeepSpeech/pull/1679

rajpuneet commented 5 years ago

@reuben, can we just switch to the 'ctcdecode' branch and pull the latest code on top of release 0.3.0 to get all the commits?

reuben commented 5 years ago

@rajpuneet yes, just using that branch should be enough; it's based on a recent master.

rajpuneet commented 5 years ago

Thanks, @reuben.

sranjeet81 commented 5 years ago

@reuben I tried building the master branch today, which includes the CTC decoder change, but found the inference results to be completely garbled. Below are a couple of example inference runs; each utterance is the name of its wav file, and the results are way off. Am I missing something in building and integrating the new DeepSpeech with the updated ctcdecode?

```
nvidia@tegra-ubuntu:~/deepspeech$ deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt --lm models/lm.binary --trie models/trie.ctcdecode --audio wav/it_was_his_heart_that_would_tell_him_where_his_treasure_was_hidden.wav
Loading model from file models/output_graph.pbmm
TensorFlow: v1.6.0-rc1-1453-g8f1e480
DeepSpeech: v0.3.0-0-gef6b5bd
2018-10-30 22:04:56.303180: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-10-30 22:04:56.303304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 3.18GiB
2018-10-30 22:04:56.303368: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-10-30 22:04:57.655171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-30 22:04:57.655245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0
2018-10-30 22:04:57.655272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N
2018-10-30 22:04:57.655417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2687 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Loaded model in 1.92s.
Loading language model from files models/lm.binary models/trie.ctcdecode
Loaded language model in 1.18s.
Running inference.
te avy tli sil h la niss pu x zeg g hrar cior zmai azan gaym tc scs ij mel hajun yl gr xau hao tylo pr lae a giv ums mq lah knt p sob g ijs ba keg n dufy yay vj lik h r jiz f my we eym ogu mvs
Inference took 4.900s for 3.889s audio file.
```

```
nvidia@tegra-ubuntu:~/deepspeech$ deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt --lm models/lm.binary --trie models/trie.ctcdecode --audio wav/can_you_tell_me_what_time_is_it.wav
Loading model from file models/output_graph.pbmm
TensorFlow: v1.6.0-rc1-1453-g8f1e480
DeepSpeech: v0.3.0-0-gef6b5bd
2018-10-30 22:05:28.978054: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-10-30 22:05:28.978199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 2.90GiB
2018-10-30 22:05:28.978248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-10-30 22:05:29.622354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-30 22:05:29.622448: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0
2018-10-30 22:05:29.622475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N
2018-10-30 22:05:29.622670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2532 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Loaded model in 1.14s.
Loading language model from files models/lm.binary models/trie.ctcdecode
Loaded language model in 0.455s.
Running inference.
te avy tli sil h la nissi kile orh vez oz ono eh une tls rs oa nik ven k pq o' c k dm d tl x nef bi q lim bap p af ehr nuw t ic kae g gi ah sb ux hh s vs j soy e lk lo by r gn lyoh soffi cake cafy gn iz
```

reuben commented 5 years ago

@sranjeet81 you need to re-export the model with the code on master. Or you can get the re-exported model from here: https://github.com/reuben/DeepSpeech/releases/tag/v0.2.0-prod-ctcdecode

sranjeet81 commented 5 years ago

@reuben Thanks. Now I'm getting the errors below when running inference. Could it be because of the TensorFlow version used (TensorFlow: v1.6.0-rc1-1453-g8f1e480)?

```
2018-10-30 22:15:46.621858: E tensorflow/core/common_runtime/executor.cc:644] Executor failed to create kernel. Invalid argument: NodeDef mentions attr 'Truncate' not in Op<name=Cast; signature=x:SrcT -> y:DstT; attr=SrcT:type; attr=DstT:type>; NodeDef: lstm_fused_cell/ToInt64 = Cast[DstT=DT_INT64, SrcT=DT_INT32, Truncate=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"]. (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
	 [[Node: lstm_fused_cell/ToInt64 = Cast[DstT=DT_INT64, SrcT=DT_INT32, Truncate=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]]
Error running session: Invalid argument: NodeDef mentions attr 'Truncate' not in Op<name=Cast; signature=x:SrcT -> y:DstT; attr=SrcT:type; attr=DstT:type>; NodeDef: lstm_fused_cell/ToInt64 = Cast[DstT=DT_INT64, SrcT=DT_INT32, Truncate=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"]. (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
```

reuben commented 5 years ago

@sranjeet81 yes, I think so. master is currently built against TensorFlow r1.11.

reuben commented 5 years ago

@sranjeet81 FWIW you can grab binaries from here: https://tools.taskcluster.net/task-group-inspector/#/FyclewklSUqN6FXHavrhKQ

Click the job for your architecture, then click Run Artifacts and grab native_client.tar.xz.

sranjeet81 commented 5 years ago

@reuben Your binaries for ARM64 worked for me on the Jetson TX2. I will try to build TensorFlow v1.11 and give it a try on the GPU. Results look positive for this issue. I will keep you posted with more inference results tomorrow. Thanks.

reuben commented 5 years ago

@sranjeet81 great! Thanks for testing.

kdavis-mozilla commented 5 years ago

@sranjeet81 Just out of curiosity what use case do you have in mind when targeting the Jetson TX2?

saikishor commented 5 years ago

@kdavis-mozilla people are interested in the TX2 because it has much more computational power to handle DeepSpeech. On other platforms the inference completes, but it takes so long that it isn't usable in real projects. So most of us want DeepSpeech running with less inference time; at least for me, that is the reason. Currently we get about 4 seconds of inference time for 4 seconds of audio, which is OK, but something is better than nothing.

kdavis-mozilla commented 5 years ago

@saikishor @sranjeet81 If processing power is the only selection criteria, then why not a laptop, desktop, or server?

I'd assume there are other factors at play. For example, the TX2 is the on-board computer for a robot or the TX2 is the on-board computer for a car or any number of other selection criteria.

saikishor commented 5 years ago

@kdavis-mozilla I agree with you! If you look at the CPU load while running DeepSpeech, it consumes so many resources that other programs on the robot (or whatever the application is) may not have enough left to run. And adding a separate on-board computer just for speech recognition or other deep learning work takes up a lot of space and is unnecessary. The TX2, by contrast, is both compact and powerful enough for these applications. That is my take on it.

daanzu commented 5 years ago

In brief testing, this issue appears to be fixed, although I usually test through Python bindings rather than native_client executable. The new CTC decoder appears much faster too.

lissyx commented 5 years ago

> In brief testing, this issue appears to be fixed, although I usually test through Python bindings rather than native_client executable. The new CTC decoder appears much faster too.

There are also Python bindings, so you can check with them.

reuben commented 5 years ago

Thanks for checking! I'm gonna close this issue but feel free to reopen if you think this is still a problem.

rajpuneet commented 5 years ago

> @sranjeet81 you need to re-export the model with the code on master.

@reuben, how do we re-export the model? Because for the 0.4.0 pre-release you haven't provided any models.

reuben commented 5 years ago

```
python DeepSpeech.py --notrain --notest --checkpoint_dir /path/to/checkpoint --export_dir /path/to/export
```

Or if you just want our v0.3 model re-exported to work with the new decoder, grab it from here: https://github.com/reuben/DeepSpeech/releases/tag/v0.2.0-prod-ctcdecode

rajpuneet commented 5 years ago

thanks @reuben, that worked

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.