Closed BradNeuberg closed 5 years ago
@kdavis-mozilla @lissyx
Thank you for filling out the issue template :)
This misbehavior seems to happen only when the acoustic model is not doing a very good job. I agree the decoder should not degrade to that level. I haven't had the chance to debug this issue other than tweaking the decoder hyperparameters to try to alleviate the problem. I'll take a closer look.
Yes. In some cases when the acoustic model is not performing well, the decoder falls into this weird state of gluing words together. I'm hoping it can be fixed by tweaking the beam search implementation.
What assumptions does the acoustic model make (i.e., what are the distribution and characteristics of the audio training data)? The audio I provided sounds pretty clear IMHO, but perhaps the audio training data doesn't have enough diversity to help the deep net generalize (i.e., the deep net is essentially overfitting to the training data and isn't generalizing well).
@reuben Any further progress on this by any chance?
Facing the same issue. Any progress or way out to improve the performance?
How can I train my model without the language model?
@nyro22 Training never involves the language model. Computing WERs, however, does.
I am facing the same issue on a rather similar configuration to the one described above. Was there any progress on this? Thanks!
facing the same problem.....
/data/home/DeepSpeech# /data/home/DeepSpeech/deepspeech phoneme_output_graph.pb phoneme.txt A2_1.wav
TensorFlow: v1.6.0-11-g7554dd8
DeepSpeech: v0.1.1-48-g31c01db
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-05-03 10:27:27.750965: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-05-03 10:27:28.111299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties: name: Tesla M40 24GB major: 5 minor: 2 memoryClockRate(GHz): 1.112 pciBusID: 0000:02:00.0 totalMemory: 23.90GiB freeMemory: 22.71GiB
2018-05-03 10:27:28.111338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-05-03 10:27:28.318726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22058 MB memory) -> physical GPU (device: 0, name: Tesla M40 24GB, pci bus id: 0000:02:00.0, compute capability: 5.2)
Hi @bolt163, it seems that your problem is somewhere else... Could you provide the expected transcript of this wav?
Having the same issues. If I modify the beam search algorithm myself, what would be the steps to recompile with the updated beam search?
@reuben I am also facing the same issue. Any suggestions?
Same here. Also, seems to be happening when using out-of-vocabulary terms.
Probably the bug is somewhere in this function:
It seems that the problem is that sequences with out-of-vocabulary words receive a higher score without spaces than with spaces.
Does the beam search use length normalization? According to Andrew Ng, it improves beam search by reducing the penalty for outputting sentences with a higher number of words. Andrew talks about it in this video.
EDIT: I just realized that word_count_weight is performing this role.
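For anyone following along, here is a rough sketch of the two ideas; the exact form of the combined score and the function names are my own assumptions for illustration, not the actual DeepSpeech decoder code:

#include <cmath>
#include <cstddef>

// (a) Word-insertion bonus, roughly the role word_count_weight plays: each
// completed word adds a constant bonus to the hypothesis score.
double combined_score(double am_score, double lm_score, std::size_t word_count,
                      double lm_weight, double word_count_weight) {
  return am_score + lm_weight * lm_score + word_count_weight * word_count;
}

// (b) Length normalization as Andrew Ng describes it: divide the accumulated
// log-probability by length^alpha (alpha in [0, 1]) so hypotheses are not
// penalized merely for containing more terms.
double length_normalized(double total_log_prob, std::size_t length,
                         double alpha) {
  return total_log_prob / std::pow(static_cast<double>(length), alpha);
}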
@GeorgeFedoseev I've been trying to debug the part of the code you pointed to, and I noticed some weird behavior. Below, I've printed the scores for, respectively: (1) a common word from my corpus; (2) a rare word from the corpus; (3) an invalid/out-of-vocabulary word; (4) the value of the variable oov_score_. I am printing the scores at several different states:
'não': -2.9632
'informação': -4.43594
'fdfdg': -5.15036
oov_score: -4.70759
------------------------------
'não': -2.97739
'informação': -4.45013
'fdfdg': -5.16455
oov_score: -4.70759
------------------------------
'não': -2.88466
'informação': -5.84782
'fdfdg': -6.56224
oov_score: -4.70759
Notice that the oov_score_ is not the same as the invalid word's score, and in some cases it is even higher than a valid word's score. I tried to add the following lines to the code:
// Recompute the OOV score from the current decoder state instead of the
// precomputed null-context value:
Model::State out;
oov_score_ = model_.FullScore(from_state.model_state, model_.GetVocabulary().NotFound(), out).prob;
and now the score of the invalid word and the variable appear similar. When testing on my examples, it is not enough to solve the problem, but it certainly reduced the gluing together of words.
PS: words are from my pt-br language corpus
@bernardohenz I think you printed scores for (1), (2) and (3) that depend on state, but oov_score_ does not depend on state (in master), and you cannot compare them. If you print (3) with model.NullContextState(), wouldn't it be the same as oov_score_?
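To make the state dependence concrete, here is a minimal KenLM-only sketch (the LM path and the example word are placeholders; this is not DeepSpeech code, just raw KenLM calls):

#include "lm/model.hh"

int main() {
  lm::ngram::Model model("lm.binary");  // placeholder path
  const auto &vocab = model.GetVocabulary();

  // Score <unk> after some context: the result depends on the state reached.
  lm::ngram::State ctx = model.BeginSentenceState(), mid, out;
  model.FullScore(ctx, vocab.Index("não"), mid);  // build up some context
  float oov_in_context = model.FullScore(mid, vocab.NotFound(), out).prob;

  // Score <unk> from the null context: this is what the precomputed
  // oov_score_ on master corresponds to.
  float oov_null_context =
      model.FullScore(model.NullContextState(), vocab.NotFound(), out).prob;

  // oov_in_context and oov_null_context will generally differ, which is why
  // the per-state prints above cannot be compared with oov_score_ directly.
  (void)oov_in_context;
  (void)oov_null_context;
  return 0;
}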
@GeorgeFedoseev yes, it is true. But why wouldn't oov_score depend on the state? I think it makes sense to compute the oov_score for each state. What do you think?
@bernardohenz as I understand the code: when the construction of a word is not finished yet (the if (!alphabet_.IsSpace(to_label)) part), to tell beam search that it's going in the right direction, we add the minimum unigram score of the words that this search can lead to. And this minimum unigram score is precomputed without state (with model.NullContextState()) and saved in the trie file. To get this score dynamically depending on state, you would need, for each prefix, to find all possible words it can lead to and select the minimum score (which is probably very slow).
So oov_score_ doesn't depend on state because we are comparing OOV branches of the beam search with in-vocabulary branches, which are scored using scores from the trie file (and those scores don't depend on state).
But the problem with assigning such a minimum unigram score (or oov_score_) is that, during beam search, the algorithm prefers to concatenate lots of characters rather than choosing a space and finishing a low-probability word (such as my (2) example).
One idea that occurred to me is to penalize longer words, to avoid cases where the algorithm tries to concatenate more than 3 words together without a space.
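A very rough sketch of that penalty idea (my own guess at how it could be wired in, not existing DeepSpeech code):

#include <cstddef>

// Subtract a penalty that grows with the number of characters emitted since
// the last space, so beams that never close a word gradually fall behind.
double apply_long_word_penalty(double score, std::size_t chars_since_space,
                               std::size_t max_word_len, double penalty) {
  if (chars_since_space > max_word_len) {
    score -= penalty * static_cast<double>(chars_since_space - max_word_len);
  }
  return score;
}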
Replacing
oov_score_ = model_.FullScore(model_.NullContextState(), model_.GetVocabulary().NotFound(), out).prob;
with
oov_score_ = -1000.00;
helped. Did I just raise another error?
In fact I created another variable (oov_score_2) to compute this value (oov_score_ can't be modified inside the function).
And I don't know if it is a good idea to set oov_score_ = -1000.00;, since this value is used while you are composing the word (char by char). The point of 'correcting' the oov_score_ is to keep the algorithm from simply deciding to glue all the characters together (without a space char).
@bernardohenz I think that in that part (the if (!alphabet_.IsSpace(to_label)) branch) the requirement should just be that an OOV word gets a score lower than any vocabulary word.
Try increasing word_count_weight_ from the default 1 to something like 3.5. This resulted in less concatenation for me and decreased my WER by 3-4%.
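A back-of-the-envelope illustration of why raising the bonus helps (all numbers are made up, only the mechanics matter): the spaced hypothesis collects the word-count bonus twice, so a larger weight widens its advantage over the glued one:

#include <cstdio>

int main() {
  // Hypothetical LM log-scores for two competing beams.
  const double glued_lm = -9.0;    // "informaçãofdfdg"  (1 word)
  const double spaced_lm = -10.2;  // "informação fdfdg" (2 words)
  for (double w : {1.0, 3.5}) {
    std::printf("w=%.1f  glued=%.2f  spaced=%.2f\n",
                w, glued_lm + 1 * w, spaced_lm + 2 * w);
  }
  // w=1.0: glued wins (-8.00 vs -8.20); w=3.5: spaced wins (-3.20 vs -5.50).
  return 0;
}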
I've implemented length normalization (word_count_weight was only a gross approximation) as well as switch to a fixed OOV score (which was in a TODO list for a long time) as part of the streaming changes, which will be merged soon for our next release. When we have binaries available for testing I'll comment here so anyone interested can test if it improves the decoder behavior on the cases described here. Thanks a lot for the investigation and suggestions, @bernardohenz, @GeorgeFedoseev and @titardrew!
@BradNeuberg @reuben is this issue closed? I'm running the 0.2.0va7 version of DeepSpeech (with ldc93s1 and a new wav file), and the result (like hiieieddiitwenty) doesn't match the language model. If there is any tweak to force it to respect the model, I'm buying it, even if it is time consuming.
Hi @reuben, I am also seeing this problem in the master branch. Could you provide a patch with your implementation to deal with it?
Hi @reuben when will you have the binaries available?
+1😉
They will be available with our next release, v0.2, when it is ready :)
Hi @reuben Any update on these binaries? I too would like to test their impact on decoder behavior.
We're currently training a model for the v0.2 release. Send me an email at {my github username} at mozilla.com and I'll give you access to a preliminary trained model so you can test the code changes.
If you have your own model and just want the binaries, they're available here: https://tools.taskcluster.net/groups/ClsXrFSbTJ6uUkEAPqFG8A
The Python and Node packages are also available, just specify version 0.2.0-alpha.9
@reuben Dropped a mail to you !!
I need the new decoder library (the .so binary for Linux x64); how can I download it from the URL given by @reuben? I am a bit lost on that webpage.
When I click on DeepSpeech Linux AMD64 CPU, then on artifacts, and then download public/native_client.tar.xz, I don't see any changes in my decoded output when using this .so library compared to the current one. There are still only one or two words followed by a very, very long one without spaces, despite ensuring that my model frequently outputs white spaces, and the beam and greedy decoding output looks fine.
Just tested the 0.2.0 release (deepspeech and models); I still get long words outside the English vocabulary.
This example is a phone call recording (one channel out of two). Speech-to-text works well for the first sentence (a pre-recorded welcome message). The rest is part of a real conversation, where speech-to-text doesn't work properly.
The command and outputs are:
(deepspeech-venv) jonathan@ubuntu:~$ deepspeech --model ~/deepspeech-0.2.0-models/models/output_graph.pb --audio ~/audio/C2AICXLGB3D2SMK4WPZF26KEZTRUA6OYR1.wav --alphabet ~/deepspeech-0.2.0-models/models/alphabet.txt --lm ~/deepspeech-0.2.0-models/models/lm.binary --trie ~/deepspeech-0.2.0-models/models/trie
Loading model from file /home/jonathan/deepspeech-0.2.0-models/models/output_graph.pb
TensorFlow: v1.6.0-18-g5021473
DeepSpeech: v0.2.0-0-g009f9b6
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-09-20 11:02:49.456955: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.134s.
Loading language model from files /home/jonathan/deepspeech-0.2.0-models/models/lm.binary /home/jonathan/deepspeech-0.2.0-models/models/trie
Loaded language model in 3.85s.
Running inference.
thank you for calling national storage your call may be recorded for coaching and quality the poses place let us not an if ye prefer we didn't record your colt to day in wall constrashionalshordistigwisjemaigay am so it just so he put in your code held everything in a disbosmygriparsesnwygorighticame so she's not like um that's all good if you won't care if i can just reserve something from my end over the foreign am i can reserve at the same on mine price you will looking out as well um which sent a and a unit is he looking out without which location for it an put it by sereerkapcoolofmijustrynorfrommians or a man we after the ground floor on the upper of a at
Inference took 33.947s for 58.674s audio file.
The audio can be found from https://s3.us-east-2.amazonaws.com/fonedynamicsuseast2/C2AICXLGB3D2SMK4WPZF26KEZTRUA6OYR1.wav
@zhao-xin I'm facing the exact same problem. I am working with call recording. Were you able to fix this?
@sunil3590 I feel this is not an engineering issue. The acoustic model is not trained on phone call conversations, and neither is the language model, am I right?
We plan to collect our own data to fine-tune DeepSpeech models so they can be used in the real world.
Are there any updates on this? I still have this issue, and I am pretty sure it's not the model's fault, since with normal decoding (greedy or plain beam search) I never get these very long words.
This is a big problem for me, since those long words obviously mess up the evaluation, but a language model would be necessary to get acceptable performance.
@reuben is currently working on moving to ctcdecode, which among other things should fix this issue.
Could anyone who's seeing this issue test the new decoder on master?
There's native client builds here: https://tools.taskcluster.net/groups/FyclewklSUqN6FXHavrhKQ
The acoustic model is the same as v0.2, and the trie is in data/lm/trie.ctcdecode after you update to latest master. Testing with some problematic examples I had shows much better results, but the links in this thread are all broken so I couldn't test with your files.
Let me know how it goes.
Sorry, those instructions are incorrect. The acoustic model is the same as v0.2 but you need to re-export it with the master code. Alternatively you can grab it from here: https://github.com/reuben/DeepSpeech/releases/tag/v0.2.0-prod-ctcdecode
@reuben 's new work is working well for me on long, clean recordings.
I'm using:
- the deepspeech Node package, npm link'ed locally
- output_graph.pbmm from Reuben's release (as linked above)
The inference for a 45s podcast snippet seems pretty decent:
why early on in the night i mean i think there are a couple of states that are going to be really keep kentucky and virginia kentucky closes its poles a half in the eastern times on half in the central time on so that means that half of the states at six o'clock to visit seven o'clock and so have a lot of results and in watching one particular congressional district raciness district between antibarbarus i disengaged morabaraba a republican and this is a race that really should not be on the map this is a race that should be republican territory and if this race is a searching for much of the night in the democratizing well there that's a pretty good sign that the wave will be building
The inference for two recordings I made myself is almost totally wrong, but does not have incorrectly dropped spaces. I'm guessing the poor results are due to recording quality?
12s recording made with Bose QC35 II
he gravitationless theocratic circuitously manipulate intermediately creation of images and a frame buffer intended for alcohol
11s recording made with mid-'13 Macbook Air built-in mic
a gravitational latrocinia idly manipulate an alternator exploration of images in a frame of her intolerable
@spencer-brown On the recordings you made yourself, did you record directly to 16kHz, 16-bit, mono audio? (The recordings sound like they were made at a lower sample rate and/or bit depth.)
Also, I'd tend to agree that the drop in the recording quality is likely largely to blame for the poor results on the recordings you made yourself. We're currently training models that will be more robust to background noise.
Ah, no, I did not - thanks! In follow-up tests using those settings I'm seeing about 50% accuracy with the Bose headphones and nearly 0% with the MacBook Air mic. The recordings still sound crackly relative to the training recordings.
Re: background-noise-robust models - exciting!
For anyone else still having trouble with this, I was able to make it work in the end by installing PyTorch along with the ctcdecode library and then using that on top of my existing code; it worked right out of the gate with a KenLM language model!
@f90 you shouldn't need PyTorch (or the ctcdecode library) to use the new native client, the decoder is built-in.
I'm also experiencing the same issue, with words gluing together. I'm trying to run the new version as described by @spencer-brown above, but I'm running into some issues.
DeepSpeech v0.3 is working on my system, but the new version is throwing an error.
I'm using:
I downloaded the files, ran npm install, and then ran the command:
node client.js --audio="./--audios_for_testing/90secondtest.wav" --model="./output_graph.pbmm" --trie="./trie.ctcdecode" --lm="./deepspeech_models/lm.binary" --alphabet="./deepspeech_models/alphabet.txt"
This is the output:
Loading model from file ./output_graph.pbmm
TensorFlow: v1.11.0-11-gbee825492f
DeepSpeech: unknown
2018-11-03 17:35:42.654139: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
dyld: lazy symbol binding failed: Symbol not found: __ZN2v87Isolate19CheckMemoryPressureEv
Referenced from: /Users/derekpankaew/Dropbox/Javascript Programming/speech_recognition/lib/binding/v1.0.0/darwin-x64/node-v57/deepspeech.node
Expected in: flat namespace
dyld: Symbol not found: __ZN2v87Isolate19CheckMemoryPressureEv
Referenced from: /Users/derekpankaew/Dropbox/Javascript Programming/speech_recognition/lib/binding/v1.0.0/darwin-x64/node-v57/deepspeech.node
Expected in: flat namespace
Abort trap: 6
Would love to get the new version to work - any thoughts?
Your output shows that it's not an official build. Please use official builds before reporting issues, and please give more context on your system.
The binary files and trie in https://github.com/mozilla/DeepSpeech/tree/master/data/lm alleviate this long-word problem. However, my results are not as good as @spencer-brown's for the same text.
I applied deepspeech with the binary files and trie mentioned above (all the rest is just a straight application of the instructions in "Using the model" of https://github.com/mozilla/DeepSpeech).
Using ffmpeg to change the sampling rate to 16000
ffmpeg -i midterm-update-clipped.wav -acodec pcm_s16le -ac 1 -ar 16000 midterm-update-clipped2.wav
I get the following transcription for the 45-second podcast mentioned above (https://drive.google.com/file/d/1rmje0llC-PXJgTiAiuQcsPRSjaaWfsv_/view?usp=sharing):
Loading model from file models/output_graph.pbmm
TensorFlow: v1.11.0-9-g97d851f
DeepSpeech: v0.3.0-0-gef6b5bd
Loaded model in 0.013s.
Loading language model from files models/lm2.binary models/trie2
Loaded language model in 0.000145s.
Running inference.
why early on in the night i mean i think there are a couple states that are going to be really keep can tucky and virginia contucky closes its poles a half in te the eastern times own half an the sentral times on so that means that half of the states at six o'clock afh o vis had seven o'clock ah and joll have a lot of results and an waschings one particular congressional district race o six congressinal district between a andi bar and maganme graph i ad bis emmigrass the democrat bars in combent a republican and this is a race that really should not be on the map this is a race that should be republican territory and if this race is a u seem magrapha leading for much of the night and de democratis doing well there that's a pretty good sign that the wave will be building
Inference took 41.082s for 48.489s audio file.
When using a band-pass filter:
ffmpeg -i midterm-update-clipped.wav -acodec pcm_s16le -ac 1 -ar 16000 -af lowpass=3000,highpass=200 midterm-update-clipped3.wav
I get a slightly better transcription:
TensorFlow: v1.11.0-9-g97d851f
DeepSpeech: v0.3.0-0-gef6b5bd
Loaded model in 0.0128s.
Loading language model from files models/lm2.binary models/trie2
Loaded language model in 0.000105s.
Running inference.
why early on in the night i mean i think there are a couple states that are going to be really keep can tucky and virginia contucky closes its poles a half in the the eastern times own half in the central times on so that means that half of the states it six o'clock atfe vits ad seven o'clock ah and toll have a lot of results and an waschings one particular congressional district race o six congressial district between a andi bar and maganmc graph i ed es emmograss the democrat bars in combent a republican and this is a race that really should not be on the map this is a race that should be republican territory and if this race is a you seem mograph leading for much of the night and te democratis doing well there thats a pretty good sign that the wave will be building
Inference took 43.352s for 48.489s audio file.
If anyone knows tricks to further improve results I would be really interested :)
Mozilla DeepSpeech will sometimes create long runs of text with no spaces:
This happens even with short audio clips (4 seconds) from a native speaker of American English, recorded using a high-quality microphone on Mac OS X laptops. I've isolated the problem to the interaction with the language model rather than the acoustic model or the length of the audio clips, as the problem goes away when the language model is turned off.
The problem might be related to encountering out-of-vocabulary terms.
I’ve put together test files with results that show the issue is related to the language model somehow rather than the length of the audio or the acoustic model.
I’ve provided 10 chunked WAV files at 16khz 16 bit depth, each 4 seconds long, that are a subset of a fuller 15 minute audio file (I have not provided that full 15 minute file, as a few shorter reproducible chunks are sufficient to reproduce the problem):
https://www.dropbox.com/sh/3qy65r6wo8ldtvi/AAAAVinsD_kcCi8Bs6l3zOWFa?dl=0
The audio segments deliberately include occasional out-of-vocabulary terms, mostly technical, such as “OKR”, “EdgeStore”, “CAPE”, etc.
Also in that folder are several text files that show the output with the standard language model being used, with the garbled glued-together words (chunks_with_language_model.txt).
Then, I've provided similar output with the language model turned off (chunks_without_language_model.txt).
I've included both these files in the shared Dropbox folder link above.
Here's what the correct transcript should be, manually transcribed (chunks_correct_manual_transcription.txt).
This shows the language model is the source of this problem; I've seen anecdotal reports from the official message base and blog posts that this is a widespread problem. Perhaps when the language model hits an unknown n-gram, it ends up combining all of the words together rather than retaining the spaces between them.
Discussion around this bug started on the standard DeepSpeech discussion forum: https://discourse.mozilla.org/t/text-produced-has-long-strings-of-words-with-no-spaces/24089/13 https://discourse.mozilla.org/t/longer-audio-files-with-deep-speech/22784/3
The standard client.py was slightly modified to segment the longer 15-minute audio clip into 4-second blocks.
Mac OS X 10.12.6 (16G1036)
Both Mozilla DeepSpeech and TensorFlow were installed into a virtualenv setup via the following requirements.txt file:
Did not compile from source.
Same
Used CPU only version
Used CPU only version
I haven't provided my full modified client.py that segments longer audio, but to run with a language model using the standard deepspeech command against a known 4-second audio clip included in the Dropbox folder shared above, you can run the following:
This is clearly a bug and not a feature :)