Closed yoann1995 closed 6 years ago
Facing the same issue on MacOS Sierra
We've very recently switched the format of our language model and tooling on master. It looks like you're using the pre-built generate_trie binary from the 0.1.1 release. You can fix this by using consistent versions of the various tools, including the training code. If you want to use the 0.1.1 binaries, work from the v0.1.1 tag, not the master branch. Alternatively you can work from master and build things yourself.
Hope this helps.
Thank you for your reply! I tried using the v0.1.1 tag, but I still have the same issue.
If I want to stay on the master branch, which tools can I use to generate a language model in this new format? (KenLM seems to produce only probing hash tables.)
KenLM does not use probing hash tables only: https://kheafield.com/code/kenlm/structures/
Depending on which structure you use for your KenLM model, you'll need to adjust the typedefs in beam_search.h and generate_trie.cpp. Build generate_trie following the normal commands in the documentation.
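For what it's worth, building generate_trie from source was a short sequence at the time; the Bazel target name and checkout layout below are assumptions taken from that era's native_client README, so verify them against your own checkout:

```shell
# Hypothetical sketch: build generate_trie from a DeepSpeech checkout.
# Assumes a TensorFlow checkout configured as described in native_client/README.md.
bazel build --config=opt //native_client:generate_trie

# The resulting binary ends up under bazel-bin:
ls bazel-bin/native_client/generate_trie
```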
Thank you for your reply! I tried using the v0.1.1 tag, but I still have the same issue.
That should not happen if you're using the v0.1.1 generate_trie as well.
I followed the steps to build generate_trie on the v0.1.1 version, but no generate_trie file is created at the end. I must be missing something. I can only get this file by using the command:
python util/taskcluster.py --target .
But if I understand correctly, this will give me the generate_trie binary that uses the quantization and array-compressed pointers format?
@yoann1995 Downloading from taskcluster should get you the newer file format, so looking at your issue it should fix it :)
No, because the newer file format is quantization and array-compressed pointers, and I need probing hash tables. :/
@yoann1995 Why don't you get the binary from v0.1.1, then?
Well, I tried to do it from 0.1.1 but I still had the same issue. Thank you for the link, it works now! Sorry for the inconvenience and thank you for your help! :)
Still, it'd be good to figure out why you cannot get that on master. Maybe we need to fix some docs? We made the change rather recently, as @reuben said, so it's possible we're missing something :), and I don't remember the flags for KenLM. Is it possible the binary being referred to is actually lm.binary and not the generate_trie binary?
The generate_trie binary is not really the problem for me. The real problem is that I don't know how to generate my lm.binary file in anything other than the probing hash tables format.
Yeah, which makes sense since Vincent's guide is old now :-). I should still have a KenLM build around that I used to test that; I'll try to find the proper arguments :-)
Okay, thanks ! :)
@yoann1995
Usage: ./build_binary [-u log10_unknown_probability] [-s] [-i] [-w mmap|after] [-p probing_multiplier] [-T trie_temporary] [-S trie_building_mem] [-q bits] [-b bits] [-a bits] [type] input.arpa [output.mmap]
So make sure [type] is trie and not probing; I think this is what you need.
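To make that concrete, the choice of data structure is just the positional [type] argument to build_binary. A sketch assuming a standard KenLM build, with illustrative file names and bit widths:

```shell
# Probing hash tables: KenLM's default, fastest lookup but largest files.
# This is the format the old (v0.1.1) tooling expected.
./build_binary probing lm.arpa lm_probing.binary

# Plain trie: smaller; note that the master tooling at the time also
# expected quantization and pointer compression (next command).
./build_binary trie lm.arpa lm.binary

# Trie with quantization (-q, -b) and array-compressed pointers (-a),
# matching the "quantization and array-compressed pointers" phrasing in the error.
./build_binary trie -q 8 -b 7 -a 64 lm.arpa lm_small.binary
```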
Thank you for your reply :) This worked for me on Mac - https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.1.1.osx/artifacts/public/native_client.tar.xz
This works for all other Linux systems - https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.1.1.cpu/artifacts/public/native_client.tar.xz
I'm having an issue building a trie model with the generate_trie script! The only problem is that the build process just gets killed after using almost 40 GB of memory! My binary language model is 1.8 GB, and my vocab file is around 25 GB. Is there any way to determine how much memory is required to generate the trie file?
@spate141 I'd rather question the size of your files; they look way too big.
@lissyx Thanks for the reply! Yes, the file size was the issue, so I created a smaller version of the language model (ngram 1=2782949, ngram 2=2874575, ngram 3=1665316) from a smaller vocab.txt file, and I was able to generate a new trie file along with it.
One more thing I noticed: if I use my newly trained language model with DeepSpeech's trie model file, I get more accurate transcriptions than if I use my custom LM with a newly generated trie model file. I'm not sure about this behavior. Any thoughts?
@spate141 Can we take that discussion to Discourse?
@lissyx: sure! I'll add the link here.
Discourse: https://discourse.mozilla.org/t/lm-trie-performance/29544
I'm using the v0.2.0-alpha.6 native_client binary files, and getting a similar type of error:
My KenLM language model is trie-based; here is the command:
./build_binary trie lm.arpa lm.binary
Error on generate_trie:
./generate_trie alphabet.txt lm.binary vocab.txt trie
terminate called after throwing an instance of 'lm::FormatLoadException'
what(): native_client/kenlm/lm/binary_format.cc:131 in void lm::ngram::MatchCheck(lm::ngram::ModelType, unsigned int, const lm::ngram::Parameters&) threw FormatLoadException.
The binary file was built for trie but the inference code is trying to load trie with quantization and array-compressed pointers
Aborted (core dumped)
@spate141 Yes, the format changed, use the proper parameters to rebuild the language model, see my comment above.
@spate141, this command for KenLM's build_binary fixed the issue: bin/build_binary trie -q 16 -b 7 -a 64 set5.arpa set5.lm.binary
@lissyx
$ ./generate_trie ../tensorflow/DeepSpeech/data/alphabet.txt lm.binary set5.txt /tmp/trie_lm_trie_long
terminate called after throwing an instance of 'boost::locale::conv::conversion_error' what(): Conversion failed Aborted
However, if I use your vocab.txt (available at DeepSpeech/data/lm/vocab.txt, ~94k lines) with my lm.binary, generate_trie is able to generate a trie for me successfully.
I have observed the same behavior even with version 0.1.1 of your generate_trie (that is, when I supply a vocab of 4 million lines it throws this error, whereas when I supply your vocab it generates the trie). So this is not an issue with the generate_trie version.
My question to you:
@abuvaneswari Please avoid hijacking issues for questions like that, and please use proper code formatting when including code / console output.
I have no idea why you have this boost::locale error; likely some bogus data somewhere. Since it works with our data, you need to find what is wrong on your side.
The file vocab.txt is used to build the language model lm.binary.
The official model release was using a vocab.txt file we were unable to redistribute; I'm not sure if it was TED talks. This is changing with 0.2.0.
You might lose accuracy, and waste time debugging, by using another vocab.txt. We spent weeks training and comparing different datasets and build parameters to select the best compromise.
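For anyone reconstructing the pipeline: vocab.txt feeds KenLM, whose output feeds generate_trie. A sketch assuming a local KenLM build and a master-branch generate_trie; file names and the n-gram order are illustrative:

```shell
# 1. Estimate an ARPA language model from the vocabulary text.
bin/lmplz --order 3 --text vocab.txt --arpa lm.arpa

# 2. Binarize it as a quantized, pointer-compressed trie (the format
#    the newer inference code loads; bit widths are illustrative).
bin/build_binary trie -q 8 -b 7 -a 64 lm.arpa lm.binary

# 3. Build the decoder trie from the alphabet, binary LM, and vocabulary.
./generate_trie alphabet.txt lm.binary vocab.txt trie
```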
This is still broken - can we re-open this issue and fix the problem? The current release does NOT work.
@Cloudmersive Others got this to work[1], so I'd guess you're doing something different from what they are doing. Could you describe what you've done?
I am having the same error.
terminate called after throwing an instance of 'lm::FormatLoadException'
what(): native_client/kenlm/lm/binary_format.cc:131 in void lm::ngram::MatchCheck(lm::ngram::ModelType, unsigned int, const lm::ngram::Parameters&) threw FormatLoadException.
The binary file was built for probing hash tables but the inference code is trying to load trie with quantization and array-compressed pointers
Aborted (core dumped)
DeepSpeech current master, 0.2.1-alpha.0. I have created the language model for Urdu in binary format, using this link: https://yidatao.github.io/2017-05-31/kenlm-ngram/ Next, I want to generate the trie. I have the native client from the current DeepSpeech master.
Kindly give me a solution, or identify my mistake. Thank you!
I followed this link to generate the trie but got errors: https://discourse.mozilla.org/t/tutorial-how-i-trained-a-specific-french-model-to-control-my-robot/22830
@yoann1995
Usage: ./build_binary [-u log10_unknown_probability] [-s] [-i] [-w mmap|after] [-p probing_multiplier] [-T trie_temporary] [-S trie_building_mem] [-q bits] [-b bits] [-a bits] [type] input.arpa [output.mmap]
So make sure [type] is trie and not probing; I think this is what you need.
Does this generate the trie? It would need a trained model, which I don't have yet. I believe generating the trie comes before training. What should I put for output.mmap, or do I leave the command just like that?
Kindly give me a solution, or identify my mistake.
Read the error and try to understand it; it's telling you exactly where the problem lies:
The binary file was built for probing hash tables but the inference code is trying to load trie with quantization and array-compressed pointers
You are using mismatched versions, but since you don't document what you do or how you do it, we cannot help you.
DeepSpeech current master, 0.2.1-alpha.0. I followed this link to generate the trie but got errors.
Help yourself and others by sharing more details ...
We've very recently switched the format of our language model and tooling on master. It looks like you're using the pre-built generate_trie binary from the 0.1.1 release. You can fix this by using consistent versions of the various tools, including the training code. If you want to use the 0.1.1 binaries, work from the v0.1.1 tag, not the master branch. Alternatively you can work from master and build things yourself.
Hope this helps.
I am using DeepSpeech master, version 0.2.1-alpha.0, but I am facing the same challenge.
You are facing the same challenge, but have not yet provided us with more details so that we can help you. If you cannot be more constructive, I cannot see how we can help you.
So @Hafsa26, when do you think you can share more details with us on your environment and what you are doing, so we can start helping you?
Using given link : https://yidatao.github.io/2017-05-31/kenlm-ngram/
So @Hafsa26, that's not documentation we wrote, so it's unlikely to match our use case. However, if you read our documentation, we explicitly explain how to do it: https://github.com/mozilla/DeepSpeech/blob/v0.2.1-alpha.0/data/lm/README.md
I am using the current DeepSpeech master, 0.2.1-alpha.0, and installed all its requirements. My installation worked fine with the Common Voice datasets, and I trained on that dataset as well.

Now I am working on an Urdu-language dataset containing 708 sentences. I prepared separate test, train, and dev directories, as well as .csv files for these audio files and their transcriptions, i.e. the standard way of preparing data for DeepSpeech. I created the Urdu language model in binary format using this link: https://yidatao.github.io/2017-05-31/kenlm-ngram/

Next, I want to generate the trie. I have the native client from the current DeepSpeech master. I was getting this error:

terminate called after throwing an instance of 'lm::FormatLoadException'
what(): native_client/kenlm/lm/binary_format.cc:131 in void lm::ngram::MatchCheck(lm::ngram::ModelType, unsigned int, const lm::ngram::Parameters&) threw FormatLoadException.
The binary file was built for probing hash tables but the inference code is trying to load trie with quantization and array-compressed pointers
Aborted (core dumped)

I removed this error by rebuilding the binary file with ./build/bin/build_binary trie -a 64 -q 8 -b 7 urdu_3gram.arpa lm.binary following https://kheafield.com/code/kenlm/structures/. The built lm.binary now has quantization and pointer compression. But now I am getting another error: Invalid label, Aborted (core dumped).
Kindly help, or identify my mistake. Thank you!
@lissyx All my other files with the Urdu language data are also prepared. Please help me with this. Thank you!
A text trie file is created as the result of the command below, but only "1" is written in it: /home/rc/Downloads/DeepSpeech-master/native_client/generate_trie /home/rc/Downloads/DeepSpeech-master/data/alphabet.txt /home/rc/Downloads/DeepSpeech-master/data/lm/lm.binary /home/rc/Downloads/DeepSpeech-master/data/vocab.txt /home/rc/Downloads/DeepSpeech-master/data/trie @lissyx
Let me know if something is missing in order to identify the error. I will give more details about that then. Thank you! @lissyx
Invalid label
This usually means there's some missing character in the alphabet.
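One way to hunt for the offending character is to diff the character set of vocab.txt against alphabet.txt (which lists one character per line); a sketch using standard coreutils, assuming a UTF-8 locale and an alphabet.txt with no comment lines:

```shell
# List every distinct character appearing in vocab.txt, one per line
# (grep -o . emits each character on its own line in a UTF-8 locale).
grep -o . vocab.txt | sort -u > /tmp/vocab_chars.txt

# Normalize the alphabet the same way.
sort -u alphabet.txt > /tmp/alphabet_chars.txt

# Characters in the vocabulary but not in the alphabet: these are the
# labels generate_trie would reject as invalid.
comm -23 /tmp/vocab_chars.txt /tmp/alphabet_chars.txt
```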
@lissyx Okay. I'm gonna check it. Thanks!
Am I on the right track to get the trie file ?
Am I on the right track to get the trie file ?
No, but that's because you don't want to read the documentation. You are passing too many arguments to generate_trie ...
Using given link : https://yidatao.github.io/2017-05-31/kenlm-ngram/
I've already said that this is not the proper documentation to follow. If you insist on using unrelated documentation when we have the proper one in the tree under data/lm/README.md, we cannot continue to help you.
I read that. I gathered from the code that the trie file is quantized and has pointer compression, and then the trie is built.
Sorry, I mean the binary file to be built is quantized and has pointer compression. That's what I did: previously my lm.binary used probing hash tables, and now I have changed it to a trie with quantization and pointer compression.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I'm following this tutorial to create a French model: https://discourse.mozilla.org/t/tutorial-how-i-trained-a-specific-french-model-to-control-my-robot/22830
The problem occurs when generating the trie file with this command:
I get this output:
I tried several times to generate my lm.binary with KenLM (./build_binary -T -s words.arpa lm.binary), but I still get the same error.