mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0
25.16k stars 3.95k forks

Generate trie lm::FormatLoadException #1407

Closed yoann1995 closed 6 years ago

yoann1995 commented 6 years ago

I'm following this tutorial to create a French model: https://discourse.mozilla.org/t/tutorial-how-i-trained-a-specific-french-model-to-control-my-robot/22830

The problem comes when generating the trie file with this command:

./generate_trie data/cassia/alphabet.txt data/cassia/lm.binary data/cassia/vocabulary.txt data/cassia/trie

I get this output:

terminate called after throwing an instance of 'lm::FormatLoadException'
  what():  native_client/kenlm/lm/binary_format.cc:131 in void lm::ngram::MatchCheck(lm::ngram::ModelType, unsigned int, const lm::ngram::Parameters&) threw FormatLoadException.
The binary file was built for probing hash tables but the inference code is trying to load trie with quantization and array-compressed pointers
Aborted (core dumped)

I tried several times to regenerate my lm.binary with KenLM (./build_binary -T -s words.arpa lm.binary), but I still get the same error.

hhk998402 commented 6 years ago

I'm facing the same issue on macOS Sierra.

reuben commented 6 years ago

We've very recently switched the format of our language model and tooling on master. It looks like you're using the pre-built generate_trie binary from the 0.1.1 release. You can fix this by using consistent versions of the various tools, including the training code. If you want to use the 0.1.1 binaries, work from the v0.1.1 tag, not the master branch. Alternatively you can work from master and build things yourself.

Hope this helps.

yoann1995 commented 6 years ago

Thank you for your reply! I tried to use the v0.1.1 tag but I still have the same issue.

If I want to stay on the master branch, which tools can I use to generate a language model in this new format? (KenLM seems to use only probing hash tables.)

reuben commented 6 years ago

KenLM does not use only probing hash tables: https://kheafield.com/code/kenlm/structures/

Depending on which structure you use for your KenLM model you'll need to adjust the typedefs in beam_search.h and generate_trie.cpp.

Build generate_trie following the normal commands in the documentation.

reuben commented 6 years ago

Thank you for your reply! I tried to use the v0.1.1 tag but I still have the same issue.

That should not happen if you're using the v0.1.1 generate_trie as well.

yoann1995 commented 6 years ago

I followed the steps to build generate_trie on the v0.1.1 tag, but no generate_trie file is created at the end. I must be missing something. I can only get this file by using the command:

python util/taskcluster.py --target .

But if I understand correctly, this gives me the generate_trie binary that uses the quantization and array-compressed pointers format?

lissyx commented 6 years ago

@yoann1995 Downloading from taskcluster should get you the newer file format, so looking at your issue it should fix it :)

yoann1995 commented 6 years ago

No, because the newer file format is quantization and array-compressed pointers, and I need probing hash tables. :/

lissyx commented 6 years ago

@yoann1995 Why don't you get the binary from v0.1.1 then?

lissyx commented 6 years ago

https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.1.1.cpu/artifacts/public/native_client.tar.xz

yoann1995 commented 6 years ago

Well, I tried to do it from 0.1.1 but I still had the same issue. Thank you for the link, it works now! Sorry for the inconvenience and thank you for your help! :)

lissyx commented 6 years ago

Still, it'd be good to figure out why you cannot get that working from master. Maybe we need to fix some docs? We made the change rather recently, as @reuben said, so it's possible we're missing something :), and I don't remember the flags for KenLM. Is it possible the binary being referred to is actually lm.binary and not the generate_trie binary?

yoann1995 commented 6 years ago

The generate_trie binary is not really the problem for me. The real problem is that I don't know how to generate my lm.binary file in any format other than probing hash tables.

lissyx commented 6 years ago

Yeah, which makes sense, since Vincent's guide is old now :-). I should still have a KenLM build ready that I used to test this; I'll try to find the proper arguments :-)

yoann1995 commented 6 years ago

Okay, thanks ! :)

lissyx commented 6 years ago

@yoann1995

Usage: ./build_binary [-u log10_unknown_probability] [-s] [-i] [-w mmap|after] [-p probing_multiplier] [-T trie_temporary] [-S trie_building_mem] [-q bits] [-b bits] [-a bits] [type] input.arpa [output.mmap]

So make sure [type] is trie and not probing; I think this is what you need.
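For anyone landing here: the difference comes down to the [type] argument of KenLM's build_binary. A minimal sketch, with illustrative file names (and assuming, per the KenLM docs, that probing is the default when [type] is omitted):

```shell
# Builds a probing-hash-table model -- this is what triggers the
# FormatLoadException above, since probing is also the default:
./build_binary probing words.arpa lm.binary

# Builds a trie model instead, which is what the newer generate_trie expects:
./build_binary trie words.arpa lm.binary
```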

hhk998402 commented 6 years ago

Thank you for your reply :) This worked for me on Mac - https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.1.1.osx/artifacts/public/native_client.tar.xz

This works for all other Linux systems - https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.1.1.cpu/artifacts/public/native_client.tar.xz

spate141 commented 6 years ago

I'm having an issue building the trie with the generate_trie script! The only problem is that the build process just gets killed after using almost 40 GB of memory! My binary language model is 1.8 GB, and my vocab file is around 25 GB. Is there any way to determine how much memory is required to generate the trie file?

lissyx commented 6 years ago

@spate141 I'd rather question the size of your files; they look way too big.

spate141 commented 6 years ago

@lissyx Thanks for the reply! Yes, the file size was the issue; so I created a smaller version of the language model (ngram 1=2782949, ngram 2=2874575, ngram 3=1665316) from a small vocab.txt file, and I was able to generate a new trie file along with it.

One more thing I noticed: if I use my newly trained language model with DeepSpeech's trie file, I get more accurate transcriptions than if I use my custom LM with a newly generated trie file. I'm not sure about this behavior. Any thoughts?

lissyx commented 6 years ago

@spate141 Can we take that discussion to Discourse?

spate141 commented 6 years ago

@lissyx: sure! I'll add the link here.

Discourse: https://discourse.mozilla.org/t/lm-trie-performance/29544

spate141 commented 6 years ago

I'm using the v0.2.0-alpha.6 native_client binary files, and getting a similar error:

My KenLM language model is built as a trie; here's the command: ./build_binary trie lm.arpa lm.binary

Error on generate_trie:

./generate_trie alphabet.txt lm.binary vocab.txt trie

terminate called after throwing an instance of 'lm::FormatLoadException'
  what():  native_client/kenlm/lm/binary_format.cc:131 in void lm::ngram::MatchCheck(lm::ngram::ModelType, unsigned int, const lm::ngram::Parameters&) threw FormatLoadException.
The binary file was built for trie but the inference code is trying to load trie with quantization and array-compressed pointers
Aborted (core dumped)

lissyx commented 6 years ago

@spate141 Yes, the format changed, use the proper parameters to rebuild the language model, see my comment above.

abuvaneswari commented 6 years ago

@spate141, this command for the KenLM build_binary fixed the issue: bin/build_binary trie -q 16 -b 7 -a 64 set5.arpa set5.lm.binary
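For reference, the extra flags control the lossy compression of the trie, per the KenLM data-structures page (the values below are just the ones from the command above, not recommendations):

```shell
# -q 16 : quantize probabilities to 16 bits
# -b 7  : quantize backoff weights to 7 bits
# -a 64 : compress pointers with an offset array, encoding at most 64 bits
bin/build_binary trie -q 16 -b 7 -a 64 set5.arpa set5.lm.binary
```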

abuvaneswari commented 6 years ago

@lissyx

$ ./generate_trie ../tensorflow/DeepSpeech/data/alphabet.txt lm.binary set5.txt /tmp/trie_lm_trie_long

terminate called after throwing an instance of 'boost::locale::conv::conversion_error'
  what():  Conversion failed
Aborted

However, if I use your vocab.txt (available at DeepSpeech/data/lm/vocab.txt, 94k lines) with my lm.binary, generate_trie is able to generate a trie for me successfully.

I have observed the same behavior even with version 0.1.1 of your generate_trie (that is, when I supply a vocab of 4 million lines it throws this error, whereas when I supply your vocab it generates the trie). So this is not an issue with the generate_trie version.

My question to you:

lissyx commented 6 years ago

@abuvaneswari Please avoid hijacking issues for questions like that, and please use proper code formatting when including code / console output.

I have no idea why you have this boost::locale error, likely some bogus data somewhere. Since it works with our data, you need to find what is wrong on your side.

The file vocab.txt is used to build the language model lm.binary.

The official model releases were using a vocab.txt file we were unable to redistribute; I'm not sure if it was TED talks. This is changing with 0.2.0.

You might lose accuracy, and time debugging it, by using another vocab.txt. We spent weeks training and comparing different datasets and build parameters to select the best compromise.

Cloudmersive commented 6 years ago

This is still broken - can we re-open this issue and fix the problem? The current release does NOT work.

kdavis-mozilla commented 6 years ago

@Cloudmersive Others got this to work[1], so I'd guess you're doing something different from what they are doing. Could you describe what you've done?

Hafsa26 commented 5 years ago

I am having the same error.

terminate called after throwing an instance of 'lm::FormatLoadException'
  what():  native_client/kenlm/lm/binary_format.cc:131 in void lm::ngram::MatchCheck(lm::ngram::ModelType, unsigned int, const lm::ngram::Parameters&) threw FormatLoadException.
The binary file was built for probing hash tables but the inference code is trying to load trie with quantization and array-compressed pointers
Aborted (core dumped)

I'm on DeepSpeech current master, 0.2.1-alpha.0. I have created an Urdu language model in binary format using this link: https://yidatao.github.io/2017-05-31/kenlm-ngram/. Next, I want to generate the trie. I have the native client from DeepSpeech current master.

Kindly give me a solution, or identify my mistake. Thank you!

Hafsa26 commented 5 years ago

I followed the given link to generate the trie but got errors: https://discourse.mozilla.org/t/tutorial-how-i-trained-a-specific-french-model-to-control-my-robot/22830

Hafsa26 commented 5 years ago

@yoann1995

Usage: ./build_binary [-u log10_unknown_probability] [-s] [-i] [-w mmap|after] [-p probing_multiplier] [-T trie_temporary] [-S trie_building_mem] [-q bits] [-b bits] [-a bits] [type] input.arpa [output.mmap]

So make sure [type] is trie and not probing, I think this is what you need.

Does this generate the trie? It would need a trained model, which I don't have yet. I believe generating the trie comes before training. What do I put in output.mmap, or do I leave the command as is?

lissyx commented 5 years ago

Give me solution kindly. or identify my mistake.

Read the error and try to understand it; it's telling you exactly where the problem lies:

The binary file was built for probing hash tables but the inference code is trying to load trie with quantization and array-compressed pointers

You are using mismatching things, but since you don't document what you do or how you do it, we cannot help you.

Deepspeech current Master 0.2.1.Alpha 0. I followed given link to generate trie but got errors.

Help yourself and others, share more details ...

Hafsa26 commented 5 years ago

We've very recently switched the format of our language model and tooling on master. It looks like you're using the pre-built generate_trie binary from the 0.1.1 release. You can fix this by using consistent versions of the various tools, including the training code. If you want to use the 0.1.1 binaries, work from the v0.1.1 tag, not the master branch. Alternatively you can work from master and build things yourself.

Hope this helps.

I am using DeepSpeech master, version 0.2.1-alpha.0, but I am facing the same challenge.

lissyx commented 5 years ago

We've very recently switched the format of our language model and tooling on master. It looks like you're using the pre-built generate_trie binary from the 0.1.1 release. You can fix this by using consistent versions of the various tools, including the training code. If you want to use the 0.1.1 binaries, work from the v0.1.1 tag, not the master branch. Alternatively you can work from master and build things yourself. Hope this helps.

I am using DeepSpeech master, version 0.2.1-alpha.0, but I am facing the same challenge.

You are facing the same challenge, but you have not yet provided us with more details so that we can help you. If you cannot be more constructive, I cannot see how we can help you.

lissyx commented 5 years ago

So @Hafsa26, when do you think you can share more details with us on your environment and what you do, so we can start helping you?

lissyx commented 5 years ago

Using given link : https://yidatao.github.io/2017-05-31/kenlm-ngram/

So @Hafsa26, that's not documentation we wrote; it's unlikely to match our use case. However, if you read the documentation, we explicitly explain and instruct you how to do it: https://github.com/mozilla/DeepSpeech/blob/v0.2.1-alpha.0/data/lm/README.md

Hafsa26 commented 5 years ago

I am using DeepSpeech current master (0.2.1-alpha.0) and installed all its requirements. My installation worked fine with the Common Voice datasets, and I trained on that dataset as well. Now I am working on an Urdu language dataset containing 708 sentences. I prepared separate test, train and dev directories, as well as .csv files for these audio files along with their transcriptions, the standard way of preparing data for DeepSpeech. I created the Urdu language model in binary format using this link: https://yidatao.github.io/2017-05-31/kenlm-ngram/. Next, I want to generate the trie. I have the native client from DeepSpeech current master. I was getting this error:

terminate called after throwing an instance of 'lm::FormatLoadException'
  what():  native_client/kenlm/lm/binary_format.cc:131 in void lm::ngram::MatchCheck(lm::ngram::ModelType, unsigned int, const lm::ngram::Parameters&) threw FormatLoadException.
The binary file was built for probing hash tables but the inference code is trying to load trie with quantization and array-compressed pointers
Aborted (core dumped)

I removed this error by rebuilding the binary file with ./build/bin/build_binary trie -a 64 -q 8 -b 7 urdu_3gram.arpa lm.binary, following https://kheafield.com/code/kenlm/structures/. The built lm.binary now has quantization and pointer compression. But now I am getting a different error: Invalid label, Aborted (core dumped).

Kindly help, or identify my mistake. Thank you!

Hafsa26 commented 5 years ago

@lissyx All my other files with the Urdu language data are also prepared. Please help me with this. Thank you!

Hafsa26 commented 5 years ago

A text trie file is created as a result of the command below, but only a 1 is written in it. @lissyx

/home/rc/Downloads/DeepSpeech-master/native_client/generate_trie /home/rc/Downloads/DeepSpeech-master/data/alphabet.txt /home/rc/Downloads/DeepSpeech-master/data/lm/lm.binary /home/rc/Downloads/DeepSpeech-master/data/vocab.txt /home/rc/Downloads/DeepSpeech-master/data/trie

Hafsa26 commented 5 years ago

Let me know if something is missing in order to identify the error, and I will give more details then. Thank you! @lissyx

lissyx commented 5 years ago

Invalid label

This usually means there's some missing character in the alphabet.
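A quick way to check for that is to diff the character sets of the two files. A sketch with illustrative file names (assuming the DeepSpeech convention of one alphabet character per line, and a UTF-8 locale):

```shell
# Illustrative stand-ins for real data files -- note 'd' is missing
# from the alphabet:
printf 'abcd\n' > vocabulary.txt
printf 'a\nb\nc\n' > alphabet.txt

# List every distinct character in the vocabulary that is absent from
# the alphabet; any output here is a candidate cause of "Invalid label":
grep -o . vocabulary.txt | sort -u > vocab_chars.txt
grep -o . alphabet.txt | sort -u > alphabet_chars.txt
comm -23 vocab_chars.txt alphabet_chars.txt   # prints: d
```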

Hafsa26 commented 5 years ago

@lissyx Okay. I'm gonna check it. Thanks!

Hafsa26 commented 5 years ago

Am I on the right track to get the trie file?

lissyx commented 5 years ago

Am I on the right track to get the trie file?

No, but that's because you don't want to read the documentation. You are passing too many arguments to generate_trie...

lissyx commented 5 years ago

Using given link : https://yidatao.github.io/2017-05-31/kenlm-ngram/

I've already said that this is not the proper documentation to follow. If you insist on using unrelated documentation when we have the proper one in the tree under data/lm/README.md, we cannot continue to help you.

Hafsa26 commented 5 years ago

I read that. I understood from the code that the binary file is quantized and has pointer compression, and then the trie is built.

Hafsa26 commented 5 years ago

Sorry, I mean the binary file to be built is quantized and has pointer compression. That's what I did. Previously my lm.binary used probing hash tables; now I changed it to a trie with quantization and pointer compression.

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.