Closed: cailurus closed this issue 5 years ago.
Sir @ailurus1991, I think the trie file is not generated correctly; it is an empty file. Please check once.
python3 util/taskcluster.py --branch " anything" --target new_native_client/
new_native_client/generate_trie alphabet.txt lm.binary vocab.txt trie
These are the commands I followed previously.
@MuruganR96 hey, thanks for your reply. I see your generate_trie command has four args?
Hmm, it seems the usage is: Usage: ./dest/generate_trie <alphabet> <lm_model> <trie_path>
I didn't find the vocab.txt arg...
BTW, I've tested this on a small English corpus and still got a segmentation fault... :(
My suggestion, @ailurus1991: did you create your lm.arpa from your vocab.txt, and then convert lm.arpa to the binary format lm.binary? You then give this lm.binary, together with vocab.txt, as input, and it will generate the correct trie file. Otherwise, try data/lm/README.md; it is also a very clear doc. (Or it could be a memory consumption error.)
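For reference, the full pipeline being described looks roughly like this (a sketch with placeholder file names, following the three-argument generate_trie usage the tool itself prints):

# 1. Build an ARPA language model from the vocabulary/corpus
kenlm/build/bin/lmplz -o 3 --text vocab.txt --arpa lm.arpa
# 2. Convert the ARPA model to KenLM's binary format
kenlm/build/bin/build_binary trie lm.arpa lm.binary
# 3. Build the decoder trie from the alphabet and the binary LM
./native_client/generate_trie alphabet.txt lm.binary trie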
@MuruganR96 yup, I followed the tutorial. I created the arpa file with
./lmplz --text lm_corpus.txt --arpa words.arpa --o 4
and converted the arpa to binary with
./build_binary -T -s words.arpa lm.binary
I didn't see any error output, so I think I got the right language model binary file... Did I miss something?
Yes, share more information with us. There's nothing we can do right now except tell you it works for us. Things like your system, a small test case, etc. Even just console output...
@ailurus1991 - could this be similar to #1745?
I guess the problem is the bin file generated by KenLM. @lissyx
The alphabet file (generated from the vocab): https://github.com/ailurus1991/ds_files/blob/master/thchs30-alphabet.txt
The vocab: https://github.com/ailurus1991/ds_files/blob/master/thchs30-vocabulary.txt
Then I used KenLM to generate the arpa and lm binary files:
./kenlm/build/bin/lmplz -o 3 --text zh/thchs30-vocabulary.txt --arpa zh/thu.arpa
=== 1/5 Counting and sorting n-grams ===
Reading /home/guokr/lm/zh/thchs30-vocabulary.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 32625 types 2886
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:34632 2:18771410944 3:35196395520
Statistics:
1 2886 D1=0.491463 D2=1.109 D3+=1.52561
2 25348 D1=0.840792 D2=1.1888 D3+=1.54784
3 31456 D1=0.954916 D2=1.63496 D3+=2.04508
Memory estimate for binary LM:
type kB
probing 1220 assuming -p 1.5
probing 1380 assuming -r models -p 1.5
trie 511 without quantization
trie 280 assuming -q 8 -b 8 quantization
trie 487 assuming -a 22 array pointer compression
trie 256 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:34632 2:405568 3:629120
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:34632 2:405568 3:629120
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:52884572 kB VmRSS:7136 kB RSSMax:12186844 kB user:0.769657 sys:2.89022 CPU:3.65991 real:3.64151
./kenlm/build/bin/build_binary -T -s zh/thu.arpa zh/thu.bin
Reading zh/thu.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
Then I used native_client/generate_trie to generate the trie. First, get the pre-built binary:
python3 util/taskcluster.py --arch gpu --target ./native_client
and ./native_client/generate_trie ~/lm/zh/thchs30-alphabet.txt ~/lm/zh/thu.bin ~/lm/zh/trie
then I got:
[1] 27569 segmentation fault ./native_client/generate_trie ~/lm/zh/thu.bin ~/lm/zh/trie
BTW, I tried an English bible corpus and everything worked well.
Sure, but no stack trace, nothing we can work on so far. Can you share more information?
@lissyx I found an interesting thing. I added an English word to my Chinese vocab file, like this: https://github.com/ailurus1991/ds_files/blob/master/thchs30-vocabulary.txt
no seg fault anymore, so the problem is the encoding or something?
Encoding, I doubt, but if you can investigate a bit more or share better STR (steps to reproduce) and content for us to reproduce, maybe we could fix that?
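If encoding is the suspect, one quick sanity check is to validate both files as UTF-8 before feeding them to generate_trie (a sketch; assumes iconv is available):

# iconv exits non-zero and reports the byte offset if the input is not valid UTF-8
iconv -f UTF-8 -t UTF-8 thchs30-vocabulary.txt > /dev/null && echo vocab OK
iconv -f UTF-8 -t UTF-8 thchs30-alphabet.txt > /dev/null && echo alphabet OK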
@lissyx I'm hitting the same error here. My vocabulary.txt:
我
你 我
他
我
你
我
你
他
lmplz -o 2 --text vocabulary.txt --arpa words.arpa
build_binary -T -s words.arpa lm.binary
generate_trie alphabet.txt lm.binary trie
Finally, I hit the same error:
439388 segmentation fault ~/native_client/generate_trie alphabet.txt lm.binary trie
Then I ran the command under gdb and got some debug info:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000412431 in Scorer::save_dictionary(std::string const&) ()
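For anyone hitting the same crash, a fuller backtrace makes this much easier to act on (a sketch; paths are placeholders):

# Run generate_trie under gdb; after the SIGSEGV, dump the full backtrace
gdb --args ~/native_client/generate_trie alphabet.txt lm.binary trie
(gdb) run
(gdb) bt full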
@x0day Can you share your files so we can investigate ?
@lissyx what files?
You can try with the vocabulary.txt and alphabet.txt I uploaded,
and download the native client with:
python util/taskcluster.py --branch v0.4.0-alpha.3 --target ~/native_client/
and the latest KenLM.
I would also like to have your lm.binary and trie. I'm not even able to reproduce your lm.binary:
alexandre@serveur:~/tmp/KenLM/issue1756$ ../kenlm-build/bin/lmplz -o 2 --text vocabulary.issue1756.txt --arpa words.arpa
=== 1/5 Counting and sorting n-grams ===
Reading /home/alexandre/tmp/KenLM/issue1756/vocabulary.issue1756.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 6 types 6
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:72 2:53974548480
/home/alexandre/tmp/KenLM/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 4 because we didn't observe any 1-grams with adjusted count 3; Is this small or artificial data?
Try deduplicating the input. To override this error for e.g. a class-based model, rerun with --discount_fallback
Abandon
OK, one needs --discount_fallback as well.
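With a tiny, highly repetitive input like the vocabulary above, that means something like (a sketch):

# --discount_fallback lets lmplz proceed when Kneser-Ney discounts cannot be estimated
lmplz -o 2 --discount_fallback --text vocabulary.txt --arpa words.arpa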
lmplz -o 2 --text vocabulary.txt --arpa words.arpa
build_binary -T -s words.arpa lm.binary
This is just plain wrong; please check out the documentation on how to rebuild a language model: https://github.com/mozilla/DeepSpeech/blob/master/data/lm/README.md
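For reference, in KenLM's build_binary the -T flag expects an argument (a temporary-file location), so in the command above it silently consumed -s. A minimal correct invocation might look like this (a sketch based on KenLM's usage output, not necessarily the exact flags from the DeepSpeech README; trie here selects KenLM's internal data structure and is unrelated to the decoder trie file):

# Build the binary LM using KenLM's trie data structure
build_binary trie words.arpa lm.binary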
OK, with proper arguments for build_binary I reproduce the segfault, and at the same place.
@lissyx I can't generate the trie file. You can check the lm.binary and the arpa file.
This should be fixed with the latest master. Please reopen this issue if you still see the bug with the new code.
@lissyx Hi, in the latest version, v0.4.1, I used the vocabulary and the alphabet from @ailurus1991 to test: https://github.com/ailurus1991/ds_files/blob/master/thchs30-vocabulary.txt https://github.com/ailurus1991/ds_files/blob/master/thchs30-alphabet.txt
The trie file is generated, but I don't think this file is right, because its size is too small, only 45 KB.
Is it another bug?
It seems pretty obvious that this is not a bug. When working on that I got similarly small files; I think it's fine.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hey guys,
I'm following this tutorial to use DeepSpeech on my own data (Japanese): https://discourse.mozilla.org/t/tutorial-how-i-trained-a-specific-french-model-to-control-my-robot/22830
I created the training/valid/test CSV files correctly, and I got the alphabet, vocab, and lm.binary. When I tried to create the trie, I got a segmentation fault...
I followed https://github.com/mozilla/DeepSpeech/blob/master/native_client/README.md to get the pre-built binaries with util/taskcluster.py, and used ./generate_trie alphabet.txt lm.binary trie. Did I miss something?
thanks