mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

segmentation fault on generate_trie #1756

Closed cailurus closed 5 years ago

cailurus commented 6 years ago

Hey guys,

I'm following this tutorial to use DeepSpeech on my own data (Japanese): https://discourse.mozilla.org/t/tutorial-how-i-trained-a-specific-french-model-to-control-my-robot/22830

I created the training/valid/test CSV files correctly, and I got the alphabet, vocab, and lm.binary. When I tried to create the trie, I got a segmentation fault...

I followed this https://github.com/mozilla/DeepSpeech/blob/master/native_client/README.md to get the pre-built binaries with util/taskcluster.py, and used ./generate_trie alphabet.txt lm.binary trie. Did I miss something?

thanks

MuruganR96 commented 6 years ago

Sir @ailurus1991, I think the trie file was not generated correctly; it is an empty file. Please check once.

python3 util/taskcluster.py --branch "anything" --target new_native_client/
new_native_client/generate_trie alphabet.txt lm.binary vocab.txt trie

This is the command I previously followed.

cailurus commented 6 years ago

@MuruganR96 Hey, thanks for your reply. I noticed your generate_trie command has four args?

Hmm, it seems like the usage is: Usage: ./dest/generate_trie <alphabet> <lm_model> <trie_path>? I didn't find a vocab.txt arg...

BTW, I've tested this on a small English corpus and still got a segmentation fault... :(

MuruganR96 commented 6 years ago

My suggestion, @ailurus1991 sir: did you create your lm.arpa from your vocab.txt, and then convert lm.arpa to the binary format lm.binary? If you give generate_trie this lm.binary built from vocab.txt, it should generate a correct trie file. Otherwise, try data/lm/README.md; it is also very clear documentation. (Or it may be some memory-consumption error.)
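
For reference, a minimal sketch of the full pipeline being described here, assuming KenLM binaries under kenlm/build/bin and a generate_trie fetched with util/taskcluster.py (the paths and the -o order are illustrative, not taken from this thread):

# 1. Build an ARPA language model from the vocabulary (KenLM)
kenlm/build/bin/lmplz -o 3 --text vocab.txt --arpa lm.arpa
# 2. Convert the ARPA file to KenLM's binary format
kenlm/build/bin/build_binary lm.arpa lm.binary
# 3. Build the trie from the alphabet and the binary LM (DeepSpeech native client)
./generate_trie alphabet.txt lm.binary trie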

cailurus commented 6 years ago

@MuruganR96 Yup, I followed the tutorial. I created the ARPA file with:

./lmplz --text lm_corpus.txt --arpa words.arpa --o 4

and converted the ARPA to binary with:

./build_binary -T -s words.arpa lm.binary

I didn't see any error output, so I think I got the right language-model binary file...

lissyx commented 6 years ago

Did I miss something?

Yes, share more information with us. Right now there's nothing we can do except tell you it works for us. Share your system details, a small test case, etc. Even just console output...

JRMeyer commented 6 years ago

@ailurus1991 - could be similar to #1745 ?

cailurus commented 6 years ago

I guess the problem is the binary file generated by KenLM. @lissyx

The alphabet file (generated from the vocab): https://github.com/ailurus1991/ds_files/blob/master/thchs30-alphabet.txt
The vocab: https://github.com/ailurus1991/ds_files/blob/master/thchs30-vocabulary.txt

Then I used KenLM to generate the ARPA and LM binary files:

./kenlm/build/bin/lmplz -o 3 --text zh/thchs30-vocabulary.txt --arpa zh/thu.arpa
=== 1/5 Counting and sorting n-grams ===
Reading /home/guokr/lm/zh/thchs30-vocabulary.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 32625 types 2886
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:34632 2:18771410944 3:35196395520
Statistics:
1 2886 D1=0.491463 D2=1.109 D3+=1.52561
2 25348 D1=0.840792 D2=1.1888 D3+=1.54784
3 31456 D1=0.954916 D2=1.63496 D3+=2.04508
Memory estimate for binary LM:
type      kB
probing 1220 assuming -p 1.5
probing 1380 assuming -r models -p 1.5
trie     511 without quantization
trie     280 assuming -q 8 -b 8 quantization
trie     487 assuming -a 22 array pointer compression
trie     256 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:34632 2:405568 3:629120
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:34632 2:405568 3:629120
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz      VmPeak:52884572 kB      VmRSS:7136 kB   RSSMax:12186844 kB      user:0.769657   sys:2.89022     CPU:3.65991     real:3.64151
./kenlm/build/bin/build_binary -T -s zh/thu.arpa zh/thu.bin
Reading zh/thu.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS

Then I used native_client/generate_trie to generate the trie:

Get the pre-built binary:

python3 util/taskcluster.py --arch gpu --target ./native_client

then run:

./native_client/generate_trie ~/lm/zh/thchs30-alphabet.txt ~/lm/zh/thu.bin ~/lm/zh/trie

Then I got:

[1] 27569 segmentation fault ./native_client/generate_trie ~/lm/zh/thu.bin ~/lm/zh/trie

BTW, I tried an English Bible corpus and everything worked well.

lissyx commented 6 years ago

I guess the problem is the binary file generated by KenLM. @lissyx

Sure, but with no stack trace there's nothing we can work on so far. Can you share more information?

cailurus commented 6 years ago

@lissyx I found an interesting thing. I added an English word to my Chinese vocab file, like this: https://github.com/ailurus1991/ds_files/blob/master/thchs30-vocabulary.txt

No seg fault anymore, so is the problem the encoding or something?
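
One quick way to test the encoding hypothesis (a sketch using standard tools, not something from this thread) is to verify that the vocabulary file is valid UTF-8 before building the LM:

file vocabulary.txt                       # should report UTF-8 (or ASCII) text
iconv -f UTF-8 -t UTF-8 vocabulary.txt > /dev/null && echo "valid UTF-8"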

lissyx commented 5 years ago

No seg fault anymore, so is the problem the encoding or something?

Encoding? I doubt it, but if you can investigate a bit more, or share better STR (steps to reproduce) and content for us to reproduce, maybe we could fix it?

x0day commented 5 years ago

@lissyx I'm hitting the same error here.

  1. vocabulary.txt
    我
    你 我
    他
    我
    你
  2. alphabet2.txt
    我
    你
    他
lmplz -o 2 --text vocabulary.txt --arpa words.arpa
build_binary -T -s words.arpa  lm.binary
generate_trie alphabet.txt lm.binary trie

Finally, I hit the same error:

439388 segmentation fault  ~/native_client/generate_trie alphabet.txt lm.binary trie

Then I ran the command under gdb and got some debug info:

Program received signal SIGSEGV, Segmentation fault.
0x0000000000412431 in Scorer::save_dictionary(std::string const&) ()
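
To get the fuller backtrace lissyx asked for earlier, a minimal gdb session would look like this (standard gdb commands; file names as in x0day's repro above):

gdb --args ~/native_client/generate_trie alphabet.txt lm.binary trie
(gdb) run
(gdb) bt    # print the full stack trace after the SIGSEGV
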
lissyx commented 5 years ago

@x0day Can you share your files so we can investigate ?

x0day commented 5 years ago

@lissyx What files?

You can try with the vocabulary.txt and alphabet.txt I uploaded, download the native client with python util/taskcluster.py --branch v0.4.0-alpha.3 --target ~/native_client/, and use the latest KenLM.

alphabet.txt vocabulary.txt

lissyx commented 5 years ago

@lissyx What files?

You can try with the vocabulary.txt and alphabet.txt I uploaded, download the native client with python util/taskcluster.py --branch v0.4.0-alpha.3 --target ~/native_client/, and use the latest KenLM.

alphabet.txt vocabulary.txt

I would also like to have your lm.binary and trie.

lissyx commented 5 years ago

I'm not even able to reproduce your lm.binary:

alexandre@serveur:~/tmp/KenLM/issue1756$ ../kenlm-build/bin/lmplz -o 2 --text vocabulary.issue1756.txt --arpa words.arpa
=== 1/5 Counting and sorting n-grams ===
Reading /home/alexandre/tmp/KenLM/issue1756/vocabulary.issue1756.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 6 types 6
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:72 2:53974548480
/home/alexandre/tmp/KenLM/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 4 because we didn't observe any 1-grams with adjusted count 3; Is this small or artificial data?
Try deduplicating the input.  To override this error for e.g. a class-based model, rerun with --discount_fallback

Aborted

OK, one needs --discount_fallback as well.
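
Following the hint in KenLM's own error message, the lmplz invocation for this tiny, duplicated vocabulary would become (a sketch; --discount_fallback is the escape hatch the error message suggests):

lmplz -o 2 --discount_fallback --text vocabulary.txt --arpa words.arpa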

lissyx commented 5 years ago

lmplz -o 2 --text vocabulary.txt --arpa words.arpa
build_binary -T -s words.arpa lm.binary

This is just plain wrong; please check out the documentation on how to rebuild a language model: https://github.com/mozilla/DeepSpeech/blob/master/data/lm/README.md

lissyx commented 5 years ago

lmplz -o 2 --text vocabulary.txt --arpa words.arpa
build_binary -T -s words.arpa lm.binary

This is just plain wrong; please check out the documentation on how to rebuild a language model: https://github.com/mozilla/DeepSpeech/blob/master/data/lm/README.md

OK, with the proper arguments for build_binary I reproduce the segfault, at the same place.
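
For illustration, "proper arguments" presumably means following data/lm/README.md rather than the bare -T -s above; a minimal valid invocation that builds a trie-structured binary is (flags illustrative, the README is authoritative):

build_binary trie words.arpa lm.binary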

x0day commented 5 years ago

@lissyx I can't generate the trie file. You can check the lm.binary and the ARPA file.

arpa_and_binary.tar.gz

reuben commented 5 years ago

This should be fixed with the latest master. Please reopen this issue if you still see the bug with the new code.

m13225311263 commented 5 years ago

@lissyx Hi, in the latest version, v0.4.1, I used the vocabulary and the alphabet from @ailurus1991 to test: https://github.com/ailurus1991/ds_files/blob/master/thchs30-vocabulary.txt https://github.com/ailurus1991/ds_files/blob/master/thchs30-alphabet.txt

The trie file is generated, but I don't think this file is right, because it is too small: only 45 KB.

Is it another bug?

lissyx commented 5 years ago

@lissyx Hi, in the latest version, v0.4.1, I used the vocabulary and the alphabet from @ailurus1991 to test: https://github.com/ailurus1991/ds_files/blob/master/thchs30-vocabulary.txt https://github.com/ailurus1991/ds_files/blob/master/thchs30-alphabet.txt

The trie file is generated, but I don't think this file is right, because it is too small: only 45 KB.

Is it another bug?

It seems pretty obvious that this is not a bug. When working on that, I got similarly small files; I think it's fine.

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.