kmario23 / KenLM-training

Training an n-gram based Language Model using KenLM toolkit for Deep Speech 2

How to generate trie file? #4

Open EuphoriaCelestial opened 3 years ago

EuphoriaCelestial commented 3 years ago

Hi, I have successfully run all the steps in the README and now have bible.arpa and bible.binary, but there is no trie file. How can I generate the trie? I can't find any tutorial about this.

kmario23 commented 3 years ago

Hey @EuphoriaCelestial, trie is a data structure that's used when binarizing the model. Please have a look here for more info: kenlm/data-structures.

So, just using the trie switch should solve the issue.
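For illustration, here is a minimal sketch of what "using the trie switch" could look like when the call is assembled from Python. The helper name `build_trie_command` and the default path `kenlm/bin/build_binary` are assumptions made for this example; `trie` is the data-structure type passed to `build_binary` before the input file, and `-T` (temporary file prefix) and `-S` (memory used while building) are optional.

```python
import shlex


def build_trie_command(arpa_path, binary_path,
                       build_binary="kenlm/bin/build_binary",
                       tmp_prefix=None, memory=None):
    """Return the argument list for converting an ARPA file to a trie binary.

    The helper and its default path are hypothetical; adjust them to your setup.
    """
    cmd = [str(build_binary)]
    if tmp_prefix is not None:
        cmd += ["-T", str(tmp_prefix)]  # prefix for temporary files
    if memory is not None:
        cmd += ["-S", str(memory)]      # memory budget while building, e.g. "1G"
    # "trie" selects the data structure; it comes before the input ARPA file.
    cmd += ["trie", str(arpa_path), str(binary_path)]
    return cmd


cmd = build_trie_command("bible.arpa", "bible.binary",
                         tmp_prefix="/tmp/trie", memory="1G")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the path to `build_binary` matches the local installation.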

EuphoriaCelestial commented 3 years ago

> Hey @EuphoriaCelestial, trie is a data structure that's used when binarizing the model. Please have a look here for more info: kenlm/data-structures.
>
> So, just using the trie switch should solve the issue.

I have tried the command `kenlm/bin/build_binary -T /tmp/trie -S 1G trie bible.arpa bible.binary`, but I get this error every time:

Reading bible.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Segmentation fault (core dumped)
kmario23 commented 3 years ago

This seems to be a recurring issue. Cf. kenlm/issues/248, /letter-based-language-model/33986

Some suggestions:
* there's a [discourse forum for DeepSpeech related issues](https://discourse.mozilla.org/c/mozilla-voice-stt/247) to get help from.
* recheck the (correct installation of all) dependencies. Or reinstall kenlm. Boost libs might cause issues.
* Segmentation fault (core dumped) is a C/C++ issue. Seems to me that there's something wrong with the `.arpa` file.

EuphoriaCelestial commented 3 years ago

> This seems to be a recurring issue. Cf. kenlm/issues/248, /letter-based-language-model/33986
>
> Some suggestions:
>
> * there's a [discourse forum for DeepSpeech related issues](https://discourse.mozilla.org/c/mozilla-voice-stt/247) to get help from.
>
> * recheck the (correct installation of all) dependencies. Or reinstall kenlm. Boost libs might cause issues.
>
> * Segmentation fault (core dumped) is a C/C++ issue. Seems to me that there's something wrong with the `.arpa` file.

I have tried a clean install on another machine with better specs (i7, 32 GB RAM, 2080 Ti) but still got the same error. The `.arpa` file seems fine, I think, because I can use it to score sentences normally; it gives the correct score for the example in the README.
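As a further, independent check on the `.arpa` file, one could compare the n-gram counts declared in its header with the entries actually listed in each section. The sketch below assumes the usual ARPA layout (a `\data\` header with `ngram N=count` lines, one `\N-grams:` section per order, and a closing `\end\`). The function name `check_arpa_counts` and the tiny sample are made up for illustration. A mismatch would only point at a malformed file; whether that is related to the segmentation fault is not established.

```python
import re
from collections import Counter


def check_arpa_counts(lines):
    """Compare declared n-gram counts with the entries found per section.

    Returns {order: (declared, found)} for every order seen in either place.
    """
    declared = {}
    found = Counter()
    section = None  # "data", an integer n-gram order, or None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        section_header = re.fullmatch(r"\\(\d+)-grams:", line)
        if line == "\\data\\":
            section = "data"
        elif line == "\\end\\":
            section = None
        elif section_header:
            section = int(section_header.group(1))
        elif section == "data":
            count_line = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
            if count_line:
                declared[int(count_line.group(1))] = int(count_line.group(2))
        elif isinstance(section, int):
            found[section] += 1  # one n-gram entry per non-empty line
    orders = sorted(set(declared) | set(found))
    return {n: (declared.get(n, 0), found.get(n, 0)) for n in orders}


# Tiny made-up ARPA file used only to demonstrate the check.
sample = """\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-1.0\t</s>
-99\t<s>\t-0.5
-0.7\tthe\t-0.3

\\2-grams:
-0.4\t<s> the
-0.2\tthe </s>

\\end\\
""".splitlines()

report = check_arpa_counts(sample)
mismatches = {n: pair for n, pair in report.items() if pair[0] != pair[1]}
print(report, mismatches)
```

For a real model, the same function can be fed an open file handle, for example the result of `open("bible.arpa", encoding="utf-8")`, since it only iterates over lines.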