adamnsandle opened this issue 5 years ago
@SeanNaren would be really great if you could help out with this
@adamnsandle Which tokens do you use, how much data do you use to train the model, how do you train the KenLM model, and how do you initialize the class?
What we tried to do:
Overall, RAM consumption drops with a smaller number of sentences / shorter sentences, but stays far too high, ~20 GB with a 100 MB model
Labels used - 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ2_ * '
(the plain Russian alphabet; 2 is a special symbol for a repeated letter, * is a string-end symbol)
CTC decoder class initialization:
dcdr = CTCBeamDecoder(labels, lm_path, alpha=0.3, beta=0.4, cutoff_top_n=20, cutoff_prob=1, beam_width=100, num_processes=6, blank_id=labels.index('_'))
(we tried different num_processes and beam_width values; it did not help)
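Put together, the usage looks roughly like this (a minimal sketch assuming the ctcdecode package; the LM file name is just an example and the probability tensor is a random placeholder, only to show the call):

import torch
from ctcdecode import CTCBeamDecoder

labels = list('АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ2_ * ')
lm_path = 'web_all_norm.arpa.bin'  # binarized KenLM model

dcdr = CTCBeamDecoder(labels, model_path=lm_path, alpha=0.3, beta=0.4,
                      cutoff_top_n=20, cutoff_prob=1.0, beam_width=100,
                      num_processes=6, blank_id=labels.index('_'))

# acoustic model output: (batch, time, n_labels) softmax probabilities
probs = torch.rand(1, 200, len(labels)).softmax(dim=-1)
beam_results, beam_scores, timesteps, out_lens = dcdr.decode(probs)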
Did you guys try turning this ARPA file into a trie file? Check the examples here
Never mind, I just saw the above comment. Realistically you should always build a binary; using the raw ARPA is probably overkill.
Also, you should definitely look into pruning specific n-gram orders when building the LM, if you haven't done so already. This can be seen here
@adamnsandle Also, which KenLM command did you use to train the LM?
@SeanNaren Which command do you use for your models?
@SeanNaren
We used this command to train the model:
bin/lmplz -o 4 -S 50% -T temp/ --prune 0 30 60 130 --discount_fallback < web_all_norm.txt > web_all_norm.arpa
And to binarize it:
./build_binary -S 5G trie web_all_norm.arpa web_all_norm.arpa.bin
or
./build_binary -S 5G trie -q 8 web_all_norm.arpa web_all_norm.arpa.bin
or simply
./build_binary web_all_norm.arpa web_all_norm.arpa.bin
How big is the output trie?
2.05 GB
When we try to use a smaller model (~200 MB as a trie), the excessive RAM usage is still there (~30 GB)
Hi, I want to know what the KenLM model is based on. Word-based or character-based? Thank you very much! @adamnsandle @SeanNaren
+1 this issue. Even using the Deep Speech 1 LM binary causes massive RAM use
I believe this is caused by an internal trie being built when the model is loaded, which then stays in memory and consumes a lot of RAM. Mozilla, in their version, saves this trie to an external file and doesn't generate it on the fly.
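A quick way to confirm this (a rough sketch, assuming the ctcdecode and psutil packages, and reusing the file/label names from this thread): watch the process RSS around decoder construction, before any decoding happens.

import os
import psutil
from ctcdecode import CTCBeamDecoder

def rss_gb():
    # resident set size of the current process, in GB
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3

labels = list('АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ2_ * ')
print(f"before decoder: {rss_gb():.2f} GB")
dcdr = CTCBeamDecoder(labels, model_path='web_all_norm.arpa.bin',
                      beam_width=100, blank_id=labels.index('_'))
print(f"after decoder construction: {rss_gb():.2f} GB")
# no decode() call yet: if RSS has already jumped here, the memory goes
# into structures built while loading the LM, not into the beam search itself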
Loading model...
Traceback (most recent call last):
File "examples/demo-server.py", line 10, in
I have a similar problem to issue #137
+1 this issue.
+1. Is there a fix?
Hello! For some reason our 3 GB Russian KenLM ARPA model (binarized) uses ~50 GB of RAM during CTCBeamDecoder class initialization and estimation (beam width 100). When using the KenLM Python module with this model everything is fine! The model was trained on a big Russian corpus (37 labels).
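For reference, this is roughly what "the KenLM Python module is fine" means (a minimal sketch, assuming the kenlm and psutil packages; the file name is reused from this thread and the scored string is just a placeholder): loading the same binary directly does not show the blow-up we see with CTCBeamDecoder.

import os
import kenlm
import psutil

def rss_gb():
    # resident set size of the current process, in GB
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3

print(f"before load: {rss_gb():.2f} GB")
model = kenlm.Model("web_all_norm.arpa.bin")  # same binarized LM
print(f"after kenlm load: {rss_gb():.2f} GB")
# placeholder query, only to show the loaded model is usable
print(model.score("П Р И В Е Т", bos=True, eos=True))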