Open BirgerMoell opened 2 years ago
Ah I think I forgot to write that you also need to install this library here: https://github.com/kpu/kenlm#installation
Can you give the pip install command a try and see whether it works? :-)
In case it works, it would be amazing if you could make a quick PR to update the requirements.txt and the README.md :-)
I installed it using pip install https://github.com/kpu/kenlm/archive/master.zip, and now I'm getting a new error.
My guess is that my polish.arpa file is malformed somehow, but that's tricky to check because the file is very slow to load and edit.
Here is how the beginning of the file looks. Since the error says it expects a tab, I suspect there are spaces somewhere in the file where there should be tabs.
\data\
ngram 1=86587
ngram 2=546387
ngram 3=796581
ngram 4=843999
ngram 5=850874
\1-grams:
-5.7532206 <unk> 0
0 <s> -0.06677356
0 </s> -0.06677356
Traceback (most recent call last):
  File "./eval.py", line 100, in <module>
    main(args)
  File "./eval.py", line 37, in main
    args.path_to_ngram,
  File "/home/bmoell/miniconda3/envs/wav2vec-nlp/lib/python3.7/site-packages/pyctcdecode/decoder.py", line 697, in build_ctcdecoder
    kenlm_model = None if kenlm_model_path is None else kenlm.Model(kenlm_model_path)
  File "kenlm.pyx", line 142, in kenlm.Model.__init__
OSError: Cannot read model 'polish.arpa' (lm/read_arpa.hh:51 in void lm::Read1Gram(util::FilePiece&, Voc&, Weights*, lm::PositiveProbWarn&) [with Voc = lm::ngram::ProbingVocabulary; Weights = lm::ProbBackoff] threw FormatLoadException because `f.get() != '\t''. Expected tab after probability in the 1-gram at byte 103 Byte: 103)
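Since the error points at a missing tab, one way to check the file without opening it in an editor is to stream it and flag entries that are not tab-separated. The following is only a rough sketch: the helper name `find_bad_arpa_lines` is made up for illustration, and it assumes the usual ARPA layout in which each n-gram entry is probability, tab, words, and optionally tab plus backoff weight.

```python
import re

# Matches section headers such as "\1-grams:" or "\5-grams:".
SECTION = re.compile(r"^\\(\d+)-grams:$")


def find_bad_arpa_lines(path, max_report=10):
    """Return (line_number, line) pairs for n-gram entries lacking tab separators.

    Hypothetical helper, not part of kenlm or pyctcdecode.
    """
    bad = []
    in_ngrams = False
    with open(path, "r", encoding="utf-8") as f:
        # Iterate lazily so even a very large 5-gram file is never fully loaded.
        for number, raw in enumerate(f, start=1):
            line = raw.rstrip("\r\n")
            if SECTION.match(line):
                in_ngrams = True
                continue
            if line == "\\end\\":
                break
            if not in_ngrams or not line.strip():
                continue
            # Assumed entry layout: probability<TAB>words[<TAB>backoff]
            fields = line.split("\t")
            if len(fields) not in (2, 3):
                bad.append((number, line))
                if len(bad) >= max_report:
                    break
    return bad
```

Calling `find_bad_arpa_lines("polish.arpa")` on a file whose 1-gram entries use spaces should report the first few offending lines together with their line numbers; an empty list would suggest the tabs are intact and the problem lies elsewhere.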
Yeah, this looks like an issue with the .arpa file. A good debugging strategy would be to:
</s>
-> quicker debugging cycle -> find bug -> correct -> apply same to large 5-gram. I don't think this is related to the code here.
@patrickvonplaten I believe the FormatLoadException occurs when the file is changed by hand in an editor (vim, nano, even IDEs) while following the instructions. Editors commonly mishandle '\t', '\n', and similar whitespace characters.
The easiest way to apply your instructions without breaking the formatting would be:
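# Stream the file line by line so no editor ever touches the tab separators.
with open('lang.arpa', 'r') as original, open('lang_fixed.arpa', 'w') as fixed:
    for line in original:
        if line == 'ngram 1=634704\n':
            # One more unigram, because the </s> entry is added below.
            fixed.write('ngram 1=634705\n')
        elif line == '0\t<s>\t-0.07692495\n':
            # Keep the <s> entry and add a matching </s> entry right after it.
            fixed.write('0\t<s>\t-0.07692495\n')
            fixed.write('0\t</s>\t-0.07692495\n')
        else:
            fixed.write(line)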
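The unigram count and the backoff value in the snippet above are specific to one file. A variant that reads them from the file instead might look like the sketch below. The function name `add_eos_token` is hypothetical, and the sketch assumes the input has a tab-separated `<s>` unigram and does not already contain a `</s>` entry (otherwise the entry would be duplicated).

```python
def add_eos_token(src_path, dst_path):
    """Copy an ARPA file, adding a </s> unigram and incrementing the unigram count.

    Hypothetical helper; assumes the source file has no </s> entry yet.
    Returns True if the </s> line was written.
    """
    added = False
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not added and line.startswith("ngram 1="):
                # Header line: bump the unigram count by one.
                count = int(line.strip().split("=")[1])
                dst.write(f"ngram 1={count + 1}\n")
            elif not added and "\t<s>\t" in line:
                # Unigram entry for <s>: keep it and mirror it for </s>.
                dst.write(line)
                dst.write(line.replace("<s>", "</s>"))
                added = True
            else:
                dst.write(line)
    return added
```

For example, `add_eos_token("lang.arpa", "lang_fixed.arpa")` would write the corrected copy next to the original. A return value of `False` means no tab-separated `<s>` entry was found; in that case the unigram count in the output has still been incremented, so the output should be discarded and the input inspected.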
I followed the tutorial, installed kenlm inside the subfolder kenlm, created the polish.arpa file, and updated it.
I get the following error when running ./eval.py --language polish --path_to_ngram polish.arpa. I'm running Python 3.7 using conda on an Ubuntu machine with a GPU. I manually installed pyctcdecode since it wasn't included in the requirements file.