patrickvonplaten / Wav2Vec2_PyCTCDecode

Small repo describing how to use Hugging Face's Wav2Vec2 with PyCTCDecode

Error when running eval script #1

Open BirgerMoell opened 2 years ago

BirgerMoell commented 2 years ago

I followed the tutorial: I installed kenlm inside the subfolder kenlm, created the polish.arpa file, and updated it.

Running ./eval.py --language polish --path_to_ngram polish.arpa gives the error below. I'm running Python 3.7 using conda on an Ubuntu machine with a GPU. I manually installed pyctcdecode since it wasn't included in the requirements file.

./eval.py --language polish --path_to_ngram polish.arpa
Traceback (most recent call last):
  File "./eval.py", line 8, in <module>
    from pyctcdecode import build_ctcdecoder
  File "/home/bmoell/miniconda3/envs/wav2vec-nlp/lib/python3.7/site-packages/pyctcdecode/__init__.py", line 3, in <module>
    from .decoder import BeamSearchDecoderCTC, build_ctcdecoder  # noqa
  File "/home/bmoell/miniconda3/envs/wav2vec-nlp/lib/python3.7/site-packages/pyctcdecode/decoder.py", line 26, in <module>
    from .language_model import (
  File "/home/bmoell/miniconda3/envs/wav2vec-nlp/lib/python3.7/site-packages/pyctcdecode/language_model.py", line 55, in <module>
    def _prepare_unigram_set(unigrams: Collection[str], kenlm_model: kenlm.Model) -> Set[str]:
AttributeError: module 'kenlm' has no attribute 'Model'
patrickvonplaten commented 2 years ago

Ah I think I forgot to write that you also need to install this library here: https://github.com/kpu/kenlm#installation

Can you give the pip install command a try and see whether it works? :-)
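A quick way to see what import kenlm actually resolves to, before and after the pip install, is sketched below. It rests on an assumption that this thread does not confirm: when no Python kenlm package is installed, a local kenlm build folder in the working directory can be picked up as an empty namespace package, which would explain the missing Model attribute. The helper name describe_kenlm_install is hypothetical.

```python
import importlib
import importlib.util


def describe_kenlm_install(module_name="kenlm"):
    """Return a short diagnostic about what `import <module_name>` resolves to."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return f"{module_name}: not installed"
    module = importlib.import_module(module_name)
    # Namespace packages (plain folders) have no __file__, only a __path__.
    origin = getattr(module, "__file__", None) or list(getattr(module, "__path__", []))
    if hasattr(module, "Model"):
        return f"{module_name}: OK, loaded from {origin}"
    return f"{module_name}: no 'Model' attribute, loaded from {origin}"


print(describe_kenlm_install())
```

If the reported location is a folder inside the repo rather than site-packages, the import is probably hitting the build directory and not the Python bindings.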

patrickvonplaten commented 2 years ago

In case it works it would be amazing if you could make a quick PR to update the requirements.txt and the README.md :-)

BirgerMoell commented 2 years ago

I installed it using pip install https://github.com/kpu/kenlm/archive/master.zip and I'm now getting a new error.

My guess is that my polish.arpa file is malformed somehow, but it's quite tricky to check since the file is very slow to load and edit.

Here is how the beginning of the file looks, followed by the traceback. Since the error says it's expecting a tab, I suspect there might be spaces in the file where there should be tabs.

\data\
ngram 1=86587
ngram 2=546387
ngram 3=796581
ngram 4=843999
ngram 5=850874

\1-grams:
-5.7532206      <unk>   0
0       <s>     -0.06677356
0       </s>     -0.06677356
Traceback (most recent call last):
  File "./eval.py", line 100, in <module>
    main(args)
  File "./eval.py", line 37, in main
    args.path_to_ngram,
  File "/home/bmoell/miniconda3/envs/wav2vec-nlp/lib/python3.7/site-packages/pyctcdecode/decoder.py", line 697, in build_ctcdecoder
    kenlm_model = None if kenlm_model_path is None else kenlm.Model(kenlm_model_path)
  File "kenlm.pyx", line 142, in kenlm.Model.__init__
OSError: Cannot read model 'polish.arpa' (lm/read_arpa.hh:51 in void lm::Read1Gram(util::FilePiece&, Voc&, Weights*, lm::PositiveProbWarn&) [with Voc = lm::ngram::ProbingVocabulary; Weights = lm::ProbBackoff] threw FormatLoadException because `f.get() != '\t''. Expected tab after probability in the 1-gram at byte 103 Byte: 103)
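One way to test the tabs-versus-spaces suspicion without opening the file in an editor is to stream it and flag 1-gram entries whose fields are not tab-separated. This is only a sketch: the helper name find_untabbed_unigrams is hypothetical, and it assumes each 1-gram entry has the form probability, tab, word, optionally followed by tab and backoff weight.

```python
import os


def find_untabbed_unigrams(arpa_path, limit=5):
    """Return (line_number, repr(line)) for 1-gram entries lacking tab-separated fields."""
    problems = []
    in_unigrams = False
    with open(arpa_path, "r", encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            stripped = line.rstrip("\n")
            if stripped.startswith("\\"):
                # Section markers such as \data\, \1-grams:, \2-grams:, \end\
                if in_unigrams:
                    break  # the 1-gram section is over
                in_unigrams = stripped == "\\1-grams:"
                continue
            if not in_unigrams or not stripped:
                continue
            # Expect 2 fields (prob, word) or 3 fields (prob, word, backoff).
            if len(stripped.split("\t")) not in (2, 3):
                problems.append((number, repr(line)))
                if len(problems) >= limit:
                    break
    return problems


if __name__ == "__main__" and os.path.exists("polish.arpa"):
    for number, text in find_untabbed_unigrams("polish.arpa"):
        print(number, text)
```

Printing repr(line) makes spaces and tabs distinguishable, since a tab shows up as \t.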
patrickvonplaten commented 2 years ago

yeah this looks like an issue with the .arpa file - a good debugging strategy would be to:

Don't think this is related to the code here

deepconsc commented 2 years ago

@patrickvonplaten I believe the FormatLoadException occurs when the file is edited (while following the instructions) in an editor (vim, nano, even IDEs). Editors commonly mishandle '\t', '\n', and similar whitespace characters.

The easiest way to apply your instructions without breaking the formatting would be:

  1. Read the arpa file via Python and find the 2 lines that need to be changed.
  2. Write a new arpa file, changing those lines while handling the formatting correctly, as the script below shows.
  3. Load it & have fun.
    
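    # Read every line as-is so the tabs and newlines are preserved.
    with open('lang.arpa', 'r') as f:
        original = f.readlines()

    with open('lang_fixed.arpa', 'w') as fixed:
        for line in original:
            if line == 'ngram 1=634704\n':
                # One more unigram, because </s> is added below.
                fixed.write('ngram 1=634705\n')
            elif line == '0\t<s>\t-0.07692495\n':
                # Keep the <s> entry and add a matching </s> entry after it.
                fixed.write('0\t<s>\t-0.07692495\n')
                fixed.write('0\t</s>\t-0.07692495\n')
            else:
                fixed.write(line)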
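The script above hardcodes the 1-gram count and the backoff value of one particular file. A more general variant might look like the sketch below, which reads the count from the header and copies whatever the <s> line contains. The function name add_eos_token is hypothetical, and the sketch assumes the input has a tab-separated <s> unigram and does not already contain a </s> entry.

```python
def add_eos_token(src_path, dst_path):
    """Copy an ARPA file, adding a </s> unigram after <s> and bumping the 1-gram count.

    Returns True if the </s> line was written. If it returns False, no <s>
    unigram was found and the output (with its bumped count) should be discarded.
    """
    added = False
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if not added and line.startswith("ngram 1="):
                # Header line: increase the unigram count by one.
                count = int(line.strip().split("=")[1])
                dst.write(f"ngram 1={count + 1}\n")
            elif not added and len(fields) >= 2 and fields[1] == "<s>":
                # Write <s> unchanged, then the same line with </s> instead.
                dst.write(line)
                dst.write(line.replace("<s>", "</s>"))
                added = True
            else:
                dst.write(line)
    return added
```

Because the file is streamed line by line and only written through Python, no editor touches the tabs, and large files do not need to fit in memory. A call such as add_eos_token('lang.arpa', 'lang_fixed.arpa') would mirror the file names used above.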