moses-smt / mgiza

A word alignment tool based on the famous GIZA++, extended to support multi-threading, resuming training, and incremental training.

Problem with building probabilistic dictionaries #21

Closed Syrkovski closed 4 years ago

Syrkovski commented 4 years ago

Hello, I tried to build probabilistic dictionaries (I need them for training a Bicleaner model), but the result looks like this:

afterwards NULL 0.0000124
pension NULL 0.0000372
truss NULL 0.0000124
birthday NULL 0.0000744
commemorate NULL 0.0000248

The entire second column is "NULL".

The command I used is:

mosesdecoder/scripts/training/train-model.perl --alignment grow-diag-final-and --root-dir bicleaner_inf/ --corpus bicleaner_inf/corpus.clean --e en --f zh --mgiza -mgiza-cpus 8 --parallel --first-step 1 --last-step 4 --external-bin-dir mgiza/mgizapp/bin/

It looks like the main error occurs in mgiza:

Merging A3.final.part tables
Executing: enchmodels/mgiza/mgizapp/bin/merge_alignment.py enchmodels/bicleaner_inf/giza.zh-en/zh-en.A3.final.part> enchmodels/bicleaner_inf/giza.zh-en/zh-en.A3.final
Traceback (most recent call last):
  File "enchmodels/mgiza/mgizapp/bin/merge_alignment.py", line 32, in <module>
    st1 = files[i].readline();
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 84: ordinal not in range(128)
Exit code: 1

After that, it prints a whole series of errors like:

Use of uninitialized value $a in scalar chomp at enchmodels/mosesdecoder/scripts/training/LexicalTranslationModel.pm line 105
Use of uninitialized value in substitution (s///) at enchmodels/mosesdecoder/scripts/training/LexicalTranslationModel.pm line 40.
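
For readers hitting the same traceback: a UnicodeDecodeError like the one above is what Python 3 raises when UTF-8 text (here, Chinese) is decoded with the ASCII codec, which can happen when a file is opened without an explicit encoding under a non-UTF-8 locale. The following is only a minimal sketch of that failure mode and one possible workaround, not the actual code of merge_alignment.py; the file name, the sample line, and the read_first_line helper are hypothetical.

```python
import os
import tempfile

# Hypothetical stand-in for one line of an alignment part file. The Chinese
# character is stored as UTF-8, and its first byte is 0xe5.
sample = "NULL ({ }) 寿 ({ 1 })\n"


def read_first_line(path, encoding):
    """Return the first line of `path`, decoded with the given encoding."""
    with open(path, encoding=encoding) as handle:
        return handle.readline()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, chosen only for illustration.
    path = os.path.join(tmp, "example.A3.final.part")
    with open(path, "wb") as handle:
        handle.write(sample.encode("utf-8"))

    # Decoding as ASCII fails on the first non-ASCII byte.
    try:
        read_first_line(path, "ascii")
        ascii_failed = False
    except UnicodeDecodeError as err:
        ascii_failed = True
        print("ascii:", err.reason)

    # Decoding explicitly as UTF-8 reads the line intact.
    utf8_line = read_first_line(path, "utf-8")
    print("utf-8:", utf8_line.strip())
```

If this is indeed the cause, passing an explicit encoding wherever the script opens its input files, or running the pipeline under a UTF-8 locale, would be the things to try first.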

Syrkovski commented 4 years ago

Solved this problem

JOHW85 commented 2 years ago

Solved this problem

It seems the best way is to recompile MGIZA. I used the instructions here: https://hovinh.github.io/blog/2016-04-29-install-mgiza-ubuntu/