hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

Encoding problem when calling from python script in Windows #28

Closed: mlforcada closed this issue 5 years ago

mlforcada commented 5 years ago

Hi, @alvations. A student of mine and I are using the truecaser from a Python 3 script on Windows. The script makes sure that all files are opened in UTF-8 by redefining open as follows:

open = functools.partial(open, encoding='utf8') 

However, when the script executes this excerpt of code:

    for suffix in [args.l1, args.l2] :
        mtc = MosesTruecaser()
        mtc.train([line.split() for line in open("train."+changes_applied+suffix)], save_to="truecasemodel."+suffix)
        for prefix in ["train.", "dev.", "test."] :
            with open(prefix+changes_applied+"true."+suffix,"w") as outfile :
                [outfile.write(mtc.truecase(line, return_str=True)+"\n") for line in open(prefix+changes_applied+suffix)]

The error occurs in the line

mtc.train([line.split() for line in open("train."+changes_applied+suffix)], save_to="truecasemodel."+suffix)

And the traceback is

(prueba) C:\Users\Guest\nmt-for-translators\globalvoices\data>python ../../code/prepare.py GlobalVoices.es-fr fr es 130000 1500 1500 10000 --tokenize --truecase --bpe
Ficheros 'tokenizados'
Traceback (most recent call last):
  File "../../code/prepare.py", line 132, in <module>
    mtc.train([line.split() for line in open("train."+changes_applied+suffix)], save_to="truecasemodel."+suffix)
  File "C:\Users\Guest\AppData\Local\Programs\Python\Python36\lib\site-packages\sacremoses\truecase.py", line 142, in train
    self.model = self._train(documents, save_to, possibly_use_first_token, processes, progress_bar=progress_bar)
  File "C:\Users\Guest\AppData\Local\Programs\Python\Python36\lib\site-packages\sacremoses\truecase.py", line 132, in _train
    self._save_model_from_casing(casing, save_to)
  File "C:\Users\Guest\AppData\Local\Programs\Python\Python36\lib\site-packages\sacremoses\truecase.py", line 324, in _save_model_from_casing
    c
  File "C:\Users\Guest\prueba\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u011f' in position 4: character maps to <undefined>

We found that '\u011f' is the character 'ğ' in 'Erdoğan', at position 4 in the word. Clearly, the error occurs when truecase.py prints to the model file:

print(' '.join(tokens_counts), end='\n', file=fout)

The file fout is opened as follows in truecase.py

with open(filename, 'w') as fout:

We have tried setting PYTHONIOENCODING before calling the script, to no avail.
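For context, PYTHONIOENCODING only overrides the encoding of stdin, stdout and stderr; a file opened with open(filename, 'w') and no encoding argument still uses the locale's preferred encoding, which is cp1252 on many Windows installations. The following is a minimal sketch that reproduces the failure on any platform by requesting cp1252 explicitly, and then writes the same word with UTF-8. The file name and the temporary directory are assumptions chosen for illustration, not paths used by sacremoses.

```python
import os
import tempfile

word = "Erdoğan"  # contains 'ğ' (U+011F), which cp1252 cannot represent

# Hypothetical output path, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "truecasemodel.demo")

error_message = None
try:
    # Explicit cp1252 mimics what open(path, "w") picks on such a Windows setup.
    with open(path, "w", encoding="cp1252") as fout:
        print(word, file=fout)
except UnicodeEncodeError as err:
    error_message = str(err)
    print(error_message)

# Passing the encoding explicitly makes the write independent of the locale.
with open(path, "w", encoding="utf-8") as fout:
    print(word, file=fout)

with open(path, encoding="utf-8") as fin:
    round_trip = fin.read().strip()

print(round_trip)
```

The first write should fail with the same 'charmap' error as in the traceback above, while the second succeeds and the word reads back unchanged.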

Our python is:

Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

and it was installed from this page. It also fails with 3.6.3 on a different machine. We are using it inside a virtualenv. We can't seem to find a way to solve this without modifying your code. Thanks a million for your help.
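A likely reason the redefinition of open does not help here: assigning to open at the top of a script creates a global in that script's module only. Code in another module, such as truecase.py, still resolves open to the built-in. The sketch below illustrates this with two stand-in modules built in memory; the module names script and library and the helper make_module are hypothetical and exist only for this demonstration.

```python
import builtins
import types


def make_module(name, source):
    """Build an in-memory module from source text (illustration only)."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    return module


# Stand-in for the calling script: it rebinds `open` at module level.
script = make_module(
    "script",
    "import functools\n"
    "open = functools.partial(open, encoding='utf8')\n",
)

# Stand-in for a third-party module that simply calls `open`.
library = make_module(
    "library",
    "def current_open():\n"
    "    return open\n",
)

# The script sees its own partial; the library still sees the built-in.
print(script.open is builtins.open)             # False
print(library.current_open() is builtins.open)  # True
```

So the partial affects only calls to open written in the script itself, which is consistent with the error surfacing inside the library's own open call.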

alvations commented 5 years ago

Yes, I also got that from my students using Windows. It seems to be fixed in Python 3.6 (https://stackoverflow.com/a/32176732/610569), but from what I see there are still some issues. I would suggest adding the encoding as a parameter to the different functions, with utf-8 as the default.
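A minimal sketch of that suggestion, assuming a simplified model layout: the function name save_casing_model, its signature, and the "form (count)" line format are hypothetical and are not taken from the sacremoses source. The point is only that the encoding is an explicit parameter defaulting to utf-8, so the result no longer depends on the platform locale.

```python
import os
import tempfile


def save_casing_model(casing, filename, encoding="utf-8"):
    """Write one line per token: each surface form with its count.

    Hypothetical helper for illustration, not the sacremoses implementation.
    """
    with open(filename, "w", encoding=encoding) as fout:
        for forms in casing.values():
            tokens_counts = [f"{form} ({count})" for form, count in forms.items()]
            print(" ".join(tokens_counts), end="\n", file=fout)


# Toy casing statistics keyed by the lowercased token.
casing = {"erdoğan": {"Erdoğan": 3, "erdoğan": 1}}

path = os.path.join(tempfile.mkdtemp(), "truecasemodel.demo")
save_casing_model(casing, path)  # uses the utf-8 default

with open(path, encoding="utf-8") as fin:
    lines = fin.read().splitlines()

print(lines)
```

A caller who really needs a different encoding can still pass it explicitly, while the default behaves the same on Windows, Linux and macOS.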

Patching in a while...

alvations commented 5 years ago

@mlforcada Please try upgrading with pip install -U "sacremoses>=0.0.8" (the quotes keep the shell from treating > as a redirection). The Windows version should work fine too; I've added an .appveyor.yml test on Windows systems (just in case).

mlforcada commented 5 years ago

Thanks a million, @alvations! It works like a charm. I'll check what you did and learn from it!

alvations commented 5 years ago

Great that it works now! Thanks @mlforcada for raising the issue =)