rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

UnicodeDecodeError: fairseq-interactive example #86

Closed (isVoid closed this 4 years ago)

isVoid commented 4 years ago

I ran fairseq-interactive from fairseq. The environment is Windows 10 (build 18363), PowerShell 5.1.18362.628, Python 3.7.5.

fairseq-interactive failed to launch with the following traceback:

Traceback (most recent call last):
  File "C:\Users\micha\ptenv\Scripts\fairseq-interactive-script.py", line 11, in <module>
    load_entry_point('fairseq==0.9.0', 'console_scripts', 'fairseq-interactive')()
  File "C:\Users\micha\ptenv\lib\site-packages\fairseq_cli\interactive.py", line 190, in cli_main
    main(args)
  File "C:\Users\micha\ptenv\lib\site-packages\fairseq_cli\interactive.py", line 105, in main
    bpe = encoders.build_bpe(args)
  File "C:\Users\micha\ptenv\lib\site-packages\fairseq\registry.py", line 41, in build_x
    return builder(args, *extra_args, **extra_kwargs)
  File "C:\Users\micha\ptenv\lib\site-packages\fairseq\data\encoders\subword_nmt_bpe.py", line 39, in __init__
    bpe_args.glossaries,
  File "C:\Users\micha\ptenv\lib\site-packages\subword_nmt\apply_bpe.py", line 47, in __init__
    self.bpe_codes = [tuple(item.strip('\r\n ').split(' ')) for (n, item) in enumerate(codes) if (n < merges or merges == -1)]
  File "C:\Users\micha\ptenv\lib\site-packages\subword_nmt\apply_bpe.py", line 47, in <listcomp>
    self.bpe_codes = [tuple(item.strip('\r\n ').split(' ')) for (n, item) in enumerate(codes) if (n < merges or merges == -1)]
  File "c:\users\micha\appdata\local\programs\python\python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

Further investigation showed that, in my environment, the argparse setup in apply_bpe.create_parser falls back to the "cp1252" encoding when no encoding is passed to argparse.FileType(). Adding encoding='utf-8' to the FileType() call on the --codes argument fixes the problem for now, but it feels very hacky.
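For reference, here is a minimal sketch of that workaround. It is not the actual subword-nmt source: the trimmed-down create_parser and the sample codes file are hypothetical stand-ins, and only the encoding='utf-8' argument to argparse.FileType is the point.

```python
import argparse
import os
import tempfile


def create_parser():
    """Hypothetical, trimmed-down parser; only the --codes argument is shown."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--codes", "-c",
        # Passing encoding explicitly stops argparse from falling back to the
        # locale default (cp1252 in the environment described above).
        type=argparse.FileType("r", encoding="utf-8"),
        required=True,
    )
    return parser


# Write a small made-up codes file containing a non-ASCII character.
fd, path = tempfile.mkstemp(suffix=".codes")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("\u00fc b\nc d\n")

args = create_parser().parse_args(["--codes", path])
encoding_used = args.codes.encoding
# Same splitting logic as the line shown in the traceback.
pairs = [tuple(item.strip("\r\n ").split(" ")) for item in args.codes]
args.codes.close()
os.remove(path)

print(encoding_used)
```

With the explicit encoding, the file is decoded as UTF-8 regardless of the console or locale settings.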

rsennrich commented 4 years ago

hm, I cannot reproduce this (on Linux), even when changing my locale to something that isn't UTF-8.

As a quick note: your powershell uses cp1252 as default encoding, but you're trying to work with UTF-8 encoded files. You can probably work around the problem by changing the encoding in your powershell, or upgrading to powershell 6, which uses UTF-8 by default.
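To see which encoding Python itself picks up from the environment, a small diagnostic sketch like the following may help. The function name is made up for illustration; the calls are from the standard library, and sys.flags.utf8_mode assumes Python 3.7 or newer.

```python
import locale
import sys


def describe_default_encodings():
    """Collect the encodings Python uses when none is given explicitly."""
    return {
        # Used by open() and argparse.FileType() when encoding is omitted.
        "preferred": locale.getpreferredencoding(False),
        # Encoding of the console stream, if it has one.
        "stdout": getattr(sys.stdout, "encoding", None),
        # 1 if Python was started in UTF-8 mode (Python 3.7+), else 0.
        "utf8_mode": sys.flags.utf8_mode,
    }


info = describe_default_encodings()
for key, value in info.items():
    print(key, value)
```

If "preferred" reports cp1252, files opened without an explicit encoding will be decoded as cp1252. As far as I know, starting Python 3.7+ with the PYTHONUTF8=1 environment variable (or -X utf8) switches that default to UTF-8, but I have not tested this on Windows.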

rsennrich commented 4 years ago

Myle Ott also commented elsewhere that fairseq expects UTF-8 throughout: https://github.com/pytorch/fairseq/issues/1287#issuecomment-566270467

I'm closing this, since this issue is ultimately caused by running fairseq in a non-supported environment.