Transformer validation data is too small

marian-nmt / marian-examples

Examples, tutorials and use cases for Marian, including our WMT-2017/18 baselines.

Transformer validation data is too small #1

Closed by afaji 6 years ago

afaji commented 6 years ago

I'm running ./run-me.sh in the transformer example, and the validation set seems far too small. The German side has only 1 line while the English side has 70:

[cs-aji1@login-e-3 transformer]$ wc -l data/valid.*
    1 data/valid.bpe.de
   70 data/valid.bpe.en
    0 data/valid.bpe.en.output
    1 data/valid.de
   70 data/valid.en
    1 data/valid.tc.de
   70 data/valid.tc.en
    1 data/valid.tok.de
   70 data/valid.tok.en
  284 total

I'm using this sacreBLEU: https://github.com/mjpost/sacreBLEU/tree/master

emjotde commented 6 years ago

Maybe sacrebleu failed. Remove ~/.sacrebleu and try again.

emjotde commented 6 years ago

Also, I think the example downloads sacrebleu by itself. If you do things manually, you are on your own :)

snukky commented 6 years ago

Files data/valid.{de,en} are downloaded automatically by sacreBLEU in the run-me.sh script and they should be 3k lines long. I tested that and the example works for me.

@afaji what about your data/test201?.en files? Those are also downloaded by sacreBLEU.
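A quick way to catch this kind of mismatch before training is to compare the line counts of the two sides of each parallel file pair. The following is only a minimal sketch: `count_lines` and `check_parallel` are hypothetical helper names, not part of marian-examples or sacreBLEU, and the paths in the usage note are simply the ones mentioned in this thread.

```python
def count_lines(path):
    """Count lines in a file, reading bytes so the locale cannot interfere."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def check_parallel(src_path, ref_path):
    """Return (source_lines, reference_lines, counts_match) for a file pair.

    Hypothetical helper: a parallel corpus is only usable when both sides
    have the same number of lines, and a truncated download breaks that.
    """
    src_count = count_lines(src_path)
    ref_count = count_lines(ref_path)
    return src_count, ref_count, src_count == ref_count
```

For example, `check_parallel("data/valid.en", "data/valid.de")` on the files from the original report would give `(70, 1, False)`. Note that this counts a final line without a trailing newline, whereas `wc -l` counts newline characters, so the two can differ by one.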

afaji commented 6 years ago

I redid everything: I removed ~/.sacrebleu and used the automatically downloaded sacrebleu as well.

It seems that I got the error UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 30: ordinal not in range(128) while executing:

LC_ALL=C.UTF-8 ../tools/sacreBLEU/sacrebleu.py -t wmt13 -l en-de --echo src > data/valid.en

I'm running this on the wilkes cluster. The same command apparently worked fine on valhalla.
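The error itself can be reproduced without sacreBLEU. When output is redirected to a file, Python has to encode text to bytes, and an ASCII-encoded stream cannot represent 'é' ('\xe9'). The sketch below simulates a redirected stdout with an in-memory stream; `write_with_encoding` is a hypothetical helper name, and this is an illustration of the failure mode, not the code path sacreBLEU actually uses.

```python
import io


def write_with_encoding(text, encoding):
    """Write text through a text stream with the given encoding; return the bytes.

    The TextIOWrapper stands in for sys.stdout when it is redirected to a file.
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    stream.write(text)  # raises UnicodeEncodeError if the encoding cannot represent text
    stream.flush()
    return buffer.getvalue()


def try_write(text, encoding):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return write_with_encoding(text, encoding)
    except UnicodeEncodeError as err:
        return str(err)
```

With `encoding="ascii"`, `try_write("caf\xe9", "ascii")` returns an error message of the same form as the one above, while `encoding="utf-8"` succeeds and yields the two-byte UTF-8 sequence for 'é'.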

emjotde commented 6 years ago

LC_ALL=C.UTF-8 is supposed to fix that. Matt says that the accented character which caused these issues has been removed in the newest version of sacrebleu.

kpu commented 6 years ago

Angus recommends PYTHONIOENCODING=utf-8 as working on shudder CentOS as well as Ubuntu.
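To check the effect of PYTHONIOENCODING on a given machine, one option is to launch a child Python process with the variable set and inspect the bytes it writes to a pipe. This is a minimal sketch, assuming Python 3 is the interpreter in use; `run_child` is a hypothetical helper name, and the child command is a stand-in for the sacrebleu.py call, not the real thing.

```python
import os
import subprocess
import sys


def run_child(extra_env):
    """Run a child Python that prints 'é' to a pipe.

    Returns (returncode, stdout_bytes). extra_env is merged into a copy of
    the current environment, e.g. {"PYTHONIOENCODING": "utf-8"}.
    """
    env = dict(os.environ)
    env.update(extra_env)
    proc = subprocess.run(
        # The child source is print('\xe9'), i.e. it prints the character é.
        [sys.executable, "-c", "print('\\xe9')"],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    code, out = run_child({"PYTHONIOENCODING": "utf-8"})
    print(code, out)
```

With PYTHONIOENCODING=utf-8 the child should exit with status 0 and emit the UTF-8 bytes for 'é'. Running the same helper with an empty dict shows what the machine's default locale produces, which is the case that failed on wilkes.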

emjotde commented 6 years ago

What's the status of this?