jadore801120 / attention-is-all-you-need-pytorch

A PyTorch implementation of the Transformer model in "Attention is All You Need".
MIT License

Preprocessing Error #41

Closed karanchahal closed 4 years ago

karanchahal commented 6 years ago

On running the following command for preprocessing:

for l in en de; do for f in data/multi30k/*.$l; do if [[ "$f" != *"test"* ]]; then sed -i "$ d" $f; fi; done; done

I'm getting the following error:

sed: 1: "data/multi30k/train.en": extra characters at the end of d command
sed: 1: "data/multi30k/val.en": extra characters at the end of d command
sed: 1: "data/multi30k/train.de": extra characters at the end of d command
sed: 1: "data/multi30k/val.de": extra characters at the end of d command

Please advise on how I should proceed.

zhangdistephen commented 6 years ago

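I guess you are using macOS. Just replace sed -i "$ d" with sed -i '' "$ d", then it will work. The BSD sed shipped with macOS requires a backup-suffix argument after -i (the empty string '' means no backup file). Without it, "$ d" is taken as the suffix and the file name is parsed as the sed script, which is why the error quotes the file path. Here is the reason: macOS sed question.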
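If the same command has to run on both Linux and macOS, one option is to attach a backup suffix directly to -i (no space), a form that both GNU sed and BSD sed accept, and then delete the backup. The following is a minimal sketch under that assumption, not the repository's own script. The function name drop_last_line is hypothetical, and the data/multi30k layout is taken from the command in the question.

```shell
# Delete the last line of a file in place, portably.
# -i.bak (suffix attached, no space) works with both GNU sed and BSD sed.
drop_last_line() {
    sed -i.bak '$ d' "$1" && rm -f "$1.bak"
}

# Apply it to every non-test file, mirroring the loop from the question.
for l in en de; do
    for f in data/multi30k/*.$l; do
        case "$f" in
            *test*) ;;  # leave test files untouched
            *)
                # Skip the unexpanded glob pattern when no files match.
                if [ -f "$f" ]; then
                    drop_last_line "$f"
                fi
                ;;
        esac
    done
done
```

The single quotes around '$ d' also keep the shell from touching the $ before sed sees it.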

sankuniu commented 6 years ago

Thank you for your attention! My environment includes Ubuntu 16.04, PyTorch 0.5, and Python 3.6, but I get errors when I preprocess the data:

sanku@ubuntu:~/apytorch$ for l in en de; do for f in data/multi30k/*.$l; do if [[ "$f" != *"test"* ]]; then sed -i "$ d" $f; fi; done; done
sanku@ubuntu:~/apytorch$ for l in en de; do for f in data/multi30k/*.$l; do perl tokenizer.perl -a -no-escape -l $l -q < $f > $f.atok; done; done
Can't open perl script "tokenizer.perl": No such file or directory
Can't open perl script "tokenizer.perl": No such file or directory
Can't open perl script "tokenizer.perl": No such file or directory
Can't open perl script "tokenizer.perl": No such file or directory
Can't open perl script "tokenizer.perl": No such file or directory
Can't open perl script "tokenizer.perl": No such file or directory

whikwon commented 6 years ago

@sankuniu The tokenizer scripts have to be downloaded first. Run the commands below in the directory where you run the preprocessing:

wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/tokenizer.perl
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.de
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
sed -i "s/$RealBin\/..\/share\/nonbreaking_prefixes//" tokenizer.perl
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl
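
The "Can't open perl script" errors above mean that tokenizer.perl was not in the working directory when the loop ran. As a quick sanity check after the downloads, a sketch like the following lists which of the four files are still missing. The function name missing_files is hypothetical, and the sketch assumes the files were downloaded into the current directory.

```shell
# Print the name of every expected file that is absent from a directory.
# Usage: missing_files DIR NAME...
missing_files() {
    dir=$1
    shift
    for name in "$@"; do
        [ -f "$dir/$name" ] || echo "$name"
    done
}

# Check the current directory for the four files fetched with wget above.
missing=$(missing_files . tokenizer.perl nonbreaking_prefix.de nonbreaking_prefix.en multi-bleu.perl)
if [ -n "$missing" ]; then
    echo "Missing files:"
    echo "$missing"
else
    echo "All helper scripts are present."
fi
```

If anything is reported missing, rerun the corresponding wget command before starting the tokenization loop.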