mit-han-lab / lite-transformer

[ICLR 2020] Lite Transformer with Long-Short Range Attention
https://arxiv.org/abs/2004.11886

wmt14 en-fr data processing problem #16

Closed macn3388 closed 3 years ago

macn3388 commented 3 years ago

bash: /opt/tiger/conda/lib/libtinfo.so.6: no version information available (required by bash)
Cloning Moses github repository (for tokenization scripts)...
fatal: destination path 'mosesdecoder' already exists and is not an empty directory.
Cloning Subword NMT repository (for BPE pre-processing)...
fatal: destination path 'subword-nmt' already exists and is not an empty directory.
training-parallel-europarl-v7.tgz already exists, skipping download
training-parallel-commoncrawl.tgz already exists, skipping download
training-parallel-un.tgz already exists, skipping download
training-parallel-nc-v9.tgz already exists, skipping download
training-giga-fren.tar already exists, skipping download
test-full.tgz already exists, skipping download
gzip: giga-fren.release2.fixed.*.gz: No such file or directory
/home/tiger/lite-transformer
pre-processing train data...
rm: cannot remove 'data/wmt14_en_fr/wmt14.tokenized.en-fr/tmp/train.tags.en-fr.tok.en': No such file or directory
Tokenizer Version 1.1 Language: en Number of threads: 8
Tokenizer Version 1.1 Language: en Number of threads: 8
Tokenizer Version 1.1 Language: en Number of threads: 8
Tokenizer Version 1.1 Language: en Number of threads: 8
Tokenizer Version 1.1 Language: en Number of threads: 8
rm: cannot remove 'data/wmt14_en_fr/wmt14.tokenized.en-fr/tmp/train.tags.en-fr.tok.fr': No such file or directory
Tokenizer Version 1.1 Language: fr Number of threads: 8
Tokenizer Version 1.1 Language: fr Number of threads: 8
Tokenizer Version 1.1 Language: fr Number of threads: 8
Tokenizer Version 1.1 Language: fr Number of threads: 8
Tokenizer Version 1.1 Language: fr Number of threads: 8
pre-processing test data...
Tokenizer Version 1.1 Language: en Number of threads: 8
Tokenizer Version 1.1 Language: fr Number of threads: 8
splitting train and valid...
learn_bpe.py on data/wmt14_en_fr/wmt14.tokenized.en-fr/tmp/train.fr-en...
apply_bpe.py to train.en...
subword-nmt/apply_bpe.py:416: ResourceWarning: unclosed file <_io.TextIOWrapper name='data/wmt14_en_fr/wmt14.tokenized.en-fr/code' mode='r' encoding='UTF-8'>
  args.codes = codecs.open(args.codes.name, encoding='utf-8')
ResourceWarning: Enable tracemalloc to get the object allocation traceback
apply_bpe.py to valid.en...
subword-nmt/apply_bpe.py:416: ResourceWarning: unclosed file <_io.TextIOWrapper name='data/wmt14_en_fr/wmt14.tokenized.en-fr/code' mode='r' encoding='UTF-8'>
  args.codes = codecs.open(args.codes.name, encoding='utf-8')
ResourceWarning: Enable tracemalloc to get the object allocation traceback
apply_bpe.py to test.en...
subword-nmt/apply_bpe.py:416: ResourceWarning: unclosed file <_io.TextIOWrapper name='data/wmt14_en_fr/wmt14.tokenized.en-fr/code' mode='r' encoding='UTF-8'>
  args.codes = codecs.open(args.codes.name, encoding='utf-8')
ResourceWarning: Enable tracemalloc to get the object allocation traceback
apply_bpe.py to train.fr...
subword-nmt/apply_bpe.py:416: ResourceWarning: unclosed file <_io.TextIOWrapper name='data/wmt14_en_fr/wmt14.tokenized.en-fr/code' mode='r' encoding='UTF-8'>
  args.codes = codecs.open(args.codes.name, encoding='utf-8')
ResourceWarning: Enable tracemalloc to get the object allocation traceback
apply_bpe.py to valid.fr...
subword-nmt/apply_bpe.py:416: ResourceWarning: unclosed file <_io.TextIOWrapper name='data/wmt14_en_fr/wmt14.tokenized.en-fr/code' mode='r' encoding='UTF-8'>
  args.codes = codecs.open(args.codes.name, encoding='utf-8')
ResourceWarning: Enable tracemalloc to get the object allocation traceback
apply_bpe.py to test.fr...
subword-nmt/apply_bpe.py:416: ResourceWarning: unclosed file <_io.TextIOWrapper name='data/wmt14_en_fr/wmt14.tokenized.en-fr/code' mode='r' encoding='UTF-8'>
  args.codes = codecs.open(args.codes.name, encoding='utf-8')
ResourceWarning: Enable tracemalloc to get the object allocation traceback
clean-corpus.perl: processing data/wmt14_en_fr/wmt14.tokenized.en-fr/tmp/bpe.train.en & .fr to data/wmt14_en_fr/wmt14.tokenized.en-fr/train, cutoff 1-250, ratio 1.5
..........(100000)..........(200000)..........(300000) [...] ..........(40700000)..........(40800000).
Input sentences: 40811694 Output sentences: 35762532
clean-corpus.perl: processing data/wmt14_en_fr/wmt14.tokenized.en-fr/tmp/bpe.valid.en & .fr to data/wmt14_en_fr/wmt14.tokenized.en-fr/valid, cutoff 1-250, ratio 1.5
...
Input sentences: 30639 Output sentences: 26854
Traceback (most recent call last):
  File "/opt/tiger/conda/bin/fairseq-preprocess", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-preprocess')()
  File "/opt/tiger/conda/lib/python3.7/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/opt/tiger/conda/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2852, in load_entry_point
    return ep.load()
  File "/opt/tiger/conda/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2443, in load
    return self.resolve()
  File "/opt/tiger/conda/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2449, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/tiger/lite-transformer/fairseq_cli/preprocess.py", line 1
    ../preprocess.py
    ^
SyntaxError: invalid syntax

Are there any suggestions? Thanks!

Michaelvll commented 3 years ago

Thank you for asking. Are you using the latest version? I believe this problem was fixed by #10.

macn3388 commented 3 years ago

Yes, I am sure I used the latest version (commit 935f5e5; I copied prepare.sh and ran it), but the problem still occurs in my run.

Michaelvll commented 3 years ago

TL;DR: replace fairseq-preprocess with python preprocess.py.
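
For anyone who prefers to script that swap rather than edit by hand, here is a minimal sketch. The helper name use_local_preprocess and the example excerpt are hypothetical, and the real arguments in prepare.sh may differ; only the command name at the start of a line is rewritten.

```python
import re


def use_local_preprocess(script_text: str) -> str:
    """Swap the fairseq-preprocess console script for a direct call to preprocess.py."""
    # Match the command only where it starts a line (optionally indented), so
    # mentions inside comments or echo strings are left untouched.
    pattern = re.compile(r"^([ \t]*)fairseq-preprocess\b", flags=re.MULTILINE)
    return pattern.sub(r"\1python preprocess.py", script_text)


# Hypothetical excerpt for illustration; not copied from the real prepare.sh.
example = (
    "echo 'running fairseq-preprocess'\n"
    "fairseq-preprocess --source-lang en --target-lang fr \\\n"
    "    --trainpref $TEXT/train\n"
)
print(use_local_preprocess(example))
```

The echo line is left unchanged because the command does not start that line; everything after the command name, including line continuations, is passed through as-is.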

Hi, I think you may need to uninstall the codebase and run pip install -e . again in the repository folder after updating the code. Alternatively, you can replace fairseq-preprocess with python preprocess.py in https://github.com/mit-han-lab/lite-transformer/blob/935f5e53078cced344424bf4244370e3ac435ff4/configs/wmt14.en-fr/prepare.sh#L139.
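
Before reinstalling, it may help to check what fairseq_cli/preprocess.py actually contains. The traceback reports that line 1 of that file is the text ../preprocess.py, which is not valid Python. One possible explanation, not confirmed in this thread, is that the file is meant to be a symlink and was checked out or copied as a regular text file. The sketch below uses a hypothetical helper, classify_entry_file, to tell these cases apart; the "placeholder" heuristic is an assumption.

```python
import ast
import os
import tempfile


def classify_entry_file(path: str) -> str:
    """Return 'symlink', 'python', or 'placeholder' for the file at path."""
    if os.path.islink(path):
        return "symlink"
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    try:
        ast.parse(text)
        return "python"
    except SyntaxError:
        # Assumption: a file holding a single line that ends in .py is a
        # symlink target stored as plain text rather than real source code.
        lines = text.strip().splitlines()
        if len(lines) == 1 and lines[0].endswith(".py"):
            return "placeholder"
        raise


# Reproduce the situation from the traceback in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "preprocess.py")
    with open(demo, "w", encoding="utf-8") as handle:
        handle.write("../preprocess.py\n")
    print(classify_entry_file(demo))
```

If the real file is reported as a placeholder, calling python preprocess.py directly sidesteps the problem, since it does not go through fairseq_cli at all.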