tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Python 2 t2t-datagen fails on Unicode errors #102

Closed dakami closed 7 years ago

dakami commented 7 years ago

$ python /usr/local/bin/t2t-datagen --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR --num_shards=100 --problem=$PROBLEM
INFO:tensorflow:Generating training data for wmt_ende_tokens_32k.
INFO:tensorflow:Not downloading, file already found: /mnt/store/t2t/t2t_datagen/training-parallel-nc-v11.tgz
INFO:tensorflow:Reading file: training-parallel-nc-v11/news-commentary-v11.de-en.en
/usr/local/lib/python2.7/dist-packages/tensor2tensor-1.0.10-py2.7.egg/tensor2tensor/data_generators/tokenizer.py:82: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
/usr/local/lib/python2.7/dist-packages/tensor2tensor-1.0.10-py2.7.egg/tensor2tensor/data_generators/tokenizer.py:86: UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
INFO:tensorflow:Reading file: training-parallel-nc-v11/news-commentary-v11.de-en.de
INFO:tensorflow:Not downloading, file already found: /mnt/store/t2t/t2t_datagen/training-parallel-commoncrawl.tgz
INFO:tensorflow:Reading file: commoncrawl.de-en.en
INFO:tensorflow:Reading file: commoncrawl.de-en.de
INFO:tensorflow:Reading file: commoncrawl.fr-en.en
INFO:tensorflow:Reading file: commoncrawl.fr-en.fr
INFO:tensorflow:Not downloading, file already found: /mnt/store/t2t/t2t_datagen/training-parallel-europarl-v7.tgz
INFO:tensorflow:Reading file: training/europarl-v7.de-en.en
INFO:tensorflow:Reading file: training/europarl-v7.de-en.de
INFO:tensorflow:Reading file: training/europarl-v7.fr-en.en
INFO:tensorflow:Reading file: training/europarl-v7.fr-en.fr
INFO:tensorflow:Not downloading, file already found: /mnt/store/t2t/t2t_datagen/training-giga-fren.tar
INFO:tensorflow:Reading file: giga-fren.release2.fixed.en.gz
INFO:tensorflow:Subdirectory /mnt/store/t2t/t2t_datagen/giga-fren.release2.fixed.en.gz already exists, skipping unpacking
INFO:tensorflow:Reading file: giga-fren.release2.fixed.fr.gz
INFO:tensorflow:Subdirectory /mnt/store/t2t/t2t_datagen/giga-fren.release2.fixed.fr.gz already exists, skipping unpacking
INFO:tensorflow:Not downloading, file already found: /mnt/store/t2t/t2t_datagen/training-parallel-un.tgz
INFO:tensorflow:Reading file: un/undoc.2000.fr-en.en
INFO:tensorflow:Reading file: un/undoc.2000.fr-en.fr
INFO:tensorflow:Trying min_count 500
INFO:tensorflow:Iteration 0
/usr/local/lib/python2.7/dist-packages/tensor2tensor-1.0.10-py2.7.egg/tensor2tensor/data_generators/text_encoder.py:475: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
/usr/local/lib/python2.7/dist-packages/tensor2tensor-1.0.10-py2.7.egg/tensor2tensor/data_generators/text_encoder.py:417: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Traceback (most recent call last):
  File "/usr/local/bin/t2t-datagen", line 378, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/local/bin/t2t-datagen", line 361, in main
    training_gen(), FLAGS.problem + UNSHUFFLED_SUFFIX + "-train",
  File "/usr/local/bin/t2t-datagen", line 151, in <lambda>
    lambda: wmt.ende_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**15),
  File "build/bdist.linux-x86_64/egg/tensor2tensor/data_generators/wmt.py", line 230, in ende_wordpiece_token_generator
  File "build/bdist.linux-x86_64/egg/tensor2tensor/data_generators/generator_utils.py", line 265, in get_or_generate_vocab
  File "build/bdist.linux-x86_64/egg/tensor2tensor/data_generators/text_encoder.py", line 343, in build_to_target_size
  File "build/bdist.linux-x86_64/egg/tensor2tensor/data_generators/text_encoder.py", line 329, in bisect
  File "build/bdist.linux-x86_64/egg/tensor2tensor/data_generators/text_encoder.py", line 417, in build_from_token_counts
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

$ echo $PROBLEM
wmt_ende_tokens_32k
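For readers hitting the same traceback: the warnings and the final `UnicodeDecodeError` are what Python 2 typically produces when byte strings and `unicode` strings are mixed, because it tries to convert the byte string with the ASCII codec. The sketch below reproduces the error message and shows one common remedy, decoding every token to text before it is compared or counted. The helpers `to_unicode` and `count_tokens` are hypothetical names made up for this illustration; they are not part of tensor2tensor, and this is not claimed to be the change that went into the library.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def to_unicode(token, encoding="utf-8"):
    """Return `token` as text, decoding it first if it is a byte string.

    Hypothetical helper; the UTF-8 default is an assumption about the corpus.
    """
    if isinstance(token, bytes):
        return token.decode(encoding)
    return token


def count_tokens(tokens):
    """Count tokens after normalizing each one to text (hypothetical helper)."""
    counts = {}
    for token in tokens:
        key = to_unicode(token)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Reproduce the message from the traceback: a non-ASCII byte hits the ASCII codec.
try:
    b"\x80".decode("ascii")
except UnicodeDecodeError as err:
    ascii_error = str(err)
print(ascii_error)

# A byte-string token and a text token for the same word share one key once decoded.
mixed = ["Stra\u00dfe".encode("utf-8"), "Stra\u00dfe", b"road"]
counts = count_tokens(mixed)
print(len(counts))
```

Decoding once at the point where files are read, and working with text everywhere after that, avoids the implicit ASCII conversion that Python 2 performs when the two string types meet.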

stefan-it commented 7 years ago

It's a known problem, see more information here :)

lukaszkaiser commented 7 years ago

This is hopefully corrected in 1.0.11. Please give it a try. I'm closing for now; please reopen if you still see the issue.
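The log in the report shows tensor2tensor 1.0.10 on the path, so it may help to confirm that the installed release is at least 1.0.11 before retrying. A minimal sketch of that comparison follows; `version_tuple` and `includes_fix` are hypothetical helpers, they assume purely numeric dotted versions, and the installed version string itself has to come from elsewhere (for example the output of `pip show tensor2tensor`).

```python
FIXED_IN = "1.0.11"  # release named in the comment above


def version_tuple(version):
    """Turn a dotted version string such as '1.0.10' into (1, 0, 10)."""
    return tuple(int(part) for part in version.split("."))


def includes_fix(installed):
    """Return True if `installed` is at or past the release named above."""
    return version_tuple(installed) >= version_tuple(FIXED_IN)


print(includes_fix("1.0.10"))
print(includes_fix("1.0.11"))
```

Comparing integer tuples rather than raw strings matters here: as plain strings, "1.0.9" sorts after "1.0.11".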