tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Memory Error when I do t2t-datagen #1632

Open DarkT2014 opened 5 years ago

DarkT2014 commented 5 years ago

Description

I followed the tutorial at https://cloud.google.com/tpu/docs/tutorials/transformer to build a translation model. A MemoryError occurs when I run t2t-datagen after changing the problem parameter to translate_enzh_wmt32k. It also occurs when I run it on my own machine. Is this a bug, or is there something I should do?

Environment information

GCP

OS: <your answer here>

$ pip freeze | grep tensor
# mesh-tensorflow==0.0.5
tensor2tensor==1.13.2
tensorboard==1.14.0
tensorflow==1.14.0rc0
tensorflow-datasets==1.0.2
tensorflow-metadata==0.13.0
tensorflow-probability==0.7.0
tensorflow-serving-api==1.12.0

$ python -V
# Python 2.7.13

For bugs: reproduction and error logs

# Steps to reproduce:
STORAGE_BUCKET=gs://darkt_t2t
lincheng37211@lincheng37211:~$ DATA_DIR=$STORAGE_BUCKET/data
lincheng37211@lincheng37211:~$ TMP_DIR=/mnt/disks/newdisk/t2t_tmp
lincheng37211@lincheng37211:~$ mkdir /mnt/disks/newdisk/t2t_tmp
lincheng37211@lincheng37211:~$ t2t-datagen --problem=translate_enzh_wmt32k --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR
# Error logs:
I0715 03:12:54.598854 140075642348992 generator_utils.py:229] Not downloading, file already found: /mnt/disks/newdisk/t2t_tmp/training-parallel-nc-v13.tgz
I0715 03:12:54.598951 140075642348992 generator_utils.py:380] Reading file: training-parallel-nc-v13/news-commentary-v13.zh-en.zh
I0715 03:12:59.571671 140075642348992 text_encoder.py:722] Trying min_count 500
I0715 03:13:00.638477 140075642348992 text_encoder.py:802] Iteration 0
I0715 03:13:02.438431 140075642348992 text_encoder.py:825] Processing token [世界上其他的国家也许会任由中美两国被它们排放的废物所吞没] took 0.197419881821 seconds, consider setting Text2TextProblem.max_subtoken_length to a smaller value.
I0715 03:13:04.429254 140075642348992 text_encoder.py:825] Processing token [中国显然正在利用] took 0.38846206665 seconds, consider setting Text2TextProblem.max_subtoken_length to a smaller value.
I0715 03:13:08.614118 140075642348992 text_encoder.py:825] Processing token [这是中美相互依存进入破坏阶段所造成的双边不信任程度深化的一个鲜明例子] took 0.834113121033 seconds, consider setting Text2TextProblem.max_subtoken_length to a smaller value.
I0715 03:13:17.385082 140075642348992 text_encoder.py:825] Processing token [但要靠如此简单的做法解决我们目前面临的极度复杂并互相联系的生态危机是不可能的] took 1.78553104401 seconds, consider setting Text2TextProblem.max_subtoken_length to a smaller value.
I0715 03:13:36.257227 140075642348992 text_encoder.py:825] Processing token [阿伦特中心邀请琼根发言的决策] took 3.45979595184 seconds, consider setting Text2TextProblem.max_subtoken_length to a smaller value.
Traceback (most recent call last):
  File "/usr/local/bin/t2t-datagen", line 28, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/usr/local/bin/t2t-datagen", line 23, in main
    t2t_datagen.main(argv)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/bin/t2t_datagen.py", line 216, in main
    generate_data_for_registered_problem(problem)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/bin/t2t_datagen.py", line 301, in generate_data_for_registered_problem
    problem.generate_data(data_dir, tmp_dir, task_id)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/text_problems.py", line 361, in generate_data
    self.generate_encoded_samples(data_dir, tmp_dir, split), paths)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/translate_enzh.py", line 235, in generate_encoded_samples
    max_subtoken_length=self.max_subtoken_length)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/generator_utils.py", line 368, in get_or_generate_vocab
    vocab_generator, max_subtoken_length)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/generator_utils.py", line 352, in get_or_generate_vocab_inner
    reserved_tokens=reserved_tokens)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/text_encoder.py", line 673, in build_from_generatorexit
    reserved_tokens=reserved_tokens)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/text_encoder.py", line 748, in build_to_target_size
    return bisect(min_val, max_val)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/text_encoder.py", line 727, in bisect
    reserved_tokens=reserved_tokens)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/text_encoder.py", line 819, in build_from_token_counts
    subtoken_counts[new_subtoken] += count
MemoryError
Santosh-Gupta commented 5 years ago

This suggestion seems to be appearing often:

Processing token [阿伦特中心邀请琼根发言的决策] took 3.45979595184 seconds, consider setting Text2TextProblem.max_subtoken_length to a smaller value.

I noticed that the default max subtoken length is 200. I am trying to figure out whether that's the number of subtokens (I would imagine that would be the vocab size if we're using the subword model) or the maximum length of a word that can be represented by a subtoken. If it's the latter, that is very large; 200 is longer than most words.

Let me know if you figure out how exactly max_subtoken_length is defined, and if it should be changed.

The parameter default is set here:

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_problems.py#L324

This is where max_subtoken_length is used:

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py#L815
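
For what it's worth, here is a simplified sketch of what I understand that counting loop to be doing (my own code, not the library's; the function name is made up). Judging from the log lines above, whole unsegmented Chinese sentences are treated as single tokens, so the substring enumeration below grows roughly quadratically with sentence length, which would explain the MemoryError:

# -*- coding: utf-8 -*-
# Hypothetical sketch of the candidate-counting step around
# text_encoder.py#L815, NOT the real implementation.
import collections

def count_subtoken_candidates(token_counts, max_subtoken_length=200):
    """Count every substring of every token, up to max_subtoken_length chars."""
    subtoken_counts = collections.defaultdict(int)
    for token, count in token_counts.items():
        for start in range(len(token)):
            last = min(len(token), start + max_subtoken_length)
            for end in range(start + 1, last + 1):
                subtoken_counts[token[start:end]] += count
    return subtoken_counts

# One 34-character "token" (a whole sentence from the log above) already
# yields hundreds of distinct candidate substrings; a full corpus of such
# sentences makes the counts dict enormous.
sentence = u"这是中美相互依存进入破坏阶段所造成的双边不信任程度深化的一个鲜明例子"
print(len(count_subtoken_candidates({sentence: 1})))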

CSLujunyu commented 5 years ago

Did you figure out how to define max_subtoken_length manually?

DarkT2014 commented 5 years ago

Update: I tried changing max_subtoken_length in the source code and rebuilding, but neither a larger nor a smaller value helped; the problem still exists. I think the tokenizer may have a problem with Chinese. Is this a bug?
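
For anyone else trying this, a way to override the value without editing the installed package (a sketch only, untested on my side; the module name, class name, and the value 20 are my own picks) is to subclass the problem in a --t2t_usr_dir module:

# my_enzh_problem.py -- hedged sketch, not verified end to end.
from tensor2tensor.data_generators import translate_enzh
from tensor2tensor.utils import registry

@registry.register_problem
class TranslateEnzhWmt32kShortSubtokens(translate_enzh.TranslateEnzhWmt32k):
  """Same En-Zh problem, but with a tighter cap on subtoken candidates."""

  @property
  def max_subtoken_length(self):
    # Chinese "tokens" are whole unsegmented sentences, so cap the
    # candidate substring length to bound memory during vocab building.
    return 20

and then something like:

t2t-datagen --t2t_usr_dir=. \
  --problem=translate_enzh_wmt32k_short_subtokens \
  --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR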