OOM when training a machine translation model

Description

Screenshot 2021-04-22 10.18.34 AM

Can you tell me why I still get an OOM error even with batch_size set to 4? I suspect the OOM is triggered during model saving and evaluation, but I don't know the specific cause.

Environment information

python /root/anaconda3/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py \
  --data_dir=./data_dir \
  --problem=translate_enzh_bpe50k \
  --model=transformer \
  --hparams="batch_size=4" \
  --hparams_set=transformer_base_single_gpu \
  --output_dir=./en_zh_model \
  --schedule=continuous_train_and_eval \
  --train_steps=900000 \
  --t2t_usr_dir=user_dir

The English data is preprocessed with BPE.

python 3.7
tensor2tensor == 1.9.0
tensorflow-gpu == 1.12.0

Screenshot 2021-04-22 10.30.12 AM
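One thing that may matter for the batch_size=4 setting: as far as I understand, tensor2tensor treats batch_size for text problems as a token budget per batch per GPU, not as a number of sentences. The sketch below is only an illustration under that assumption. The function name and the integer-division rule with a floor of one sequence are my own simplification, not the library's actual code.

```python
def sentences_per_batch(batch_size_tokens: int, bucket_length: int) -> int:
    """Approximate number of sequences batched together for one length bucket.

    Assumption (hypothetical simplification): the token budget is divided by
    the bucket's sequence length, and a batch never holds fewer than one
    sequence.
    """
    if batch_size_tokens <= 0 or bucket_length <= 0:
        raise ValueError("both arguments must be positive")
    return max(1, batch_size_tokens // bucket_length)


if __name__ == "__main__":
    # Compare a few token budgets against a few sequence lengths.
    for budget in (4, 1024, 4096):
        for length in (16, 64, 256):
            print(budget, length, sentences_per_batch(budget, length))
```

If this assumption holds, a budget of 4 tokens already gives the minimum of one sentence per batch, so the remaining memory use would depend on sentence length and model size rather than on batch_size.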


For bugs: reproduction and error logs

# Steps to reproduce:
...

# Error logs:
...

tensorflow / tensor2tensor