tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

train machine translation OOM #1885

Closed charlesfufu closed 3 years ago

charlesfufu commented 3 years ago

Description

(Screenshot, 2021-04-22 10:18:34 AM)

Can you tell me why I still get an OOM error even with batch_size set to 4? I suspect the OOM is triggered by model saving and evaluation, but I can't pin down the exact cause.

Environment information

python /root/anaconda3/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py \
  --data_dir=./data_dir \
  --problem=translate_enzh_bpe50k \
  --model=transformer \
  --hparams="batch_size=4" \
  --hparams_set=transformer_base_single_gpu \
  --output_dir=./en_zh_model \
  --schedule=continuous_train_and_eval \
  --train_steps=900000 \
  --t2t_usr_dir=user_dir

The English data is preprocessed with BPE.

python 3.7
tensor2tensor == 1.9.0
tensorflow-gpu == 1.12.0

(Screenshot, 2021-04-22 10:30:12 AM)

OS: <your answer here>

$ pip freeze | grep tensor
# your output here

$ python -V
# your output here
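
If a shell is not handy, the two commands above can be roughly approximated from Python. This is only a sketch: it assumes Python 3.8+ (for `importlib.metadata`, which is newer than the interpreter used in this report), and `matching_packages` is a hypothetical helper name, not part of tensor2tensor or TensorFlow.

```python
import platform
from importlib import metadata


def matching_packages(needle="tensor"):
    """Return sorted 'name==version' strings for installed packages
    whose distribution name contains `needle` (case-insensitive)."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with missing metadata.
        if name and needle.lower() in name.lower():
            found.add(f"{name}=={dist.version}")
    return sorted(found)


if __name__ == "__main__":
    # Rough equivalent of `python -V`
    print("Python", platform.python_version())
    # Rough equivalent of `pip freeze | grep tensor`
    for line in matching_packages("tensor"):
        print(line)
```

The output format mirrors `pip freeze` lines, so it can be pasted straight into the template fields above.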

For bugs: reproduction and error logs

# Steps to reproduce:
...
# Error logs:
...