Description
Can you tell me why I still get an OOM error even though I set batch_size to 4? I suspect the OOM happens during model saving and evaluation, but I can't pinpoint the specific cause.
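To narrow this down, one thing I plan to try (a debugging sketch, not a confirmed diagnosis) is rerunning with a train-only schedule, so no evaluation graph is built alongside training; if the OOM disappears, the save/eval phase would be the likely culprit. This is the same command as below with only --schedule changed (as far as I know, train is a valid t2t_trainer schedule alongside continuous_train_and_eval):

python /root/anaconda3/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py --data_dir=./data_dir \
--problem=translate_enzh_bpe50k \
--model=transformer \
--hparams="batch_size=4" \
--hparams_set=transformer_base_single_gpu \
--output_dir=./en_zh_model \
--schedule=train \
--train_steps=900000 \
--t2t_usr_dir=user_dir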
Environment information
python /root/anaconda3/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py --data_dir=./data_dir \
--problem=translate_enzh_bpe50k \
--model=transformer \
--hparams="batch_size=4" \
--hparams_set=transformer_base_single_gpu \
--output_dir=./en_zh_model \
--schedule=continuous_train_and_eval \
--train_steps=900000 \
--t2t_usr_dir=user_dir
The English data is preprocessed with BPE.
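For reference, my understanding is that for text problems in tensor2tensor the batch_size hparam counts subword tokens per batch, not sentences, so batch_size=4 is far below any realistic setting and an OOM during training itself would be surprising. If training memory did need capping, the knobs I know of are a token-level batch_size together with the max_length hparam, which limits sequence length (the values below are purely illustrative, not what I am running):

--hparams="batch_size=1024,max_length=256"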
python 3.7
tensor2tensor == 1.9.0
tensorflow-gpu == 1.12.0
For bugs: reproduction and error logs
![Screenshot 2021-04-22 10:30:12 AM](https://user-images.githubusercontent.com/33311822/115647151-e71e4380-a355-11eb-81f4-fa8e3e0e1b88.png)
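I can also capture the full error as text instead of a screenshot, so the stack trace and the OOM allocation report are searchable, e.g. by teeing the run output to a file (plain shell, nothing t2t-specific):

python /root/anaconda3/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py [flags as above] 2>&1 | tee train.log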