Open Likede15 opened 4 years ago
P.S.: Training on a single GPU works fine with batch_size=1024, but with 8 GPUs an OOM error is always reported, no matter how small the batch size is set.
Hi @Likede15
I am experiencing the same problem. Have you solved it yet?
I downgraded TensorFlow from version 2.2 to 1.15.
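For reference, a downgrade like the one described above can be done with pip (the exact package name depends on your setup; `tensorflow-gpu` is the GPU build for the 1.x line):

```shell
# Remove the 2.2 install, then pin the 1.15 GPU build
pip uninstall -y tensorflow tensorflow-gpu
pip install tensorflow-gpu==1.15
```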
Hi @Likede15
Thanks for your reply. I will give it a try and will let you know.
Description
We customized the translation problem to use our own dictionary. When training with worker_gpu=8, batch_size=1024, and model=transformer_big, an OOM error occurs.
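For context, a typical tensor2tensor invocation with these settings looks roughly like the following sketch (the `--problem` name and data/output paths are placeholders for our custom setup, not the actual values used):

```shell
# Hypothetical t2t-trainer command reproducing the reported configuration:
# 8 worker GPUs, transformer_big hparams, per-GPU batch size of 1024 tokens
t2t-trainer \
  --problem=my_custom_translate \
  --model=transformer \
  --hparams_set=transformer_big \
  --hparams='batch_size=1024' \
  --worker_gpu=8 \
  --data_dir=/path/to/data \
  --output_dir=/path/to/output
```

Note that in tensor2tensor, batch_size is applied per GPU, so the effective global batch grows with worker_gpu, which may explain why multi-GPU runs hit OOM while a single GPU does not.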
An excerpt of the error log is as follows:
Environment information
For bugs: reproduction and error logs