There is a bug in some cuDNN/CUDA versions where matrix multiplication fails for large batch sizes.
Which CUDA version are you using?
I am using CUDA version 10.1. But my other trainings with `transformer_big` are running with `--fp16` from fairseq on the same version. Do you think it is related to that? I am using `--d-m` 512, by the way. Should I reduce `max_tokens`, and if so, what is the ideal value?
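For context, a minimal sketch of what an fp16 run with a reduced batch size could look like (the data path, architecture, and the `--max-tokens` value are placeholders I'm assuming, not values from this thread):

```bash
# Hypothetical fp16 training run; data-bin/my_data and 3584 are placeholders.
fairseq-train data-bin/my_data \
    --arch transformer_wmt_en_de_big \
    --fp16 \
    --max-tokens 3584 \
    --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt \
    --warmup-updates 4000
```

Lowering `--max-tokens` shrinks the per-GPU batch in tokens, which is the usual first knob to turn if large matrices are the problem.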
You need to use CUDA 10.2+. Since Transformer_big uses NVIDIA's dedicated kernels, it does not encounter the matrix multiplication issue when using large matrices.
There was a bug in CUDA 10.1. See here: https://github.com/pytorch/pytorch/issues/24018#issuecomment-528004576
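One quick way to confirm which CUDA/cuDNN build your PyTorch install is actually using (standard `torch` attributes, nothing fairseq-specific):

```bash
# Print the CUDA and cuDNN versions this PyTorch build was compiled against.
python -c "import torch; print(torch.version.cuda, torch.backends.cudnn.version())"
```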
Thanks, will check it out. Closing for now.
It is mentioned that apex should be installed, but no option for it is given in the training command. I tried using the default `--fp16` flag from fairseq but am getting the error below. I want to train with `--fp16`. Please suggest. Thanks.
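In case it's useful: as far as I know, fairseq's `--fp16` does not strictly require apex (fairseq has its own fp16 implementation; apex only speeds some ops up). If you do want apex, the source install looked roughly like the following at the time (taken from NVIDIA's apex README; check the current README before running):

```bash
# Build apex with its CUDA extensions (per NVIDIA's apex README).
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```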