sacmehta / delight

DeLighT: Very Deep and Light-Weight Transformers

Unable to do fp16 training. #4

Closed: sugeeth14 closed this issue 4 years ago

sugeeth14 commented 4 years ago

The README says to install apex, but the training command does not include any fp16 option. I tried fairseq's default `--fp16` flag and got the error below.

```
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
```

I want to train with `--fp16`. Please suggest a fix. Thanks.
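Roughly, the invocation I am trying looks like the sketch below (assuming the standard fairseq-train entry point; the data path, the `--arch` value, and the token budget are placeholders, not the repo's documented values):

```bash
# Sketch only: the data path and architecture name are placeholders, and
# the --max-tokens budget is illustrative. --fp16 and --d-m 512 are the
# options discussed in this thread.
fairseq-train data-bin/wmt14_en_de \
  --arch delight_transformer \
  --d-m 512 \
  --max-tokens 4096 \
  --fp16
```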

sacmehta commented 4 years ago

There is a bug in some cuDNN/CUDA versions where large batch sizes in matrix multiplication are not handled correctly.

Which CUDA version are you using?

sugeeth14 commented 4 years ago

I am using CUDA version 10.1. But my other trainings with transformer_big are running with fp16 from fairseq on the same version; do you think it is related to that? I am using --d-m 512, by the way. Should I reduce max_tokens, and if so, what is the ideal value?

sacmehta commented 4 years ago

You need to use CUDA 10.2+. Since transformer_big uses NVIDIA's dedicated kernel, it does not encounter the matrix multiplication issue with large matrices.

There was a bug in CUDA 10.1. See here: https://github.com/pytorch/pytorch/issues/24018#issuecomment-528004576
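To check which CUDA runtime your PyTorch build ships with, and to exercise a large half-precision matmul of the kind the linked issue reports failing, you can run something like the sketch below (the 4096x4096 shapes are illustrative and may not trigger the bug on every setup):

```bash
# Confirm the CUDA runtime baked into the PyTorch build (expect 10.2+).
python -c "import torch; print(torch.version.cuda)"

# Exercise a large fp16 GEMM; on affected CUDA 10.1 builds this raises
# CUBLAS_STATUS_EXECUTION_FAILED. Shapes are illustrative, not exhaustive.
python - <<'EOF'
import torch
a = torch.randn(4096, 4096, device="cuda", dtype=torch.half)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.half)
print((a @ b).shape)
EOF
```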

sugeeth14 commented 4 years ago

Thanks, I will check it out. Closing for now.