unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

getting cuda error "CUBLAS_STATUS_INTERNAL_ERROR" after using "left" side for "padding" and "truncation" #296

Open ml-maddi opened 5 months ago

ml-maddi commented 5 months ago

I have been using Unsloth to train a chatbot on the Mistral 7B Instruct model. I fine-tuned it successfully and got reasonable inference results.

The only issue I am having is that the model never stops generating tokens until it hits "max_new_tokens", and never finishes an answer with a proper "eos_token". The output itself looks correct, but extra text keeps coming until generation is cut off at the limit.

To fix this, I tried setting tokenizer.padding_side = "left" and tokenizer.truncation_side = "left", as mentioned here: wrong_padding_side. But after applying this change and continuing fine-tuning from the previous checkpoint (which was trained with padding_side = "right"), I get the following CUDA error after roughly 35 to 39 training steps:

CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, CUDA_R_32F, .

How can I solve this CUDA issue? Or, if I have to keep "padding_side" as "right", how can I make sure the model produces a complete, properly terminated answer within the "max_new_tokens" limit?
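On the stopping problem: one common cause (a sketch, not necessarily how this notebook was set up) is that the training examples never end with the tokenizer's EOS token, so the model never learns to emit it and runs until "max_new_tokens". Assuming Mistral's default "</s>" EOS string and a hypothetical formatting helper:

```python
EOS_TOKEN = "</s>"  # Mistral's default EOS; confirm with tokenizer.eos_token

def format_example(instruction, response):
    # Ending every training example with EOS teaches the model to stop;
    # without it, generation tends to run on until max_new_tokens.
    return f"[INST] {instruction} [/INST] {response}{EOS_TOKEN}"

sample = format_example("What is 2+2?", "4")
assert sample.endswith(EOS_TOKEN)
```

If the formatting function used during fine-tuning omits this terminator, adding it and retraining may fix the runaway generation without touching padding_side at all.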

Note: the modified fine-tuning seems to run fine on Kaggle; I only hit this error when fine-tuning locally on an RTX 3090 GPU.
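As background on why the linked advice recommends left padding for decoder-only models like Mistral, here is a minimal pure-Python sketch of the difference (the pad id and sequences are made up for illustration):

```python
PAD = 0  # illustrative pad token id

def pad_batch(seqs, side="left"):
    """Pad variable-length token-id lists to equal width on one side."""
    width = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        padding = [PAD] * (width - len(s))
        out.append(padding + s if side == "left" else s + padding)
    return out

batch = [[5, 6, 7], [8, 9]]
# Left padding keeps each sequence's last real token at the end of its row,
# which is the position a decoder-only model continues from when generating.
assert pad_batch(batch, "left") == [[5, 6, 7], [0, 8, 9]]
# Right padding places pad tokens after the real tokens, so the model would
# be asked to continue from a pad position instead.
assert pad_batch(batch, "right") == [[5, 6, 7], [8, 9, 0]]
```

This only affects batched generation; it does not by itself explain a cuBLAS internal error, which usually points at the CUDA/driver setup rather than the padding change.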

danielhanchen commented 5 months ago

@ml-maddi Can you screenshot Unsloth's info section?

ml-maddi commented 5 months ago

(Five screenshots of the Unsloth info section, taken 2024-04-03, attached.)

ml-maddi commented 5 months ago

(Twelve additional screenshots attached.)

danielhanchen commented 5 months ago

@ml-maddi Oh that's long! It seems like your local CUDA installation might be broken. Do non-Unsloth code paths work as expected (i.e. pure HF / TRL training scripts), or is this just an Unsloth issue?