cc @ArthurZucker
Hi @JackCai1206, I ran your script but didn't encounter the error you mentioned for LlamaConfig; it ran smoothly in both cases. Can you check your PyTorch/CUDA compatibility? I have CUDA 12.2 with PyTorch 2.3 (PyTorch version (GPU?): 2.3.0+cu121 (True); Cuda compilation tools, release 12.2, V12.2.140).
When I run nvidia-smi I get | NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |, and I have installed torch 2.3.0 without a "cu" suffix, which I assume is compatible with CUDA 12?
@JackCai1206 There are two main CUDA APIs: the runtime API and the driver API. The CUDA version you posted from nvidia-smi is the driver API version; what PyTorch reports is the runtime API version, which comes from the CUDA toolkit installed automatically with pip3 install torch.
Just for confirmation, can you check the output of pip list | grep torch and torch.version.cuda? If the outputs show no CUDA dependencies and None respectively, then we have to reinstall PyTorch with CUDA dependencies.
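For example, assuming you want the CUDA 12.1 runtime build (adjust the cu121 suffix to match your driver), the wheel with bundled CUDA can be installed from the official PyTorch index:

```
pip3 install torch --index-url https://download.pytorch.org/whl/cu121
```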
Hi, thanks for the explanation! This is the output of pip list:

```
torch 2.3.0
torchaudio 2.3.0
torchvision 0.18.0
```

and the torch CUDA version:

```python
>>> import torch
>>> torch.version.cuda
'12.1'
```
also cc @gante
@JackCai1206 Oh! I see. What I found could be the reason for the error is this line in modeling_llama, since your model has (rotary_emb): LlamaRotaryEmbedding(). It forces float32, because bfloat16 loses precision on long contexts.
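For reference, the pattern looks roughly like this (a minimal sketch, not the exact transformers source; the helper name and defaults here are mine):

```python
import torch

def rotary_cos_sin(position_ids: torch.Tensor, dim: int = 64, base: float = 10000.0):
    # Simplified sketch of the float32-forcing pattern in LlamaRotaryEmbedding.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Autocast is explicitly disabled so the cos/sin tables are computed in
    # float32 even inside a bf16 autocast region: bfloat16 has too few
    # mantissa bits to represent large position indices accurately.
    with torch.autocast(device_type="cuda", enabled=False):
        freqs = torch.outer(position_ids.float(), inv_freq)  # (seq_len, dim/2)
        emb = torch.cat((freqs, freqs), dim=-1)              # (seq_len, dim)
        cos, sin = emb.cos(), emb.sin()
    return cos, sin

cos, sin = rotary_cos_sin(torch.arange(4096))
```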
If you want to use autocast, an alternative is to use the Trainer class of transformers and activate autocast through the bf16=True argument of TrainingArguments.
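A minimal sketch of that setup (model and train_dataset are placeholders for your own objects):

```python
from transformers import Trainer, TrainingArguments

# bf16=True makes Trainer run the forward/backward pass under
# torch.autocast with bfloat16, while modules like the rotary
# embedding can still opt out and stay in float32.
args = TrainingArguments(output_dir="out", bf16=True)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```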
Sounds good. Yeah, I think a warning message there could be useful.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
```
transformers 4.41.0
torch 2.3.0
GPU: NVIDIA GeForce RTX 4090, CUDA version 12.3
```
Who can help?
No response
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Expected behavior
Running the code snippet above gives me the following error
This problem does not seem to happen for a GPT-2 model: if I initialize GPT2Config instead of LlamaConfig in the commented code in the script, there is no such error.