unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Exception: CUDA error: an illegal memory access was encountered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. #1055

Open vhiwase opened 1 month ago

vhiwase commented 1 month ago

I attempted to serve the original 4-bit base model of Llama 3.1, both with and without load_in_4bit set. Below are my observations.

When load_in_4bit = True: The model throws the following error:

Exception: CUDA error: an illegal memory access was encountered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

However, this behavior does not occur immediately; it happens only after the model has processed some initial data. The model also consumes 8 GB of memory.

Code:

from unsloth import FastLanguageModel

max_seq_length = 4200
dtype = None  # Auto-detection; Float16 for Tesla T4, V100; Bfloat16 for Ampere+
load_in_4bit = True  # Use 4-bit quantization to reduce memory usage.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    # token="hf_..."  # Required for gated models like Meta-Llama/Llama-2-7b-hf
)
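
A minimal debugging sketch for the traceback's suggestion, assuming the same settings as above: CUDA_LAUNCH_BLOCKING forces synchronous kernel launches so the reported stack trace points at the kernel that actually fails. The environment variable must be set before torch / unsloth are imported to take effect.

import os

# Must be set before importing torch / unsloth, otherwise it has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4200,
    dtype=None,
    load_in_4bit=True,
)
# Re-run the failing inference here; the illegal-access error should now
# surface at the offending launch rather than at a later API call.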

When load_in_4bit = False: The model runs without errors and uses around 16 GB of memory.

Code:

from unsloth import FastLanguageModel

max_seq_length = 4200
dtype = None  # Auto-detection; Float16 for Tesla T4, V100; Bfloat16 for Ampere+
load_in_4bit = False  # Disable 4-bit quantization.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    # token="hf_..."  # Required for gated models like Meta-Llama/Llama-2-7b-hf
)
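
As a quick sanity check of either configuration, a short generation plus a memory readout makes the ~8 GB vs ~16 GB comparison explicit. A sketch, assuming model and tokenizer come from the snippet above:

import torch
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # enable Unsloth's native inference path

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"GPU memory reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")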

Based on these findings, it seems that if we trained with load_in_4bit = True, the same issue would persist in our fine-tuned model, as it is inherent to the base model.

I recommend that we train this model again with load_in_4bit = True.

danielhanchen commented 1 month ago

@vhiwase Apologies for the delay! Would you happen to know what dataset you were using? It's possible there are some weird out-of-bounds tokens causing errors.
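
One way to check for such tokens is to compare every token ID in the inference data against the size of the model's embedding table, since an out-of-range index in an embedding lookup is a classic trigger for illegal memory accesses on CUDA. A rough sketch, assuming model and tokenizer from the snippets above; find_out_of_bounds_tokens is a hypothetical helper name:

def find_out_of_bounds_tokens(texts, tokenizer, model):
    # Any token ID >= vocab_size (or negative) would index past the
    # embedding table and can cause an illegal memory access on the GPU.
    vocab_size = model.get_input_embeddings().weight.shape[0]
    bad = []
    for i, text in enumerate(texts):
        ids = tokenizer(text, add_special_tokens=True)["input_ids"]
        out_of_range = [t for t in ids if t < 0 or t >= vocab_size]
        if out_of_range:
            bad.append((i, out_of_range))
    return bad

# Usage (replace with the real inference chunks):
# print(find_out_of_bounds_tokens(ocr_chunks, tokenizer, model))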

vhiwase commented 4 weeks ago

@danielhanchen Apologies for the delay in responding. I'm currently testing the model with results obtained from OCR processing using Azure Document Intelligence. The inputs consist of random chunks of text extracted from various documents.
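
Since the inputs are arbitrary OCR chunks, it may also be worth confirming that none of them tokenize to more tokens than the max_seq_length used when loading the model. A small sketch, assuming tokenizer from above and ocr_chunks as a placeholder for the real list of chunks:

max_seq_length = 4200  # same value passed to FastLanguageModel.from_pretrained

def chunks_over_limit(chunks, tokenizer, limit=max_seq_length):
    # Return (index, token_count) for every chunk longer than the limit.
    over = []
    for i, chunk in enumerate(chunks):
        n_tokens = len(tokenizer(chunk)["input_ids"])
        if n_tokens > limit:
            over.append((i, n_tokens))
    return over

# print(chunks_over_limit(ocr_chunks, tokenizer))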

danielhanchen commented 3 weeks ago

@vhiwase No worries! Does this happen on other machines? Like in a Colab?

vhiwase commented 2 weeks ago

@danielhanchen You are correct that we trained the model on Amazon EC2 G6 Instances, and inference works fine there. However, we hosted the model inference on a different machine, specifically Amazon EC2 G6e Instances. Could this be related to the dtype setting? (A quick dtype check is sketched after the note below.)

dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+.

Note:

G6 Instances: Feature up to 8 NVIDIA L4 Tensor Core GPUs with 24 GB of memory per GPU, and third generation AMD EPYC processors.

G6e Instances: Feature up to 8 NVIDIA L40S Tensor Core GPUs with 384 GB of total GPU memory (48 GB per GPU), and third generation AMD EPYC processors.
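
Both the L4 (G6) and the L40S (G6e) are Ada-generation GPUs, so auto-detection should pick bfloat16 on either machine, but pinning the dtype explicitly removes that variable when comparing the two hosts. A sketch, not from the original thread, assuming the same model settings as above:

import torch
from unsloth import FastLanguageModel

print("device:", torch.cuda.get_device_name(0))
print("bf16 supported:", torch.cuda.is_bf16_supported())

# Pin the dtype instead of relying on auto-detection (dtype=None).
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4200,
    dtype=dtype,
    load_in_4bit=True,
)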