pytorch / ao

PyTorch native quantization and sparsity for training and inference

RuntimeError: CUDA error: named symbol not found CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. #968

Open kolyan288 opened 2 months ago

kolyan288 commented 2 months ago

I tried to execute the following code:

from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "IlyaGusev/saiga_llama3_8b"
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda", quantization_config=quantization_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# compile the quantized model to get a speedup
import torchao
torchao.quantization.utils.recommended_inductor_config_setter()
quantized_model = torch.compile(quantized_model, mode="max-autotune")

output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))

And got the following error:

File ~/anaconda3/envs/LLMs/lib/python3.12/site-packages/torchao/quantization/utils.py:322, in pack_tinygemm_scales_and_zeros(scales, zeros, dtype)
    319 guard_dtype_size(scales, "scales", dtype=dtype, size=zeros.size())
    320 guard_dtype_size(zeros, "zeros", dtype=dtype)
    321 return (
--> 322     torch.cat(
    323         [
    324             scales.reshape(scales.size(0), scales.size(1), 1),
    325             zeros.reshape(zeros.size(0), zeros.size(1), 1),
    326         ],
    327         2,
    328     )
    329     .transpose(0, 1)
    330     .contiguous()
    331 )

RuntimeError: CUDA error: named symbol not found
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
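
As the message suggests, CUDA errors can be reported asynchronously; setting the launch-blocking flag before torch initializes CUDA should make the stack trace point at the actual failing op (a minimal sketch):

import os

# Must be set before the first CUDA call so kernel launches run synchronously
# and the reported stack trace matches the failing kernel.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after setting the flag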

The pack_tinygemm_scales_and_zeros function looks like this:

def pack_tinygemm_scales_and_zeros(scales, zeros, dtype=torch.bfloat16):
    guard_dtype_size(scales, "scales", dtype=dtype, size=zeros.size())
    guard_dtype_size(zeros, "zeros", dtype=dtype)
    return (
        torch.cat(
            [
                scales.reshape(scales.size(0), scales.size(1), 1),
                zeros.reshape(zeros.size(0), zeros.size(1), 1),
            ],
            2,
        )
        .transpose(0, 1)
        .contiguous()
    )
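
If the failure really is in these BF16 ops, I would expect the same reshape/cat/transpose to fail on standalone BF16 CUDA tensors as well (a minimal sketch to isolate the op, not something I have verified separately):

import torch

# Same ops as pack_tinygemm_scales_and_zeros, in isolation. If BF16 CUDA
# kernels are the problem on this GPU, this should hit the same
# "named symbol not found" error.
scales = torch.randn(8, 4, device="cuda", dtype=torch.bfloat16)
zeros = torch.randn(8, 4, device="cuda", dtype=torch.bfloat16)
packed = (
    torch.cat(
        [
            scales.reshape(scales.size(0), scales.size(1), 1),
            zeros.reshape(zeros.size(0), zeros.size(1), 1),
        ],
        2,
    )
    .transpose(0, 1)
    .contiguous()
)
print(packed.shape)  # torch.Size([4, 8, 2]) if the ops succeed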

GPU: NVIDIA GTX 1060 3GB
CUDA: 12.1
NVIDIA-SMI: 530.30.02
Driver Version: 530.30.02

System: Host: linuxhome-desktop, Kernel: 5.15.0-56-generic x86_64 (64-bit), Desktop: Cinnamon 5.6.5, Distro: Linux Mint 21.1 Vera
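
A quick way to check what PyTorch reports for this card (sketch):

import torch

# GTX 1060 reports compute capability (6, 1); is_bf16_supported() generally
# requires compute capability >= (8, 0) (Ampere).
print(torch.cuda.get_device_capability())  # e.g. (6, 1)
print(torch.cuda.is_bf16_supported())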

I attribute this error to the fact that my GPU does not support bfloat16, but what do you think?

gau-nernst commented 2 months ago

Yes, I think that is the issue. tinygemm (int4_weight_only) requires BF16. You can try another quantization method.
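
For example, int8 weight-only quantization does not go through the BF16-only tinygemm path (a minimal sketch using the same model as above; "int8_weight_only" is another quant type accepted by transformers' TorchAoConfig):

from transformers import TorchAoConfig, AutoModelForCausalLM

# int8 weight-only avoids the BF16 tinygemm kernel, so it should be a
# safer choice on pre-Ampere GPUs (untested on a GTX 1060).
quantization_config = TorchAoConfig("int8_weight_only")
quantized_model = AutoModelForCausalLM.from_pretrained(
    "IlyaGusev/saiga_llama3_8b",
    device_map="cuda",
    quantization_config=quantization_config,
)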