qwopqwop200 / GPTQ-for-LLaMa

4 bits quantization of LLaMA using GPTQ
Apache License 2.0

"CUDA Error: No kernel image is available" #151

Open Yona-W opened 1 year ago

Yona-W commented 1 year ago

My configuration is as follows:

I get the following output when running the benchmark:

Benchmarking LLaMa-7B FC2 matvec ...
FP16: 0.0007498373985290527
2bit: 2.6212453842163085e-05
Traceback (most recent call last):
  File "/app/repositories/GPTQ-for-LLaMa/test_kernel.py", line 51, in <module>
    mat = torch.randint(-1000000000, 1000000000, (M // 32 * 3, N), device=DEV, dtype=torch.int)
RuntimeError: CUDA error: no kernel image is available for execution on the device
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I get a similar output (obviously with a different stack trace) when trying to run inference on the model. Everything loads correctly, the error only happens when something is evaluated.
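For what it's worth, "no kernel image is available" usually means the Torch build doesn't ship compiled kernels for the GPU's compute capability. A quick sketch of the check (in a live environment the two values would come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`; the values below are illustrative, not any real build's list):

```python
def arch_supported(capability, arch_list):
    """True if a compute capability (major, minor) has a matching sm_XY kernel."""
    sm = f"sm_{capability[0]}{capability[1]}"
    return sm in arch_list

# Illustrative: a 1080 Ti is compute capability 6.1 (Pascal). If the
# build only ships Volta-and-newer kernels, every kernel launch fails
# with "no kernel image is available for execution on the device".
pascal_1080ti = (6, 1)
volta_and_up = ["sm_70", "sm_75", "sm_80", "sm_86"]
print(arch_supported(pascal_1080ti, volta_and_up))  # False -> arch mismatch
```

If that check comes back False for your card, the fix is a Torch build (or kernel compile) that includes your architecture, not anything in the model itself.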

mxbi commented 1 year ago

I got the same thing today with an NVIDIA A100. Did you ever figure it out?

Yona-W commented 1 year ago

Ah, I totally forgot I had opened an issue.

For my situation, I figured out that Torch 2.0 has problems specifically with the 1080 Ti, and modifying the Containerfile to use Torch < 2.0 solved it for me.

With an A100 though, I'm not sure what could be causing it. I doubt Torch would have issues with pretty much the most popular ML card. If you're using the same Containerfile, I guess make sure the correct CUDA architecture is listed in the defines?
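A hypothetical sketch of the Containerfile/build-step change described above (package versions and the 6.1 value are assumptions for a 1080 Ti; `TORCH_CUDA_ARCH_LIST` is the standard env var Torch's extension builder honors, and `setup_cuda.py` is this repo's kernel build script):

```shell
# Pin Torch below 2.0, then compile this repo's CUDA kernels
# explicitly for the card's architecture.
pip install "torch<2.0"
# 6.1 = GTX 1080 Ti (Pascal); an A100 would be 8.0.
export TORCH_CUDA_ARCH_LIST="6.1"
python setup_cuda.py install
```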

dougbtv commented 1 year ago

I'm also using podman (on Fedora 38) and running into this; I have an issue filed @ https://github.com/oobabooga/text-generation-webui/issues/2002

Thanks for the link to RedTopper/Text-Generation-Webui-Podman, I hadn't been using that.

The version of GPTQ-for-LLaMa used in text-generation-webui is a fork of this repo @ https://github.com/oobabooga/GPTQ-for-LLaMa, but I came here looking as well.

In my case, I ran into the error here, with:

  File "/app/repositories/GPTQ-for-LLaMa/quant.py", line 431, in forward
    y = y.to(output_dtype)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

(Full stack trace in linked issue)
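Since the error message itself warns that CUDA errors are reported asynchronously, it may be worth re-running with synchronous launches so the trace points at the kernel that actually failed. A sketch (the entry point and flags here are illustrative, substitute your own):

```shell
# Synchronous launches: the Python stack trace now stops at the
# real failing kernel instead of a later, unrelated API call.
CUDA_LAUNCH_BLOCKING=1 python server.py --model llama-7b-4bit
```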