turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Torch error when loading GPTQ model #413

Closed: Fuckingnameless closed this 2 months ago

Fuckingnameless commented 2 months ago

```
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
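
Side note: the suggested `CUDA_LAUNCH_BLOCKING=1` can also be set from inside Python, as long as it happens before torch initializes CUDA; with synchronous launches the stack trace points at the kernel that actually faults instead of a later API call. A minimal sketch:

```python
import os

# Must be set before CUDA is initialized, so set it before importing torch
# (or export it in the shell before launching the script).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the env var on purpose

# ... load the model and run generation as usual; the resulting traceback
# should now point at the kernel that actually raised the illegal access.
```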

Fuckingnameless commented 2 months ago

Models I tried: LoneStriker's Miqu GPTQ and Qwen GPTQ Int4. Mixtral GPTQ runs fine, which would hint at a lack of VRAM being the issue, but I have 48 GB and set the context size to 2048 with a Q4 cache.
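
Roughly what my setup looks like (the model path is a placeholder and the exact calls are from memory, so treat this as a sketch rather than an exact repro):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/miqu-1-70b-gptq"  # placeholder path
config.prepare()
config.max_seq_len = 2048                     # context size 2048

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)   # Q4 KV cache
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

# The CUDA illegal memory access shows up when running the model.
print(generator.generate_simple("Hello", settings, 32))
```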

ashleykleynhans commented 2 months ago

This is also affecting the Oobabooga Text Generation WebUI (issue 5851).

ashleykleynhans commented 2 months ago

In Oobabooga, the model actually loads without any issues - the error is raised when trying to do inference against a GPTQ model loaded with ExLlamaV2.

turboderp commented 2 months ago

I looked into it and found a bug in one of the GPTQ kernels. I'll release a new version in a couple of days but, if possible, could you try the dev branch to see if that fixes your issue?
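
To make sure the dev build is actually the one being picked up (assuming the package exposes `__version__`; `pip show exllamav2` gives the same information), something like this can help rule out a stale install:

```python
# Sanity check: confirm which exllamav2 build the interpreter resolves to,
# rather than a leftover copy in site-packages.
import exllamav2

print(exllamav2.__version__)  # version string of the installed build (assumed attribute)
print(exllamav2.__file__)     # install location actually being imported
```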

ashleykleynhans commented 2 months ago

> I looked into it and found a bug in one of the GPTQ kernels. I'll release a new version in a couple of days but, if possible, could you try the dev branch to see if that fixes your issue?

Thanks, I'll test it tomorrow.

ashleykleynhans commented 2 months ago

Seems to still be happening on dev branch.

```
Traceback (most recent call last):
  File "/workspace/text-generation-webui/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/workspace/text-generation-webui/modules/text_generation.py", line 382, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/transformers/generation/utils.py", line 1575, in generate
    result = self._sample(
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/transformers/generation/utils.py", line 2697, in _sample
    outputs = self(
  File "/workspace/text-generation-webui/modules/exllamav2_hf.py", line 136, in __call__
    self.ex_model.forward(seq_tensor[:-1].view(1, -1), ex_cache, preprocess_only=True, loras=self.loras)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/exllamav2/model.py", line 694, in forward
    r, ls = self._forward(input_ids = input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/exllamav2/model.py", line 776, in _forward
    x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/exllamav2/attn.py", line 596, in forward
    attn_output = flash_attn_func(q_states, k_states, v_states, causal = True)
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 825, in flash_attn_func
    return FlashAttnFunc.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 507, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 51, in _flash_attn_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
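
Since CUDA errors are reported asynchronously, the fault may not actually be in flash-attn even though the trace ends in `flash_attn_cuda.fwd`. One way to narrow it down is to disable flash attention and see whether the crash persists; a sketch assuming the `no_flash_attn` config flag and a placeholder model path:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/models/miqu-1-70b-gptq"  # placeholder path
config.prepare()
config.max_seq_len = 2048
config.no_flash_attn = True   # fall back to the non-flash attention path

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

# Run a short generation here; if the illegal access persists without
# flash-attn, the faulting kernel is likely elsewhere (e.g. a GPTQ matmul),
# which would be consistent with the asynchronous error reporting.
```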

ashleykleynhans commented 2 months ago

Oh, my bad. I see the dev branch disappeared, probably due to the 0.0.19 release, so I'll test that instead.

ashleykleynhans commented 2 months ago

I can confirm that the issue is resolved in version 0.0.19, thank you!