Closed: Fuckingnameless closed this issue 2 months ago.
Models I tried: LoneStriker's Miqu GPTQ and Qwen GPTQ int4. Mixtral GPTQ runs fine, which would hint at a lack of VRAM being the issue, but I have 48 GB and set the context size to 2048 @ Q4.
This also affects Oobabooga Text generation WebUI (issue 5851).
In Oobabooga, the model actually loads without any issues; the error is raised when running inference against a GPTQ model loaded with ExLlamaV2.
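For reference, a minimal repro sketch of what the WebUI is doing, reduced to the plain ExLlamaV2 API. This is my approximation, not the exact WebUI code: the model path is a placeholder, and the `ExLlamaV2Cache_Q4` class is my reading of the 2048 @ Q4 setting above.

```python
# Minimal sketch: load a GPTQ model through ExLlamaV2 and run one generation.
# The crash happens at the generation step, not at load time.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/workspace/models/miqu-gptq"  # placeholder path
config.prepare()
config.max_seq_len = 2048                    # context size from the report

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)  # Q4 KV cache, allocated during load
model.load_autosplit(cache)                  # auto-split across available VRAM
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

# Loading succeeds; the illegal memory access is raised from here.
print(generator.generate_simple("Hello,", settings, num_tokens=32))
```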
I looked into it and found a bug in one of the GPTQ kernels. I'll release a new version in a couple of days but, if possible, could you try the dev branch to see if that fixes your issue?
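For anyone following along: installing the dev branch is typically a matter of `pip install git+https://github.com/turboderp/exllamav2.git@dev` (repository URL assumed) or building from a local checkout of that branch.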
Thanks, I'll test it tomorrow.
Seems to still be happening on the dev branch.
```
Traceback (most recent call last):
  File "/workspace/text-generation-webui/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/workspace/text-generation-webui/modules/text_generation.py", line 382, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/transformers/generation/utils.py", line 1575, in generate
    result = self._sample(
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/transformers/generation/utils.py", line 2697, in _sample
    outputs = self(
  File "/workspace/text-generation-webui/modules/exllamav2_hf.py", line 136, in __call__
    self.ex_model.forward(seq_tensor[:-1].view(1, -1), ex_cache, preprocess_only=True, loras=self.loras)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/exllamav2/model.py", line 694, in forward
    r, ls = self._forward(input_ids = input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/exllamav2/model.py", line 776, in _forward
    x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/exllamav2/attn.py", line 596, in forward
    attn_output = flash_attn_func(q_states, k_states, v_states, causal = True)
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 825, in flash_attn_func
    return FlashAttnFunc.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 507, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
  File "/workspace/venvs/text-generation-webui/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 51, in _flash_attn_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
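As the error message notes, asynchronous reporting can make the trace point at the wrong call, so `flash_attn_func` is not necessarily the faulting kernel. A small sketch of how to force synchronous launches when reproducing; the key detail is that the variable must be set before torch initializes the CUDA context:

```python
# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the Python
# traceback points at the kernel that actually faulted.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must precede CUDA context creation

import torch  # imported only after the env var is set, so it takes effect
```

Equivalently, set it in the shell when launching, e.g. `CUDA_LAUNCH_BLOCKING=1 python server.py ...`.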
Oh, my bad, I see the `dev` branch disappeared, probably due to the `0.0.19` release, so I'll test that instead.
I can confirm that the issue is resolved in version 0.0.19, thank you!
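For anyone verifying the upgrade, a quick sanity check (assuming the package exposes `__version__` at the top level, as recent releases do):

```python
# Confirm the interpreter is importing the upgraded wheel.
import exllamav2
print(exllamav2.__version__)  # expect 0.0.19
```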