Closed: pauli31 closed this issue 3 months ago.
Can replicate with bitsandbytes-nf4 quantization. However, it works as expected with AWQ quantization. The bug seems to be specific to the bitsandbytes implementation.
Is there any update on this issue?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
System Info
Hi, if I run TGI in Docker on an NVIDIA RTX 4070 (12 GB memory), I'm getting
CUDA error: device-side assert triggered
for the specific request (at the bottom). I get the error when I set the max sequence length to 6k, while with a max sequence length of 8k the error does not occur. It happens for a long input (shorter inputs work fine). The maximum sequence length of the model openchat/openchat-3.5-0106 is 8k, but I would like to restrict it to 6k.
Information
Tasks
Reproduction
Docker run with which the error occurs:
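The exact command from the report is not included here; purely as a hypothetical sketch, a launch matching the description (sequence length capped at 6k, nf4 quantization as noted in the comment above) could look roughly like this, assuming the standard TGI Docker image and the --max-input-length / --max-total-tokens flags:

# Illustrative sketch only, not the reporter's exact command
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id openchat/openchat-3.5-0106 \
    --quantize bitsandbytes-nf4 \
    --max-input-length 5000 \
    --max-total-tokens 6000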
Docker run with which the error does not occur:
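Again only as a hypothetical sketch, the same launch with the full 8k sequence length, which per the report does not trigger the assert:

# Illustrative sketch only, not the reporter's exact command
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id openchat/openchat-3.5-0106 \
    --quantize bitsandbytes-nf4 \
    --max-input-length 7000 \
    --max-total-tokens 8000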
Error I'm getting:
The input used in Swagger at the generate endpoint:
http://localhost:8080/docs/#/Text%20Generation%20Inference/generate
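The long prompt itself is not reproduced here; a hypothetical request of the same shape against the generate endpoint (the inputs value is only a placeholder for the actual long prompt) would be:

# Illustrative request shape only; the real prompt from the report is much longer
curl -X POST http://localhost:8080/generate \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "<long prompt close to the 6k limit>", "parameters": {"max_new_tokens": 200}}'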
Expected behavior
It should work even with the reduced maximum sequence length of 6k.