Closed: Cyberes closed this issue 8 months ago.
This is a continuation of https://github.com/huggingface/text-generation-inference/issues/929.
After looking at the `HeterogeneousTypicalLogitsWarper` class, I set `typical_p` to a lower value (0.998 instead of 0.999) and the server did not crash. My theory is that when `typical_p` is very close to 1, the warper tries to keep only the most typical tokens, but there may not be enough tokens left to sample from, especially if `max_new_tokens` is large. Thus, the tensor `gather()` fails.
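For reference, here is a minimal sketch of the typical-p filtering step, modeled on the `TypicalLogitsWarper` in transformers that TGI's `HeterogeneousTypicalLogitsWarper` is based on (this is not TGI's exact code). It shows where a `gather()` index could run one past the end of the vocabulary when `typical_p` is nearly 1; the clamp line marks the kind of guard whose absence would explain the device-side assert:

```python
import torch

def typical_filter(logits: torch.Tensor, typical_p: float) -> torch.Tensor:
    """Simplified typical-p warping over (batch, vocab) logits."""
    log_probs = torch.log_softmax(logits, dim=-1)
    # Entropy of the predicted distribution, per batch row.
    entropy = -(log_probs.exp() * log_probs).sum(-1, keepdim=True)
    # "Typicality" score: distance of each token's surprisal from the entropy.
    shifted = (-log_probs - entropy).abs()
    sorted_scores, sorted_indices = torch.sort(shifted, descending=False)
    sorted_logits = logits.gather(-1, sorted_indices)
    cum_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
    # Number of tokens kept. If typical_p is ~1.0, floating-point rounding in
    # the cumsum can leave every entry below typical_p, so last_ind equals
    # vocab_size, i.e. one past the end of the tensor.
    last_ind = (cum_probs < typical_p).sum(dim=-1, keepdim=True)
    # Without this clamp, the gather below trips a device-side assert on CUDA.
    last_ind = last_ind.clamp(max=logits.shape[-1] - 1)
    cutoff = sorted_scores.gather(-1, last_ind)
    return logits.masked_fill(shifted > cutoff, float("-inf"))
```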
So, I guess keep `typical_p` below 0.998 and everything works?
The main concern is that once the server hits this runtime error, all subsequent requests also trigger this error (even if the parameters are acceptable). Could the server maybe gracefully handle this so that it doesn't need to be restarted whenever a bad parameter is sent?
This would be super nice indeed.
However, a device-side assert will always corrupt the current process, rendering it unrecoverable. The best thing we could do is prevent it from happening altogether.
That will likely involve doing some check on the tensor. As long as the check stays on the CPU (like counting `numel()`), it would be very easy to do. Anything that hits the GPU would slow everything down in the cases where things DO work, which we desperately want to avoid.
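To illustrate the trade-off (a hypothetical helper, not TGI code): metadata reads like `numel()` and `shape` never touch the device, whereas reading tensor values forces a device-to-host sync on every request.

```python
import torch

def validate_warped_logits(logits: torch.Tensor) -> None:
    # Cheap: numel()/shape are tensor metadata held on the host, so this
    # check never touches the GPU.
    if logits.numel() == 0:
        raise ValueError("warper removed every candidate token")
    # Expensive (shown for contrast, deliberately commented out): reading
    # tensor *values* synchronizes the GPU on every request, which is
    # exactly the overhead to avoid on the path where everything works.
    # if torch.isinf(logits).all(dim=-1).any().item():
    #     raise ValueError("a batch row has no finite logits left")
```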
Unfortunately, I don't have access to an A4000 and cannot reproduce this on the cards I do have access to. It could also be a bug in Triton here (since this model uses Triton and not exllama).
Vast.ai has good prices, I think $5 will get you over a day of A4000 runtime. Maybe HF could sponsor?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
keep open
Any update? Something similar is happening to me in https://github.com/huggingface/text-generation-inference/issues/1721
@pauli31 lol no. I switched to VLLM https://github.com/vllm-project/vllm
Is there any update for this issue?
Devs aren't interested in this issue
System Info

Version: ghcr.io/huggingface/text-generation-inference:latest
GPU: NVIDIA A4000
OS: Ubuntu 22.04

Reproduction
I've noticed that if it doesn't crash on the first such request, it will crash on the second. I spun up an A4000 with the TGI Docker container on Vast.ai and got the same error, so it isn't just my machine. Send a request like the sketch below and watch it crash.
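A hedged reconstruction of the kind of request that triggers it, based on the discussion above (`typical_p` very close to 1). The prompt is arbitrary and the URL assumes a default local TGI Docker run; adjust both for your deployment:

```python
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Once upon a time",
        "parameters": {
            "do_sample": True,
            "typical_p": 0.999,   # the value reported to crash the server
            "max_new_tokens": 512,
        },
    },
    timeout=120,
)
print(resp.status_code, resp.text)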
Logs
Here's the CUDA_LAUNCH_BLOCKING=1 debug:
Command Output
```
2023-08-30T17:15:17.762274Z ERROR text_generation_launcher: Method Decode encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
```