Running TGI 2.1.0 with Docker like:
docker run -it --rm --gpus all --ipc=host -p 8080:80 -v /home/jp/.cache/data:/data ghcr.io/huggingface/text-generation-inference:2.1.0 --model-id microsoft/Phi-3-mini-128k-instruct --max-batch-prefill-tokens=8192 --max-total-tokens=8192 --max-input-tokens=8191 --trust-remote-code --revision bb5bf1e4001277a606e11debca0ef80323e5f824 --sharded false
Information
[X] Docker
[ ] The CLI directly
Tasks
[X] An officially supported command
[ ] My own modifications
Reproduction
Running Phi-3 128k (using the old revision, since the new one fails - see #2172), I get good results as long as the total context (input tokens + output tokens) stays below 4096.
As soon as input + output tokens exceed 4096, Phi-3 outputs only gibberish, e.g.
,,..,,,,,,,,,,,,,,,,ß,,.s,ß,gen,gen,,,,s,,,,,,,,,,,,,,,,,,,,,,,,,,,o,,,,,,,,,,,,,,,,,,,,,,-hn,.,,,,,,,,,,und,,,,,,,,,,,,,,,,,,,,,,,s,,gen...,
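To reproduce against the container above, any request whose input length plus `max_new_tokens` exceeds 4096 triggers the problem. A minimal sketch against TGI's `/generate` endpoint (the prompt is a placeholder; port 8080 matches the `docker run` mapping above):

```python
import json
import urllib.request

# Placeholder long prompt; the point is only that input + max_new_tokens > 4096.
prompt = "Summarize the following text: " + "lorem ipsum " * 700

payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 2048},  # pushes total context past 4096
}
req = urllib.request.Request(
    "http://localhost:8080/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, the response degrades into repeated commas:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["generated_text"])
```

With the same prompt but `max_new_tokens` small enough to keep the total under 4096, the output is coherent.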
I suspect a bug in the rotary embedding implementation; see also #2060 and #2055.
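The 4096 boundary is suspicious because Phi-3-mini-128k uses "longrope" scaling with `original_max_position_embeddings=4096`: below that length the rotary frequencies are scaled by the config's `short_factor` list, above it by `long_factor`. A backend that keeps applying the short factors past 4096 would rotate queries/keys with the wrong angles exactly where the gibberish starts. A minimal sketch of that selection logic (the factor values below are illustrative, not the real config values):

```python
def longrope_inv_freq(seq_len, dim=8, base=10000.0,
                      original_max_pos=4096,
                      short_factor=None, long_factor=None):
    """Pick per-dimension rotary frequency scalers by sequence length,
    mirroring the 'longrope' scheme: short factors up to the original
    training length, long factors beyond it."""
    # Illustrative scalers; the real model ships lists of length dim/2
    # in config.json (rope_scaling.short_factor / long_factor).
    short_factor = short_factor or [1.0] * (dim // 2)
    long_factor = long_factor or [4.0] * (dim // 2)
    factors = short_factor if seq_len <= original_max_pos else long_factor
    return [1.0 / (f * base ** (2 * i / dim))
            for i, f in enumerate(factors)]

# At or below the boundary the short factors apply...
short = longrope_inv_freq(4096)
# ...one token past it the long factors apply, so the rotation angles change.
long = longrope_inv_freq(4097)
```

If TGI mishandles this switch (or the associated attention-scaling factor), it would explain why outputs are fine below 4096 and garbage above it.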
With vLLM I initially got the same issue, but was able to trace it to the FP8 KV cache (see here). Does TGI enable this by default? I didn't enable it knowingly.
System Info
GPU: RTX4090
Expected behavior
Inference works for longer contexts.