huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Phi-3 mini 128k produces gibberish if context >4k tokens #2185

Open jphme opened 2 weeks ago

jphme commented 2 weeks ago

System Info

GPU: RTX4090

Run 2.1.0 with docker like: docker run -it --rm --gpus all --ipc=host -p 8080:80 -v /home/jp/.cache/data:/data ghcr.io/huggingface/text-generation-inference:2.1.0 --model-id microsoft/Phi-3-mini-128k-instruct --max-batch-prefill-tokens=8192 --max-total-tokens=8192 --max-input-tokens=8191 --trust-remote-code --revision bb5bf1e4001277a606e11debca0ef80323e5f824 --sharded false

Reproduction

Running Phi-3 128k (the old revision, since the new one fails; see #2172), I get good results as long as the total context (input tokens + output tokens) stays below 4096.

As soon as input + output tokens exceed 4096, Phi-3 outputs only gibberish, e.g. ,,..,,,,,,,,,,,,,,,,ß,,.s,ß,gen,gen,,,,s,,,,,,,,,,,,,,,,,,,,,,,,,,,o,,,,,,,,,,,,,,,,,,,,,,-hn,.,,,,,,,,,,und,,,,,,,,,,,,,,,,,,,,,,,s,,gen...,
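A minimal sketch of how I reproduce this against the container launched above (assuming it is listening on localhost:8080; the prompt construction is only illustrative and just needs to push the total context past 4096 tokens):

```python
import requests

# A filler prompt long enough that prompt + generation exceeds ~4096 tokens.
long_prompt = "Summarize the following text.\n" + (
    "The quick brown fox jumps over the lazy dog. " * 600
)

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": long_prompt,
        "parameters": {"max_new_tokens": 512},
    },
    timeout=600,
)
resp.raise_for_status()
# With total context > 4096, generated_text comes back as the gibberish shown above.
print(resp.json()["generated_text"])
```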

I suspect there is a bug in the rotary embedding implementation; see also #2060 and #2055.
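For context on why 4096 would be the exact breaking point: Phi-3-mini-128k uses su-scaled ("longrope") rotary embeddings with original_max_position_embeddings = 4096 and separate short_factor / long_factor rescaling arrays, so the serving code has to switch factor sets (and apply a magnitude correction) once the sequence grows past 4096. If that switch is handled incorrectly, degradation at exactly this boundary is what you would expect. The sketch below is a simplified illustration of that selection logic based on the model's config and reference modeling code, not the actual TGI implementation:

```python
import math

def pick_rope_factors(seq_len, config):
    """Return the RoPE rescaling factors and magnitude scale for a given sequence length."""
    original_max = config["original_max_position_embeddings"]  # 4096 for Phi-3-mini-128k
    max_pos = config["max_position_embeddings"]                # 131072 for the 128k variant
    if seq_len <= original_max:
        factors = config["rope_scaling"]["short_factor"]
    else:
        # Past 4096 tokens the long factors must be used instead.
        factors = config["rope_scaling"]["long_factor"]
    scale = max_pos / original_max
    # Magnitude correction applied to the rotary embeddings in the long regime.
    mscale = 1.0 if scale <= 1.0 else math.sqrt(1 + math.log(scale) / math.log(original_max))
    return factors, mscale
```

If the factor switch, the position offsets, or the mscale term is wrong for sequences beyond 4096, the attention pattern collapses and the model emits exactly this kind of repetitive noise.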

Expected behavior

Inference should also work for contexts longer than 4096 tokens.

jphme commented 2 weeks ago

With vLLM I initially got the same issue, but I was able to trace it to the FP8 KV cache (see here). Does TGI enable this by default? I didn't enable it knowingly.
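For comparison, in vLLM the FP8 KV cache is an explicit opt-in via the kv_cache_dtype engine argument (or the --kv-cache-dtype CLI flag); the default is "auto". A rough sketch of the setting that reproduced the gibberish for me there (whether TGI has an equivalent default is exactly the open question):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",
    trust_remote_code=True,
    kv_cache_dtype="fp8",  # enabling this caused the same gibberish in vLLM; "auto" fixed it
)

outputs = llm.generate(["<long prompt exceeding 4096 tokens>"], SamplingParams(max_tokens=512))
print(outputs[0].outputs[0].text)
```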