gugarosa opened this issue 3 months ago
For reference, I am starting the TGI server with the following:

```shell
model=microsoft/Phi-3-mini-4k-instruct
volume=$PWD/data

docker run --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $model \
  --trust-remote-code
```
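Once the server is up, it can be queried over HTTP. A minimal sketch of such a request, assuming TGI's standard `POST /generate` endpoint on port 8080 (the prompt template shown is illustrative, not necessarily the one the original script used):

```python
import json

API_URL = "http://127.0.0.1:8080"  # local TGI instance from the docker command above


def build_generate_request(prompt, max_new_tokens=128):
    """Build the JSON body for TGI's POST /generate endpoint."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }


payload = build_generate_request("<|user|>\nHello!<|end|>\n<|assistant|>\n")
print(json.dumps(payload))

# Sending it requires a running server, e.g.:
#   import requests
#   r = requests.post(f"{API_URL}/generate", json=payload)
#   print(r.json()["generated_text"])
```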
@nsarrazin could you please take a look at this?
Hi! Thanks for digging into this; I'll report it internally and get back to you!
Good afternoon everyone!

We know that Phi-3-mini-4k-instruct has been suffering from gibberish outputs when used with HuggingChat, and I think I have finally been able to track down where the issue is coming from. If I run the Python request from above, you will see that some gibberish is generated, something like:
However, if I deploy a local instance of TGI, change the script to use API_URL = "http://127.0.0.1:8080", and run the very same request, the generation starts to make sense.

My suspicion is that the model deployed to https://api-inference.huggingface.co/models/microsoft/Phi-3-mini-4k-instruct, which is what HuggingChat consumes, uses an older version of the code/tokenizer configuration. It was added on release day, and we made some updates after that.

Another possibility could be an issue with an older version of flash-attn (if it is being used) mishandling the sliding_window; I remember some older versions had a problem where the window was not being accurately computed. Could you please re-deploy the model or take a look at it?
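For reference on the sliding_window suspicion, here is a minimal NumPy sketch of what a correctly computed sliding-window causal mask looks like (illustrative only; flash-attn computes this inside fused kernels, and the helper name below is hypothetical):

```python
import numpy as np


def sliding_window_mask(seq_len, window):
    """Boolean attention mask: query i may attend to key j iff
    j <= i (causal) and i - j < window (within the sliding window)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)


# With window=3, token 5 attends to tokens 3..5 but not to token 2.
mask = sliding_window_mask(6, 3)
```

A bug that shifts or shrinks the window boundary would silently drop (or leak in) tokens at the edge of the window, which is consistent with the kind of degraded output described above.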
Thanks for your attention and best regards, Gustavo.