Open · singh-git10 opened 1 month ago
I was able to reproduce this. It seems the Phi model's config changed slightly, which caused TGI to load the weights incorrectly. For reference, the number of heads changed:
latest:
{
  "num_attention_heads": 40,
  "num_key_value_heads": 10
}
previous:
{
  "num_attention_heads": 32,
  "num_key_value_heads": 32
}
This causes this line to evaluate to true, https://github.com/huggingface/text-generation-inference/blob/612bc483b6f502991803[…]eneration_server/models/custom_modeling/flash_llama_modeling.py (when it should follow a different path).
tl;dr: this PR https://github.com/huggingface/text-generation-inference/pull/1975 should resolve the attention loading issue.
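For context, the check in question boils down to comparing the two head counts; a minimal, runnable illustration (the function name here is hypothetical, not TGI's actual code):

# Illustrative sketch only; TGI's real branch lives in flash_llama_modeling.py.
from types import SimpleNamespace

def picks_gqa_path(config) -> bool:
    # With the new config (40 vs 10) this evaluates to True, flipping which
    # weight-loading path is taken; with the old config (32 vs 32) it is False.
    return config.num_attention_heads != config.num_key_value_heads

print(picks_gqa_path(SimpleNamespace(num_attention_heads=40, num_key_value_heads=10)))  # True
print(picks_gqa_path(SimpleNamespace(num_attention_heads=32, num_key_value_heads=32)))  # False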
Thanks! Can you please also provide an update for microsoft/Phi-3-small?
Were you able to successfully generate responses after these changes? If so, what were your generation params? With your suggestions I was able to get the model to load and send inference requests, but I only seem to get gibberish output. I'm wondering if it's my generation params, or if something is still not right on the model side. For example, if I send this request:
curl -s -X 'POST' 'http://localhost:8080/generate' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
"inputs": "<|user|>\nWhat is the modern day capital of Egypt?<|end|>\n<|assistant|>",
"parameters": {
"max_new_tokens": 20,
"temperature": 0.1,
"stop": [
"<|end|>"
]
}
}' | jq .generated_text
I get a garbage response back like:
"dndndndndndndn Vladdndndn Vladdn Vlad Vlad Vlad Vlad Vlad Vlad Vlad"
(This was using microsoft/Phi-3-medium-128k-instruct, BTW)
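In case it helps rule out client-side issues, the same request can be sent through huggingface_hub's InferenceClient (a sketch; it assumes the same local TGI endpoint as the curl above):

# Same request as the curl above, using huggingface_hub against the local TGI server.
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080")
out = client.text_generation(
    "<|user|>\nWhat is the modern day capital of Egypt?<|end|>\n<|assistant|>",
    max_new_tokens=20,
    temperature=0.1,
    stop_sequences=["<|end|>"],
)
print(out)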
I'm getting a similar error: ValueError: Unsupported model type phi3small. I've downloaded the latest Docker image as of 02.06.24.
I've tried changing "num_attention_heads": 32, "num_hidden_layers": 32 to "num_attention_heads": 40, "num_hidden_layers": 10, but it didn't help.
Facing the same problem as you are. No matter what request I send, I only seem to get garbage responses with the text "Vlad", "sten", "cu", and other random gibberish.
I'm getting the same issue as you. No matter what request I send, I only get garbage responses with the text "Vlad", "sten", "cu", and other random words.
Unfortunately I've not figured out a solution yet either, though I have a suspicion it's related to the RoPE scaling.
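If anyone wants to verify that theory, the scaling settings can be read straight off the Hub config (a sketch; the exact contents are whatever the checkpoint currently ships with):

# Print the RoPE scaling block from the 128k config to compare against what TGI supports.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("microsoft/Phi-3-medium-128k-instruct", trust_remote_code=True)
print(cfg.rope_scaling)            # the 128k variants ship a long/short-factor scaling scheme
print(cfg.max_position_embeddings)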
For medium you can try: https://github.com/huggingface/text-generation-inference/pull/2039
(I still need to test 128k, but 4k works now.)
small is a different model architecture and will need its own implementation (mini and medium are quite similar to Llama).
On latest, this seems to work now as long as sharding is disabled. As soon as I enable sharding (8 shards), I get:
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 141, in forward query, kv = qkv.split( File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 921, in split return torch._VF.split_with_sizes(self, split_size, dim) RuntimeError: split_with_sizes expects split_sizes to sum exactly to 960 (input tensor's size at dimension 1), but got split_sizes=[640, 256]
@danieldk any luck in getting phi-3-small working with TGI? :(
I also have the same issue with phi-3-small. I used the latest version of the Docker image, but the error still persists. Has there been any success in resolving the issue?
System Info
Running the TGI 2.0.4 Docker image:
model=microsoft/Phi-3-small-128k-instruct
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model
model=microsoft/Phi-3-medium-128k-instruct
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model
Reproduction
Run the provided command with a TGI 2.0.4 image. Get the following error for microsoft/Phi-3-small-128k-instruct:
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 257, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
Get the following error for microsoft/Phi-3-medium-128k-instruct:
Traceback (most recent call last):
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
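For what it's worth, that q_proj error is consistent with Phi-3 checkpoints shipping a fused qkv projection rather than the Llama-style separate q/k/v weights the loader asks for; one way to check (the shard filename below is illustrative, use whichever .safetensors file you downloaded):

# List the attention weight names in a local Phi-3 checkpoint shard.
from safetensors import safe_open

with safe_open("model-00001-of-00006.safetensors", framework="pt") as f:
    attn_keys = [k for k in f.keys() if "layers.0.self_attn" in k]
print(attn_keys)
# Expect a fused 'model.layers.0.self_attn.qkv_proj.weight' rather than the
# separate q_proj / k_proj / v_proj weights the loader is looking for.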
Expected behavior
Expect the server to load and serve the model successfully.