huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Cannot load microsoft/Phi-3-medium and microsoft/Phi-3-small with TGI-2.0.4 #1974

Open singh-git10 opened 1 month ago

singh-git10 commented 1 month ago

System Info

Running the TGI 2.0.4 Docker image:

model=microsoft/Phi-3-small-128k-instruct
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model

model=microsoft/Phi-3-medium-128k-instruct
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model

Reproduction

Run the provided command with a TGI 2.0.4 image. Get the following error for microsoft/Phi-3-small-128k-instruct:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 257, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 220, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 908, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type phi3small

Get the following error for microsoft/Phi-3-medium-128k-instruct:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 257, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 220, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 560, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
    model = FlashLlamaForCausalLM(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 396, in __init__
    self.model = FlashLlamaModel(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 320, in __init__
    [
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 321, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 260, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 116, in __init__
    self.query_key_value = load_attention(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 53, in load_attention
    return TensorParallelColumnLinear.load_multi(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 115, in load_multi
    weight = weights.get_multi_weights_col(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 264, in get_multi_weights_col
    w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 264, in <listcomp>
    w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 112, in get_sharded
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 63, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

Expected behavior

Expect the server to load and serve the model successfully.

drbh commented 1 month ago

I was able to reproduce this. It seems the Phi model's config changed slightly, which causes TGI to load the weights incorrectly; for reference, the number of heads changed:

latest

{
  "num_attention_heads": 40,
  "num_key_value_heads": 10,
}

previous

{
  "num_attention_heads": 32,
  "num_key_value_heads": 32,
}

This causes this line to be true, https://github.com/huggingface/text-generation-inference/blob/612bc483b6f502991803[…]eneration_server/models/custom_modeling/flash_llama_modeling.py, when it should follow a different path.

tl;dr: this PR https://github.com/huggingface/text-generation-inference/pull/1975 should resolve the attention loading issue.
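
For anyone following along, here is a self-contained toy illustration of the dispatch described above. This is not the actual TGI source; the helper name is made up, but the head-count check and the weight names mirror the config diff and the traceback: when the query and KV head counts differ, the loader looks for separate q_proj/k_proj/v_proj tensors, while the Phi-3 checkpoint only ships a fused qkv_proj tensor.

# Toy illustration only -- not the real load_attention from flash_llama_modeling.py.
def expected_attention_weights(num_attention_heads: int, num_key_value_heads: int, prefix: str):
    """Hypothetical helper: which tensors a loader with this check would look for."""
    if num_attention_heads != num_key_value_heads:
        # GQA-style path: expects separate per-projection tensors.
        return [f"{prefix}.{p}.weight" for p in ("q_proj", "k_proj", "v_proj")]
    # Fused path, matching what Phi-3 checkpoints actually contain.
    return [f"{prefix}.qkv_proj.weight"]

prefix = "model.layers.0.self_attn"
print(expected_attention_weights(32, 32, prefix))  # previous config -> fused qkv_proj
print(expected_attention_weights(40, 10, prefix))  # latest config -> q_proj etc., which don't exist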

singh-git10 commented 1 month ago

Thanks! Can you please also provide an update for microsoft/Phi-3-small?

dcbark01 commented 1 month ago

@drbh Were you able to successfully generate responses after these changes? If so, what were your generation parameters? With your suggestions I was able to get the model to load and send inference requests successfully, but I only seem to get gibberish output. I'm wondering whether it's my generation parameters or whether something is still not right on the model side. For example, if I send this request:

curl -s -X 'POST' 'http://localhost:8080/generate'   -H 'accept: application/json'   -H 'Content-Type: application/json'   -d '{
  "inputs": "<|user|>\nWhat is the modern day capital of Egypt?<|end|>\n<|assistant|>",
  "parameters": {
    "max_new_tokens": 20,
    "temperature": 0.1,
    "stop": [
      "<|end|>"
    ]
  }
}' | jq .generated_text

I get a garbage response back like:

"dndndndndndndn Vladdndndn Vladdn Vlad Vlad Vlad Vlad Vlad Vlad Vlad"

(This was using microsoft/Phi-3-medium-128k-instruct, BTW)
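
One way to separate generation parameters from model-side problems is to send a plain greedy request with no sampling parameters at all. A minimal sketch against the same /generate endpoint, using Python's requests library (endpoint and parameter names as in the TGI generate API):

# Greedy-decoding request to TGI's /generate endpoint (no temperature, no sampling),
# to help rule out generation settings as the cause of the garbled output.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "<|user|>\nWhat is the modern day capital of Egypt?<|end|>\n<|assistant|>",
        "parameters": {"max_new_tokens": 20, "stop": ["<|end|>"]},
    },
    timeout=60,
)
print(resp.json()["generated_text"])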

sklyar61 commented 1 month ago

I'm getting a similar error: ValueError: Unsupported model type phi3small. I've downloaded the latest Docker image as of 02.06.24.

I've tried changing "num_attention_heads": 32, "num_hidden_layers": 32, to "num_attention_heads": 40, "num_hidden_layers": 10, but it didn't help.
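
In case it helps with debugging, the config that the Hub checkpoints actually ship (as opposed to a locally edited copy) can be inspected with transformers. A minimal sketch, assuming a transformers release that already knows the phi3 model type:

# Sketch for inspecting the Hub configs (assumes a recent transformers with "phi3" support).
from transformers import AutoConfig

medium = AutoConfig.from_pretrained("microsoft/Phi-3-medium-128k-instruct")
print(medium.model_type, medium.num_attention_heads, medium.num_key_value_heads)

# Phi-3-small ships custom modeling code, so its config needs trust_remote_code=True.
small = AutoConfig.from_pretrained("microsoft/Phi-3-small-128k-instruct", trust_remote_code=True)
print(small.model_type)  # "phi3small" -- the model type TGI rejects above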

IMeS-GH commented 1 month ago

@dcbark01 I'm facing the same problem as you. No matter what request I send, I only seem to get garbage responses containing "Vlad", "sten", "cu", and other random gibberish.

dcbark01 commented 1 month ago

I'm getting the same issue as you. No matter what request I send, I only get garbage responses with the text "Vlad", "sten", "cu", and other random words.

Unfortunately I haven't figured out a solution yet either, though I suspect it's related to the RoPE scaling.

danieldk commented 1 month ago

For medium you can try: https://github.com/huggingface/text-generation-inference/pull/2039

(I still need to test 128k, but 4k works now.)

Phi-3-small is a different model architecture and will need its own implementation (mini and medium are quite similar to Llama).

stefanobranco commented 3 weeks ago

On the latest version, this seems to work now as long as sharding is disabled. As soon as I enable sharding (8 shards), I get:

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 141, in forward query, kv = qkv.split( File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 921, in split return torch._VF.split_with_sizes(self, split_size, dim) RuntimeError: split_with_sizes expects split_sizes to sum exactly to 960 (input tensor's size at dimension 1), but got split_sizes=[640, 256]

aveerago commented 2 weeks ago

@danieldk any luck in getting phi-3-small working with TGI? :(

farzanehnakhaee70 commented 1 week ago

I also have the same issue with phi-3-small. I'm using the latest version of the Docker image, but the error still persists. Has there been any progress on resolving this?