huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Add support for Phi-3 Model #1807

Closed: ChristophRaab closed this issue 2 weeks ago

ChristophRaab commented 3 weeks ago

Model description

Hi all,

currently, microsoft/Phi-3-mini-128k-instruct is not supported by text-generation-inference, as shown by the following error:

2024-04-25T12:45:45.282234Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 201, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 648, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type phi3

The server is started with the following config:

text_generation_launcher: Args { model_id: "microsoft/Phi-3-mini-128k-instruct", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, 
max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "tgi-phi-deployment-6c75c84cf9-qsbh5", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: 
false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
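
For reference, the unrecognized type comes straight from the model's config.json. A quick way to confirm what TGI sees (a minimal sketch, assuming huggingface_hub is installed and the Hub is reachable):

import json
from huggingface_hub import hf_hub_download

# Fetch only the config file and print the declared model_type.
config_path = hf_hub_download("microsoft/Phi-3-mini-128k-instruct", "config.json")
with open(config_path) as f:
    print(json.load(f)["model_type"])  # prints "phi3", which this TGI version's get_model() does not handle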

Are there any plans to support it?

Best wishes,
Christoph

Open source status

Provide useful links for the implementation

The link to the model is https://huggingface.co/microsoft/Phi-3-mini-128k-instruct

amihalik commented 3 weeks ago

I'm able to get a bit farther if I run with a newer TGI build, e.g.:

docker run -it --rm --name tgi -p 8080:80 --gpus all --shm-size 1g  \
    ghcr.io/huggingface/text-generation-inference:sha-986b404 \
    --model-id microsoft/Phi-3-mini-128k-instruct/ \
    --trust-remote-code \
    --num-shard $(nvidia-smi -L | wc -l) 

But TGI errors out because factor isn't set. I've tried various combinations of rope-factor and rope-scaling (e.g. --rope-factor=32 --rope-scaling=dynamic), but the model generates garbage.

Has anyone gotten farther with phi-3-128k? phi-3-4k works fine using the command above.
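
One way to see why factor has to be supplied by hand is to inspect the rope_scaling block the model ships. A minimal sketch, assuming transformers is installed (the exact keys may differ between config revisions):

from transformers import AutoConfig

# Phi-3 shipped custom config code at the time, hence trust_remote_code.
cfg = AutoConfig.from_pretrained("microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True)

# Expected to show long_factor/short_factor lists and no scalar "factor" key,
# which is why TGI's rope_scaling["factor"] lookup needs the CLI overrides.
print(cfg.rope_scaling)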

RonanKMcGovern commented 3 weeks ago

> I'm able to get a bit farther if I run with a newer TGI build, e.g.:
>
> docker run -it --rm --name tgi -p 8080:80 --gpus all --shm-size 1g  \
>     ghcr.io/huggingface/text-generation-inference:sha-986b404 \
>     --model-id microsoft/Phi-3-mini-128k-instruct/ \
>     --trust-remote-code \
>     --num-shard $(nvidia-smi -L | wc -l)
>
> But TGI errors out because factor isn't set. I've tried various combinations of rope-factor and rope-scaling (e.g. --rope-factor=32 --rope-scaling=dynamic), but the model generates garbage.
>
> Has anyone gotten farther with phi-3-128k? phi-3-4k works fine using the command above.

Same issue.

nitronomic commented 2 weeks ago

Hi

I get the same error loading phi-3-128k with the latest Docker image:

Status: Downloaded newer image for ghcr.io/huggingface/text-generation-inference:latest
2024-04-30T11:03:01.808284Z  INFO text_generation_launcher: Args {
    model_id: "/home/nitro/models//microsoft_Phi-3-mini-128k-instruct",
    revision: None,
    validation_workers: 15,
    sharded: None,
    num_shard: Some(
        2,
    ),
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: true,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(
        57344,
    ),
    max_total_tokens: Some(
        65536,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: Some(
        57344,
    ),
    max_batch_total_tokens: Some(
        65536,
    ),
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: Some(
        [
            1,
            2,
            4,
            8,
            16,
            32,
        ],
    ),
    hostname: "0.0.0.0",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 0.99,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
}
2024-04-30T11:03:01.808396Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/home/nitro/models//microsoft_Phi-3-mini-128k-instruct` do not contain malicious code.
2024-04-30T11:03:01.808403Z  INFO text_generation_launcher: Sharding model on 2 processes
2024-04-30T11:03:01.808519Z  INFO download: text_generation_launcher: Starting download process.
2024-04-30T11:03:05.695523Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-04-30T11:03:06.315199Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-30T11:03:06.315540Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-30T11:03:06.315622Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-04-30T11:03:12.513292Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 253, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 217, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 333, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
    model = FlashLlamaForCausalLM(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 385, in __init__
    self.model = FlashLlamaModel(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 309, in __init__
    [
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 310, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 249, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 107, in __init__
    self.rotary_emb = PositionRotaryEmbedding.static(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 1032, in static
    scaling_factor = rope_scaling["factor"]
KeyError: 'factor'
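
The failing lookup is the scaling_factor = rope_scaling["factor"] line in layers.py shown above; the 128k config apparently declares its scaling with factor lists rather than a single scalar, so the key is simply absent. A minimal reproduction of just that lookup (the dict below is a placeholder, not the real config values):

# Placeholder resembling the shape of Phi-3-128k's rope_scaling entry (assumed, values invented).
rope_scaling = {"type": "su", "short_factor": [1.05, 1.10], "long_factor": [1.27, 1.30]}

try:
    scaling_factor = rope_scaling["factor"]  # the lookup TGI performs in PositionRotaryEmbedding.static
except KeyError as err:
    print(f"KeyError: {err}")  # -> KeyError: 'factor', matching the traceback above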

Thank you for all the work on TGI

ChristophRaab commented 2 weeks ago

I am able to run the model on TGI 2.0.2 with the following command:

text-generation-launcher --model-id=microsoft/Phi-3-mini-128k-instruct --port=80  --trust-remote-code --rope-factor=32  --rope-scaling=dynamic

However, I receive the following warning:

2024-05-02T10:09:32.001826Z  WARN text_generation_router: router/src/main.rs:266: Could not parse config Error("unknown variant `phi3`, expected one of `llava_next`, `clip_vision_model`, `mistral`, `idefics`, `idefics2`, `ssm`, `gpt_bigcode`, `santacoder`, `bloom`, `mpt`, `gpt_neox`, `phi`, `phi-msft`, `llama`, `baichuan`, `gemma`, `cohere`, `drbx`, `falcon`, `mixtral`, `starcoder2`, `qwen2`, `opt`, `t5`", line: 19, column: 22) 

@Narsil, since you added support for phi3, the above warning may be of interest to you.
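
For anyone reproducing this: once the launcher above is up, a short client call is enough to sanity-check that generations come back. A minimal sketch, assuming huggingface_hub is installed and the server is listening on localhost:80 as in the command above:

from huggingface_hub import InferenceClient

# Point the client at the locally running TGI instance (port 80 per the launcher flags above).
client = InferenceClient("http://localhost:80")

# Short generation as a smoke test; long-context behaviour should be checked separately,
# since the rope overrides above are a workaround rather than the model's native scaling.
print(client.text_generation("Explain rotary position embeddings in one sentence.", max_new_tokens=64))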