Phi-3 medium 128k instruct fails to start

xfalcox commented 3 months ago

System Info

docker pull ghcr.io/huggingface/text-generation-inference:latest
latest: Pulling from huggingface/text-generation-inference
Digest: sha256:00d7f1cf3c6fce0a48ff9f2e0451cfa60a06fb48447d60dc6034f4e69443fd3e
Status: Image is up to date for ghcr.io/huggingface/text-generation-inference:latest
ghcr.io/huggingface/text-generation-inference:latest

docker run --rm --name tgi --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=xxx -p 8080:80 -v /opt/tgi-cache:/data ghcr.io/huggingface/text-generation-inference:latest --model-id microsoft/Phi-3-medium-128k-instruct
2024-05-21T19:16:29.208699Z  INFO text_generation_launcher: Args {
    model_id: "microsoft/Phi-3-medium-128k-instruct",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "a9898b15798a",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
}
2024-05-21T19:16:29.208767Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"    
2024-05-21T19:16:31.499304Z  INFO text_generation_launcher: Model supports up to 131072 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=131122 --max-total-tokens=131072 --max-input-tokens=131071`.
2024-05-21T19:16:31.499313Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-05-21T19:16:31.499316Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-05-21T19:16:31.499317Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-05-21T19:16:31.499319Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-05-21T19:16:31.499404Z  INFO download: text_generation_launcher: Starting download process.
2024-05-21T19:16:33.562753Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-05-21T19:16:33.901453Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-05-21T19:16:33.901584Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-05-21T19:16:37.067234Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 257, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 220, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 423, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
    model = FlashLlamaForCausalLM(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 396, in __init__
    self.model = FlashLlamaModel(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 320, in __init__
    [
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 321, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 260, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 116, in __init__
    self.query_key_value = load_attention(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 51, in load_attention
    bias = config.attention_bias
  File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 263, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'Phi3Config' object has no attribute 'attention_bias'

2024-05-21T19:16:37.504771Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
  warnings.warn(
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 257, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 220, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 423, in get_model
    return FlashLlama(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
    model = FlashLlamaForCausalLM(prefix, config, weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 396, in __init__
    self.model = FlashLlamaModel(prefix, config, weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 320, in __init__
    [

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 321, in <listcomp>
    FlashLlamaLayer(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 260, in __init__
    self.self_attn = FlashLlamaAttention(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 116, in __init__
    self.query_key_value = load_attention(config, prefix, weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 51, in load_attention
    bias = config.attention_bias

  File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 263, in __getattribute__
    return super().__getattribute__(key)

AttributeError: 'Phi3Config' object has no attribute 'attention_bias'
 rank=0
2024-05-21T19:16:37.603743Z ERROR text_generation_launcher: Shard 0 failed to start
2024-05-21T19:16:37.603753Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

Information

[X] Docker
[ ] The CLI directly

Tasks

[X] An officially supported command
[ ] My own modifications

Reproduction

Try to run TGI Docker with latest phi-3
It fails to start

Expected behavior

Try to run TGI Docker with latest phi-3
It works

OjoDojoJo commented 3 months ago

Have you tried adding

"attention_bias": false

to the config.json?

I used a local volume to save the model and altered the config as described. It works (tested with image ghcr.io/huggingface/text-generation-inference:2.0.3).

ulrichkr commented 3 months ago

I encounter this as well. I believe it arises from the recent addition of Granite support after Phi-3 support in TGI 2.0.3. See here.

amihalik commented 3 months ago

@OjoDojoJo What's your full command line? I'm running this command on a aws g6.48xlarge

docker run -it --rm --name tgi -p 8080:80 --gpus all --shm-size 2g   \
     -v /models/:/models/ ghcr.io/huggingface/text-generation-inference:2.0.3   \
     --model-id /models/microsoft/Phi-3-medium-128k-instruct/     \
     --hostname 0.0.0.0         --trust-remote-code --num-shard 8     \
     --max-input-length=9000 --max-total-tokens=9500 \
     --max-batch-prefill-tokens=9000

And I'm getting this error:

[rank1]: Traceback (most recent call last):

[rank1]:   File "/opt/conda/bin/text-generation-server", line 8, in <module>
[rank1]:     sys.exit(app())

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
[rank1]:     server.serve(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
[rank1]:     asyncio.run(

[rank1]:   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
[rank1]:     return loop.run_until_complete(main)

[rank1]:   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank1]:     return future.result()

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 222, in serve_inner
[rank1]:     model = get_model(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 420, in get_model
[rank1]:     return FlashLlama(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
[rank1]:     model = FlashLlamaForCausalLM(prefix, config, weights)

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 368, in __init__
[rank1]:     self.model = FlashLlamaModel(prefix, config, weights)

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 292, in __init__
[rank1]:     [

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 293, in <listcomp>
[rank1]:     FlashLlamaLayer(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 232, in __init__
[rank1]:     self.self_attn = FlashLlamaAttention(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 108, in __init__
[rank1]:     self.query_key_value = load_attention(config, prefix, weights)

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 45, in load_attention
[rank1]:     return TensorParallelColumnLinear.load_multi(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 115, in load_multi
[rank1]:     weight = weights.get_multi_weights_col(

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 264, in get_multi_weights_col
[rank1]:     w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 264, in <listcomp>
[rank1]:     w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 112, in get_sharded
[rank1]:     filename, tensor_name = self.get_filename(tensor_name)

[rank1]:   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 63, in get_filename
[rank1]:     raise RuntimeError(f"weight {tensor_name} does not exist")

[rank1]: RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

dcbark01 commented 3 months ago

Have you tried adding

"attention_bias": false

to the config.json?

I used a local volume to save the model and altered the config as described. It works (tested with image ghcr.io/huggingface/text-generation-inference:2.0.3).

Can confirm that this works. There's currently an open PR on HF to fix the issue. In the meantime, you can run the model by directly specifying the revision. Here's my full command:

docker run --gpus all --shm-size 2g -p 8080:80 \
-v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:2.0 \
--model-id microsoft/Phi-3-mini-128k-instruct \
--revision refs/pr/68 \
--trust-remote-code \
-p 8080 \
--hostname 0.0.0.0

stefanobranco commented 3 months ago

I'm still getting the same issue as @amihalik, even with the attention bias fixed:

RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

Not sure what causes it, I'm using pretty much the exact same docker commands.

xfalcox commented 3 months ago

Still fails for me with TGI 2.0, trust remote code, attention_bias false.

RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist

pranavthombare commented 3 months ago

It is the same for us. tells me The argument 'trust_remote_code' is to be used with Auto classes. It has no effect here and is ignored.

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

huggingface / text-generation-inference