xfalcox closed this issue 2 months ago.
Have you tried adding
"attention_bias": false
to the config.json?
I used a local volume to save the model and altered the config as described. It works (tested with image ghcr.io/huggingface/text-generation-inference:2.0.3).
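A minimal sketch of that config.json edit, assuming the model was downloaded to a local path like /models/microsoft/Phi-3-medium-128k-instruct (adjust to your volume layout) and that jq is available:
# Set attention_bias to false in the model's config.json.
# The model path below is illustrative; point it at your own download.
CONFIG=/models/microsoft/Phi-3-medium-128k-instruct/config.json
jq '.attention_bias = false' "$CONFIG" > "$CONFIG.tmp" && mv "$CONFIG.tmp" "$CONFIG"
TGI reads config.json at startup, so restart the container after the edit.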
I'm encountering this as well. I believe it arises from Granite support being added after the Phi-3 support in TGI 2.0.3. See here.
@OjoDojoJo What's your full command line? I'm running this command on an AWS g6.48xlarge:
docker run -it --rm --name tgi -p 8080:80 --gpus all --shm-size 2g \
-v /models/:/models/ ghcr.io/huggingface/text-generation-inference:2.0.3 \
--model-id /models/microsoft/Phi-3-medium-128k-instruct/ \
--hostname 0.0.0.0 --trust-remote-code --num-shard 8 \
--max-input-length=9000 --max-total-tokens=9500 \
--max-batch-prefill-tokens=9000
And I'm getting this error:
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/conda/bin/text-generation-server", line 8, in <module>
[rank1]: sys.exit(app())
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
[rank1]: server.serve(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
[rank1]: asyncio.run(
[rank1]: File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
[rank1]: return loop.run_until_complete(main)
[rank1]: File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank1]: return future.result()
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 222, in serve_inner
[rank1]: model = get_model(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 420, in get_model
[rank1]: return FlashLlama(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
[rank1]: model = FlashLlamaForCausalLM(prefix, config, weights)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 368, in __init__
[rank1]: self.model = FlashLlamaModel(prefix, config, weights)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 292, in __init__
[rank1]: [
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 293, in <listcomp>
[rank1]: FlashLlamaLayer(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 232, in __init__
[rank1]: self.self_attn = FlashLlamaAttention(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 108, in __init__
[rank1]: self.query_key_value = load_attention(config, prefix, weights)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 45, in load_attention
[rank1]: return TensorParallelColumnLinear.load_multi(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 115, in load_multi
[rank1]: weight = weights.get_multi_weights_col(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 264, in get_multi_weights_col
[rank1]: w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 264, in <listcomp>
[rank1]: w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 112, in get_sharded
[rank1]: filename, tensor_name = self.get_filename(tensor_name)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 63, in get_filename
[rank1]: raise RuntimeError(f"weight {tensor_name} does not exist")
[rank1]: RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
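For what it's worth, the traceback shows TGI's Llama loader asking for a split q_proj tensor, while Phi-3 checkpoints ship the attention weights fused into a single qkv_proj. One way to check what your local copy actually contains (the path is illustrative):
# List the layer-0 attention weight names recorded in the shard index.
# A Phi-3 checkpoint lists qkv_proj rather than separate q/k/v projections.
grep -o '"model\.layers\.0\.self_attn\.[a-z_]*\.weight"' \
    /models/microsoft/Phi-3-medium-128k-instruct/model.safetensors.index.json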
Can confirm that adding "attention_bias": false works. There's currently an open PR on HF to fix the issue; in the meantime, you can run the model by pointing --revision directly at it. Here's my full command:
docker run --gpus all --shm-size 2g -p 8080:80 \
-v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:2.0 \
--model-id microsoft/Phi-3-mini-128k-instruct \
--revision refs/pr/68 \
--trust-remote-code \
--hostname 0.0.0.0
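Once the container reports it's listening, a quick smoke test against TGI's /generate endpoint (host port 8080, per the -p 8080:80 mapping above):
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'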
I'm still getting the same issue as @amihalik, even with the attention bias fixed:
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
Not sure what causes it; I'm using pretty much the exact same docker commands.
Still fails for me with TGI 2.0, --trust-remote-code, and attention_bias set to false:
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
It's the same for us; the log tells me:
The argument 'trust_remote_code' is to be used with Auto classes. It has no effect here and is ignored.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.