Closed: ChristophRaab closed this issue 6 months ago
I'm able to get a bit farther if I run with a newer TGI build, eg:
docker run -it --rm --name tgi -p 8080:80 --gpus all --shm-size 1g \
ghcr.io/huggingface/text-generation-inference:sha-986b404 \
--model-id microsoft/Phi-3-mini-128k-instruct/ \
--trust-remote-code \
--num-shard $(nvidia-smi -L | wc -l)
But TGI errors out because `factor` isn't set. I've tried various combinations of `rope-factor` and `rope-scaling` (e.g. `--rope-factor=32 --rope-scaling=dynamic`), but the model generates garbage.
Has anyone gotten farther with phi-3-128k? phi-3-4k works fine using the command above.
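For context, a quick way to see what the model actually ships for rope scaling is to dump its config. A minimal sketch, assuming `transformers` is installed and the Hub model is reachable; the comment about the expected contents is my guess, not verified output:

```python
# Minimal sketch: inspect the rope_scaling block that TGI reads at startup.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True
)
print(config.rope_scaling)
# I would expect per-dimension "long_factor"/"short_factor" lists rather than a
# single scalar "factor" entry, which would explain why TGI complains that
# factor isn't set for the 128k variant while the 4k variant loads fine.
```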
Same issue.
Hi
I get the same error loading phi-3-128k on the latest docker:
Status: Downloaded newer image for ghcr.io/huggingface/text-generation-inference:latest
2024-04-30T11:03:01.808284Z INFO text_generation_launcher: Args {
model_id: "/home/nitro/models//microsoft_Phi-3-mini-128k-instruct",
revision: None,
validation_workers: 15,
sharded: None,
num_shard: Some(
2,
),
quantize: None,
speculate: None,
dtype: None,
trust_remote_code: true,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: Some(
57344,
),
max_total_tokens: Some(
65536,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: Some(
57344,
),
max_batch_total_tokens: Some(
65536,
),
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: Some(
[
1,
2,
4,
8,
16,
32,
],
),
hostname: "0.0.0.0",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/data",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 0.99,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
}
2024-04-30T11:03:01.808396Z WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/home/nitro/models//microsoft_Phi-3-mini-128k-instruct` do not contain malicious code.
2024-04-30T11:03:01.808403Z INFO text_generation_launcher: Sharding model on 2 processes
2024-04-30T11:03:01.808519Z INFO download: text_generation_launcher: Starting download process.
2024-04-30T11:03:05.695523Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-04-30T11:03:06.315199Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-30T11:03:06.315540Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-30T11:03:06.315622Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-04-30T11:03:12.513292Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 253, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 217, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 333, in get_model
return FlashLlama(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
model = FlashLlamaForCausalLM(prefix, config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 385, in __init__
self.model = FlashLlamaModel(prefix, config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 309, in __init__
[
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 310, in <listcomp>
FlashLlamaLayer(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 249, in __init__
self.self_attn = FlashLlamaAttention(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 107, in __init__
self.rotary_emb = PositionRotaryEmbedding.static(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 1032, in static
scaling_factor = rope_scaling["factor"]
KeyError: 'factor'
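The failing line is a plain dictionary lookup, so any `rope_scaling` block without a scalar `factor` key will crash here. Just to illustrate the shape mismatch, a hypothetical config excerpt with made-up values (this is not the actual TGI fix):

```python
# Hypothetical excerpt of the 128k model's rope_scaling block (values made up):
rope_scaling = {"type": "su", "long_factor": [1.0, 1.5], "short_factor": [1.0, 1.1]}

# This is what trips TGI: a direct lookup of a key this config never defines.
try:
    scaling_factor = rope_scaling["factor"]
except KeyError:
    # A su/longrope-style config only carries per-dimension factor lists, so
    # proper support would need to branch on rope_scaling["type"] instead of
    # assuming a single scalar factor exists.
    scaling_factor = None

print(scaling_factor)  # -> None for this config shape
```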
Thank you for all the work on TGI
I am able to run the model with the following command on 2.0.2:
text-generation-launcher --model-id=microsoft/Phi-3-mini-128k-instruct --port=80 --trust-remote-code --rope-factor=32 --rope-scaling=dynamic
However, I receive the warning:
2024-05-02T10:09:32.001826Z WARN text_generation_router: router/src/main.rs:266: Could not parse config Error("unknown variant `phi3`, expected one of `llava_next`, `clip_vision_model`, `mistral`, `idefics`, `idefics2`, `ssm`, `gpt_bigcode`, `santacoder`, `bloom`, `mpt`, `gpt_neox`, `phi`, `phi-msft`, `llama`, `baichuan`, `gemma`, `cohere`, `drbx`, `falcon`, `mixtral`, `starcoder2`, `qwen2`, `opt`, `t5`", line: 19, column: 22)
@Narsil since you added support for phi3, the above warning may be interesting to you.
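If it helps anyone checking whether the `--rope-scaling=dynamic` workaround produces usable text, here is a quick smoke test against the running server. It assumes the server is exposed on localhost:8080 as in the docker command above and that `requests` is installed; the prompt and sampling parameters are just placeholders:

```python
# Smoke test against a running TGI instance; port and prompt are assumptions,
# adjust to your setup.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Summarize the plot of Hamlet in two sentences.",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7, "do_sample": True},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```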
Model description
Hi all,
currently, the microsoft/Phi-3-mini-128k-instruct model is not supported by text-generation-inference, as shown in the following error:
The server is started with the following config:
Are there any plans to further support it?
Best wishes,
Christoph
Open source status
Provide useful links for the implementation
The link to the model is https://huggingface.co/microsoft/Phi-3-mini-128k-instruct