Something is bugged in your cache, I think. You are using a cache directory, it seems ("no API specified"), meaning you're pointing to a directory, not to the raw model id (if the directory has the same name as the model id, the folder takes precedence). And that folder is simply missing the tokenizer_config.json.
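For anyone else debugging this, one quick way to check whether a cached snapshot actually contains the tokenizer config is to glob the hub cache from inside the container. A minimal sketch, assuming the /data/hub cache path that appears in the logs below:

```python
# Minimal sketch: list cached model snapshots and flag any that are
# missing tokenizer_config.json. The /data/hub path is an assumption
# based on the TGI container defaults shown in the logs below.
from pathlib import Path

cache = Path("/data/hub")
for snapshot in cache.glob("models--*/snapshots/*"):
    has_config = (snapshot / "tokenizer_config.json").exists()
    print(f"{snapshot}: tokenizer_config.json {'found' if has_config else 'MISSING'}")
```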
Yeah, interesting point, I have tried fiddling with this... I got myself a new copy of the model and removed the caching (so that it downloads directly into the container), but this still happens. Note I moved down to Llama 8B for speed.
See logs:
2024-10-17T13:03:28.026779Z INFO text_generation_launcher: Args {
model_id: "meta-llama/Meta-Llama-3.1-8B-Instruct",
revision: Some(
"0e9e39f249a16976918f6564b8830bc894c89659",
),
validation_workers: 2,
sharded: Some(
true,
),
num_shard: Some(
4,
),
quantize: Some(
BitsandbytesNf4,
),
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: Some(
6144,
),
max_total_tokens: Some(
8191,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: Some(
6144,
),
max_batch_total_tokens: Some(
32000,
),
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "8e908ecad87e",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
}
2024-10-17T13:03:28.026876Z INFO hf_hub: Token file not found "/data/token"
2024-10-17T13:03:29.614923Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2024-10-17T13:03:29.614961Z WARN text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-10-17T13:03:29.614965Z INFO text_generation_launcher: Sharding model on 4 processes
2024-10-17T13:03:29.615109Z INFO download: text_generation_launcher: Starting check and download process for meta-llama/Meta-Llama-3.1-8B-Instruct
2024-10-17T13:03:33.981624Z INFO text_generation_launcher: Download file: model-00001-of-00004.safetensors
2024-10-17T13:11:25.940785Z INFO text_generation_launcher: Downloaded /data/hub/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00001-of-00004.safetensors in 0:07:51.
2024-10-17T13:11:25.940875Z INFO text_generation_launcher: Download: [1/4] -- ETA: 0:23:33
2024-10-17T13:11:25.941158Z INFO text_generation_launcher: Download file: model-00002-of-00004.safetensors
2024-10-17T13:19:34.950243Z INFO text_generation_launcher: Downloaded /data/hub/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00002-of-00004.safetensors in 0:08:09.
2024-10-17T13:19:34.950292Z INFO text_generation_launcher: Download: [2/4] -- ETA: 0:16:00
2024-10-17T13:19:34.950510Z INFO text_generation_launcher: Download file: model-00003-of-00004.safetensors
2024-10-17T13:27:38.321561Z INFO text_generation_launcher: Downloaded /data/hub/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00003-of-00004.safetensors in 0:08:03.
2024-10-17T13:27:38.321624Z INFO text_generation_launcher: Download: [3/4] -- ETA: 0:08:01.333333
2024-10-17T13:27:38.321805Z INFO text_generation_launcher: Download file: model-00004-of-00004.safetensors
2024-10-17T13:29:28.610014Z INFO text_generation_launcher: Downloaded /data/hub/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/model-00004-of-00004.safetensors in 0:01:50.
2024-10-17T13:29:28.610074Z INFO text_generation_launcher: Download: [4/4] -- ETA: 0
2024-10-17T13:29:29.408351Z INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Meta-Llama-3.1-8B-Instruct
2024-10-17T13:29:29.408847Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-10-17T13:29:29.409053Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
2024-10-17T13:29:29.409106Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-10-17T13:29:29.427812Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
2024-10-17T13:29:32.907167Z INFO text_generation_launcher: Using prefix caching = True
2024-10-17T13:29:32.907213Z INFO text_generation_launcher: Using Attention = flashinfer
2024-10-17T13:29:39.450311Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-10-17T13:29:39.459910Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-10-17T13:29:39.460301Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-10-17T13:29:39.467448Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2024-10-17T13:29:48.660819Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-10-17T13:29:48.662461Z INFO shard-manager: text_generation_launcher: Shard ready in 19.224555922s rank=0
2024-10-17T13:29:49.046434Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
2024-10-17T13:29:49.046652Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-10-17T13:29:49.046783Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
2024-10-17T13:29:49.069927Z INFO shard-manager: text_generation_launcher: Shard ready in 19.620941266s rank=2
2024-10-17T13:29:49.072665Z INFO shard-manager: text_generation_launcher: Shard ready in 19.623260049s rank=1
2024-10-17T13:29:49.078200Z INFO shard-manager: text_generation_launcher: Shard ready in 19.621143067s rank=3
2024-10-17T13:29:49.129942Z INFO text_generation_launcher: Starting Webserver
2024-10-17T13:29:49.203577Z INFO text_generation_router_v3: backends/v3/src/lib.rs:90: Warming up model
2024-10-17T13:29:52.075881Z INFO text_generation_launcher: Cuda Graphs are disabled (CUDA_GRAPHS=None).
2024-10-17T13:29:52.076421Z WARN text_generation_router_v3: backends/v3/src/lib.rs:59: `--max-batch-total-tokens` is deprecated for Flash Attention models.
2024-10-17T13:29:52.076448Z WARN text_generation_router_v3: backends/v3/src/lib.rs:63: Inferred max batch total tokens: 485257
2024-10-17T13:29:52.076454Z INFO text_generation_router_v3: backends/v3/src/lib.rs:102: Setting max batch total tokens to 485257
2024-10-17T13:29:52.076492Z INFO text_generation_router_v3: backends/v3/src/lib.rs:127: Using backend V3
2024-10-17T13:29:52.076513Z INFO text_generation_router::server: router/src/server.rs:1524: Using the Hugging Face API
2024-10-17T13:29:52.076574Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/data/token"
2024-10-17T13:29:52.959416Z INFO text_generation_router::server: router/src/server.rs:2248: Serving revision 0e9e39f249a16976918f6564b8830bc894c89659 of model meta-llama/Llama-3.1-8B-Instruct
2024-10-17T13:29:52.959457Z WARN text_generation_router::server: router/src/server.rs:1610: Could not find tokenizer config locally and no API specified
2024-10-17T13:29:52.959466Z INFO text_generation_router::server: router/src/server.rs:1670: Using config None
2024-10-17T13:29:52.959469Z WARN text_generation_router::server: router/src/server.rs:1672: Could not find a fast tokenizer implementation for meta-llama/Meta-Llama-3.1-8B-Instruct
2024-10-17T13:29:52.959472Z WARN text_generation_router::server: router/src/server.rs:1673: Rust input length validation and truncation is disabled
2024-10-17T13:29:52.959519Z WARN text_generation_router::server: router/src/server.rs:1817: Invalid hostname, defaulting to 0.0.0.0
2024-10-17T13:29:52.962558Z INFO text_generation_router::server: router/src/server.rs:2210: Connected
I noticed that inside TGI the models now go into /data/hub by default; I think it used to be /data/models--*, but I assume this is a somewhat recent HF change rather than one made directly in TGI.
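If it helps to confirm what actually landed in the cache, huggingface_hub ships a cache scanner. A minimal sketch, assuming the cache is mounted at /data/hub as in the download paths in the logs above:

```python
# Minimal sketch: inspect the hub cache layout with huggingface_hub's
# built-in scanner. The /data/hub location is an assumption taken from
# the download paths in the logs above.
from huggingface_hub import scan_cache_dir

info = scan_cache_dir(cache_dir="/data/hub")
for repo in info.repos:
    print(repo.repo_id, repo.size_on_disk_str,
          [rev.commit_hash for rev in repo.revisions])
```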
Docker compose for transparency:
services:
  llama3.1-8b-tgi:
    image: ghcr.io/huggingface/text-generation-inference:2.3.1
    container_name: tgi-llama3.1-8B-instruct
    environment:
      HUGGING_FACE_HUB_TOKEN: get_yer_own
    restart: "no"
    ports:
      - 58085:80
    shm_size: '1gb'
    command: --model-id meta-llama/Meta-Llama-3.1-8B-Instruct --revision 0e9e39f249a16976918f6564b8830bc894c89659 --quantize bitsandbytes-nf4 --sharded true --num-shard 4 --max-batch-total-tokens 32000 --max-total-tokens 8191 --max-input-length=6144 --max-batch-prefill-tokens=6144
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1', '2', '3']
              capabilities: [gpu]
Any thoughts? Thanks 😄
Hi @Johnno1011,
I think this might help.
I noticed that your model-id is set to meta-llama/Meta-Llama-3.1-70B-Instruct. While working with this model on the HF Hub I faced a similar issue, and I found that the model-id has been updated to meta-llama/Llama-3.1-70B-Instruct. Although there's a redirect in place, some files may not be properly utilized by TGI when using the old Meta-Llama... identifier.
Try updating your model-id to the new one and see if that resolves the issue.
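One way to sanity-check the rename before restarting the container is to fetch tokenizer_config.json directly from the new repo id. A minimal sketch, assuming you have a valid HF token for the gated repo and want the default branch rather than a pinned commit:

```python
# Minimal sketch: confirm the renamed repo serves tokenizer_config.json.
# The revision and token values here are placeholders/assumptions.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="meta-llama/Llama-3.1-70B-Instruct",
    filename="tokenizer_config.json",
    revision="main",  # or pin the commit hash you deploy with
    token="get_yer_own",
)
print("Fetched:", path)
```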
You legend, this fixed it! I can't believe they even changed the tag so silently like that 😢 Thank you!
System Info
text-generation-inference v2.3.1
meta-llama/Meta-Llama-3.1-70B-Instruct
Reproduction
Call the /v1/chat/completions route, with or without tools; the response will be:
{ "error": "Template error: template not found", "error_type": "template_error" }
You can even call this route using the exact curl provided in the documentation here and it will still fail for the same reason. I have double-checked that the model being passed into the container has a chat_template in its tokenizer_config.json. I've had a thorough look at previous issues related to this, but have been unsuccessful in finding the solution. Is it a problem within TGI itself? It's odd that the chat template is not found when it's most definitely there.
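For reference, a minimal reproduction of the failing call, assuming the container is published on port 58085 as in the compose file above:

```python
# Minimal sketch of the failing request; host/port are assumptions
# taken from the docker compose above.
import requests

resp = requests.post(
    "http://localhost:58085/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "What is deep learning?"}],
        "max_tokens": 64,
    },
)
print(resp.status_code, resp.json())
# With the old model id this returns:
# {"error": "Template error: template not found", "error_type": "template_error"}
```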
Some startup logs from my container for you:
You can clearly see from the logs that TGI isn't finding the tokenizer_config.json file, but I can't understand why this is the case.
Please advise. Thanks!
Expected behavior
I should be able to call the /v1/chat/completions API route with this model.
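For completeness, the expected behaviour once the template is found, sketched with the OpenAI-compatible client that TGI's Messages API supports; the base_url and port are assumptions taken from the compose file above:

```python
# Minimal sketch of a successful call via TGI's OpenAI-compatible
# Messages API; base_url/port are assumptions from the compose file.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:58085/v1", api_key="-")
chat = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What is deep learning?"}],
    max_tokens=64,
)
print(chat.choices[0].message.content)
```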