huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Template Not Found When Using OpenAI format Chat Completion #1545

Closed binarycrayon closed 5 months ago

binarycrayon commented 7 months ago

System Info

Docker Image: ghcr.io/huggingface/text-generation-inference:sha-1734540
Instance: AWS A10G via Hugging Face Inference Endpoints

Information

Tasks

Reproduction

On Huggingface Inference Endpoint:

Served a fine-tuned model: JamAndTeaStudios/dialogue-choice-merged-01-30-sft-mistral-7b-instruct-0.2 (a PEFT fine-tuned and merged Mistral 7B Instruct v0.2 model, including the tokenizer)
Task: Text Generation
TGI Docker Image: ghcr.io/huggingface/text-generation-inference:sha-1734540
Instance: AWS GPU A10G in the east region

Expected behavior

Once the inference URL was up and running, I followed https://huggingface.co/blog/tgi-messages-api, configured the OpenAI client with the URL, and then called the chat completion endpoint.
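
Roughly, the client setup from that blog post looks like this (a minimal sketch; the endpoint URL and token below are placeholders):

from openai import OpenAI

# Point the OpenAI client at the TGI Messages API (placeholder URL and token)
client = OpenAI(
    base_url="https://<your-endpoint>.endpoints.huggingface.cloud/v1/",
    api_key="hf_xxx",  # Hugging Face access token
)

response = client.chat.completions.create(
    model="tgi",  # TGI ignores the model name; "tgi" is the conventional placeholder
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)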

Instead I saw a 422 error:

UnprocessableEntityError: Error code: 422 - {'error': 'Template error: template not found', 'error_type': 'template_error'}

log from TGI endpoint:

2024/02/09 11:13:20 ~ {"timestamp":"2024-02-09T19:13:20.293485Z","level":"INFO","fields":{"message":"Shard ready in 5.304520806s"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/02/09 11:13:20 ~ {"timestamp":"2024-02-09T19:13:20.392520Z","level":"INFO","fields":{"message":"Starting Webserver"},"target":"text_generation_launcher"}
2024/02/09 11:13:20 ~ {"timestamp":"2024-02-09T19:13:20.435457Z","level":"INFO","message":"Using the Hugging Face API","target":"text_generation_router","filename":"router/src/main.rs","line_number":175}
2024/02/09 11:13:20 ~ {"timestamp":"2024-02-09T19:13:20.435497Z","level":"INFO","message":"Token file not found \"/root/.cache/huggingface/token\"","log.target":"hf_hub","log.module_path":"hf_hub","log.file":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs","log.line":55,"target":"hf_hub","filename":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs","line_number":55}
2024/02/09 11:13:20 ~ {"timestamp":"2024-02-09T19:13:20.927432Z","level":"INFO","message":"Serving revision ac2ae5fab2ce3f9f40dc79b5ca9f637430d24971 of model bigscience/bloom-560m","target":"text_generation_router","filename":"router/src/main.rs","line_number":425}
2024/02/09 11:13:20 ~ {"timestamp":"2024-02-09T19:13:20.927459Z","level":"INFO","message":"Using the Hugging Face API to retrieve tokenizer config","target":"text_generation_router","filename":"router/src/main.rs","line_number":236}
2024/02/09 11:13:20 ~ {"timestamp":"2024-02-09T19:13:20.931578Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":285}
2024/02/09 11:13:22 ~ {"timestamp":"2024-02-09T19:13:22.161431Z","level":"WARN","message":"Model does not support automatic max batch total tokens","target":"text_generation_router","filename":"router/src/main.rs","line_number":299}
2024/02/09 11:13:22 ~ {"timestamp":"2024-02-09T19:13:22.161459Z","level":"INFO","message":"Setting max batch total tokens to 16000","target":"text_generation_router","filename":"router/src/main.rs","line_number":321}
2024/02/09 11:13:22 ~ {"timestamp":"2024-02-09T19:13:22.161463Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":322}
2024/02/09 11:13:22 ~ {"timestamp":"2024-02-09T19:13:22.161467Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":327}
2024/02/09 11:15:57 ~ {"timestamp":"2024-02-09T19:15:57.936132Z","level":"ERROR","message":"Template error: template not found","target":"text_generation_router::server","filename":"router/src/server.rs","line_number":584,"span":{"name":"chat_completions"},"spans":[{"name":"chat_completions"}]}
2024/02/09 11:16:05 ~ {"timestamp":"2024-02-09T19:16:05.382898Z","level":"ERROR","message":"Template error: template not found","target":"text_generation_router::server","filename":"router/src/server.rs","line_number":584,"span":{"name":"chat_completions"},"spans":[{"name":"chat_completions"}]}

I expected the endpoint to just work. I also wonder what caused the "Token file not found" message and what I should do about it.

binarycrayon commented 7 months ago

The model repo contains:

special_tokens_map.json
tokenizer.json
tokenizer.model
tokenizer_config.json
adinin commented 7 months ago

I'm experiencing the same with deepseek-ai/deepseek-coder-33b-instruct and mistralai/Mixtral-8x7B-Instruct-v0.1 models (those are the only models I tried with 1.4.0 and latest).

I checked tokenizer_config.json to make sure that chat_template is set. Both models have that set.

I noticed that there was recently a fix around picking up tokenizer_config.json locally. That didn't affect this error.

9876691 commented 7 months ago

Same for me. Steps to reproduce.

Inference

model=TheBloke/Llama-2-7B-Chat-GPTQ
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model --quantize gptq

Testing the non-OpenAI route

curl http://localhost:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

That works

Testing the OpenAI route

curl -N http://localhost:8080/v1/chat/completions   -H "content-type: application/json"   -d '{
     "model": "TheBloke/Llama-2-7B-Chat-GPTQ",
     "messages": [{"role": "user", "content": "Give me some tips on writing job postings"}]}'

Fails with the template error.

9876691 commented 7 months ago

Looks like TGI needs the chat template to flatten the chat history into a single prompt: https://github.com/huggingface/text-generation-inference/blob/main/router/src/infer.rs#L94

Does anyone know how to provide the template?

Looks like something like this is needed https://huggingface.co/docs/transformers/chat_templating

So I found that the Mistral Instruct config, https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1/blob/main/tokenizer_config.json, contains a chat_template section.

So if you don't have that section in your model's tokenizer_config.json, then I guess the OpenAI endpoint is not going to work.
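
For context, this is roughly what that chat_template does when applied on the transformers side; TGI performs the equivalent rendering server-side (a minimal sketch; the model id is illustrative):

from transformers import AutoTokenizer

# Load a tokenizer whose tokenizer_config.json defines a chat_template (illustrative model id)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [{"role": "user", "content": "Give me some tips on writing job postings"}]

# Render the message list into the single prompt string the model expects
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)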

adinin commented 7 months ago

@9876691 I agree that that error would make sense if chat_template was missing, but

I checked tokenizer_config.json to make sure that chat_template is set. Both models have that set.

I don't think that's what's causing the problem for me at least.

drbh commented 7 months ago

Hi @adinin, I believe the issue is related to the type of the bos_token and eos_token in the tokenizer_config.json. Currently TGI expects the tokens to be of type string, but in some cases the config has a more complex type. There is an open PR that should resolve this issue once https://github.com/huggingface/text-generation-inference/pull/1550 is merged.

amihalik commented 7 months ago

Hi @adinin, I believe the issue is related to the type of the bos_token and eos_token in the tokenizer_config.json. Currently TGI expects the tokens to be of type string, but in some cases the config has a more complex type. There is an open PR that should resolve this issue once #1550 is merged.

If that's the case, this issue and issue #1534 are duplicates.

Thanks for putting in the fix @drbh. I'm looking forward to the update.

vibhorag101 commented 7 months ago

Same issue. I am getting the following error while using the llama-2-chat-hf model:

text_generation_router::server: router/src/server.rs:585: Template error: invalid operation: object has no method named strip (in <string>:1)

gabewillen commented 7 months ago

Same issue. I am getting the following error while using the llama-2-chat-hf model:

text_generation_router::server: router/src/server.rs:585: Template error: invalid operation: object has no method named strip (in <string>:1)

I'm also experiencing this issue. You can pass in your own tokenizer config via the command-line argument:

--tokenizer-config-path <TOKENIZER_CONFIG_PATH>
   The path to the tokenizer config file. This path is used to load the tokenizer configuration which may include a `chat_template`. If not provided, the default config will be used from the model hub  [env: TOKENIZER_CONFIG_PATH=]

Then copy the default tokenizer_config.json from the model and replace the chat_template with the one below. Make sure you strip your message content yourself when calling it, though. This is a workaround until they fix it.

"chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\n' + system_message + '\n<</SYS>>\n\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + '[INST] ' + content + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' '  + content + ' ' + eos_token }}{% endif %}{% endfor %}"

vibhorag101 commented 7 months ago

@gabewillen So, we must manually strip out the input message before feeding it to the OpenAI client?

gabewillen commented 7 months ago

@gabewillen So, we must manually strip out the input message before feeding it to the OpenAI client?

@vibhorag101 Just remove the whitespace, as that's what was being done in the template and what was causing the failure. So just make sure you do:

message = {
    "role": "user",
    "content": content.strip()
}

That will make sure the template isn't affected.

gabewillen commented 7 months ago

Also ensure your first message after the optional system message has a "user" role and that they alternate between "user" and "assistant".

nguyenhoanganh2002 commented 7 months ago

same issue here with Mistral-7B-awq

noddler123 commented 6 months ago

same issue here with llama

SoftDed commented 6 months ago

same issue here with Mistral-7B-awq

In my case, this problem occurred because I was using the "non-instruction" version of the model. It's important to have the "chat_template" section in the tokenizer_config.json file.

https://huggingface.co/TheBloke/Mistral-7B-v0.1-AWQ/blob/main/tokenizer_config.json
https://huggingface.co/TheBloke/Mistral-7B-Merge-14-v0.1-AWQ/blob/main/tokenizer_config.json
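
A quick way to check whether a given model actually ships a chat_template (a minimal sketch; the model id is just an example):

from transformers import AutoTokenizer

# Base models usually have no chat_template; instruct/chat variants usually do (example id)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-v0.1-AWQ")

# None (or a missing key in tokenizer_config.json) means TGI has no template for /v1/chat/completions
print(tokenizer.chat_template)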

vibhorag101 commented 6 months ago

For now, the fix suggested by @gabewillen works well for me. But I think the underlying issue is still not fixed in the project.

drbh commented 6 months ago

Hi @vibhorag101, the issue is likely due to the .strip() method, which is not supported by TGI at the moment. TGI currently strictly supports the Jinja spec, which uses | trim instead of .strip(). Many templates on the Hub follow this syntax, but some still include .strip() and other non-Jinja methods.

We're exploring adding an internal workaround, but currently the fastest solution is to copy the file locally and replace .strip() with | trim, as well as opening a PR on the Hugging Face Hub for the models that use non-Jinja syntax.
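
For example, that local edit could look like this (a minimal sketch of the approach described above; the model id is illustrative, fetching the file with hf_hub_download is an assumption, and it assumes chat_template is a plain string):

import json
from huggingface_hub import hf_hub_download

# Fetch the model's tokenizer_config.json from the Hub (illustrative model id)
path = hf_hub_download("TheBloke/Llama-2-7B-Chat-GPTQ", "tokenizer_config.json")

with open(path) as f:
    config = json.load(f)

# Replace the non-Jinja .strip() calls with the Jinja-native | trim filter
config["chat_template"] = config["chat_template"].replace(".strip()", " | trim")

with open("tokenizer_config.json", "w") as f:
    json.dump(config, f, indent=2)

# Then point TGI at the edited file with --tokenizer-config-path tokenizer_config.json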

ibndias commented 6 months ago

Experiencing the same problem with Qwen 72B running with --quantize=bitsandbytes-nf4: https://huggingface.co/Qwen/Qwen1.5-72B-Chat/blob/main/tokenizer_config.json

lihan commented 6 months ago

Hi @gabewillen, thanks for the tips above. Mine has the chat_template: https://huggingface.co/TheBloke/openchat-3.5-0106-AWQ/blob/main/tokenizer_config.json#L51

But it raises a different error

openai.UnprocessableEntityError: Error code: 422 - {'error': 'Template error: invalid operation: object has no method named title (in <string>:1)', 'error_type': 'template_error'}

I'm on version 1.4.3

drbh commented 6 months ago

Hi @binarycrayon, have you been able to resolve this issue by ensuring that the tokenizer_config.json contains a valid chat_template?


Regarding others who are having issues: the template error messages contain information about the specific problem.

For example, the error shared above says object has no method named title, which indicates that the title method used in the chat template is not valid.

As noted above, TGI strictly uses standard Jinja (see the spec here).

Please make sure that the model you are loading uses standard Jinja. If a model does not follow the standard, I encourage opening PRs on the Hub (which will help others running into these issues), like this one. Additionally, you can load the model locally and update the chat_template on your machine to resolve the issue.

TLDR;

Please update the template to use standard Jinja:

.title() -> |title
.strip() -> |trim

ibndias commented 6 months ago

On Qwen 72B there is nothing wrong with the chat_template: https://huggingface.co/PNU-Infosec/Qwen1.5-72B-Chat/blob/main/tokenizer_config.json

Yet I still got "template not found":

2024-03-27T09:12:10.590140Z  WARN text_generation_router: router/src/main.rs:343: Invalid hostname, defaulting to 0.0.0.0
2024-03-27T09:13:19.203101Z ERROR chat_completions: text_generation_router::server: router/src/server.rs:773: Template error: template not found
2024-03-27T09:13:19.236534Z ERROR chat_completions: text_generation_router::server: router/src/server.rs:773: Template error: template not found

I'm using the latest TGI Docker image, 1.4.4.

rastna12 commented 6 months ago

I was getting this issue as well while running a Llama 2 chat variant a couple of weeks ago. Pulling the latest TGI server Docker image (as of a couple of weeks ago) cleared it up, so that may be a low-effort solution for some folks to try.

Michelklingler commented 5 months ago

Same issue. I am getting the following error while using the llama-2-chat-hf model:

text_generation_router::server: router/src/server.rs:585: Template error: invalid operation: object has no method named strip (in <string>:1)

I'm also experiencing this issue. You can pass in your own tokenizer config via the command-line argument:

--tokenizer-config-path <TOKENIZER_CONFIG_PATH>
   The path to the tokenizer config file. This path is used to load the tokenizer configuration which may include a `chat_template`. If not provided, the default config will be used from the model hub  [env: TOKENIZER_CONFIG_PATH=]

Then copy the default tokenizer_config.json from the model and replace the chat_template with the one below. Make sure you strip your message content yourself when calling it, though. This is a workaround until they fix it.

"chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\n' + system_message + '\n<</SYS>>\n\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + '[INST] ' + content + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' '  + content + ' ' + eos_token }}{% endif %}{% endfor %}"

OMG, thanks so much! I couldn't find a way to use a system prompt with Mixtral 8x7B and vLLM. This works like a charm.

I just modified the chat template in the tokenizer_config.json... It works A1. It's crazy that there is almost no mention of this hack on the web.