Open majestichou opened 7 months ago
I got a 400 Bad Request when using this endpoint configuration to connect chat-ui to vLLM:
"endpoints": [{
"type" : "openai",
"baseURL": "http://llm:8000/v1"
}],
Logs from ChatUI
03:40:01 8|index | BadRequestError: 400 status code (no body)
03:40:01 8|index | at APIError.generate (file:///app/build/server/chunks/index-8c2ab54f.js:88218:20)
03:40:01 8|index | at OpenAI.makeStatusError (file:///app/build/server/chunks/index-8c2ab54f.js:88999:25)
03:40:01 8|index | at OpenAI.makeRequest (file:///app/build/server/chunks/index-8c2ab54f.js:89038:30)
03:40:01 8|index | at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
03:40:01 8|index | at async file:///app/build/server/chunks/models-f09d9a41.js:289:9
03:40:01 8|index | at async generateFromDefaultEndpoint (file:///app/build/server/chunks/_server.ts-29ca7ec1.js:36:23)
03:40:01 8|index | at async summarize (file:///app/build/server/chunks/_server.ts-29ca7ec1.js:307:10)
03:40:01 8|index | at async file:///app/build/server/chunks/_server.ts-29ca7ec1.js:468:26 {
03:40:01 8|index | status: 400,
03:40:01 8|index | headers: {
03:40:01 8|index | 'content-length': '269',
03:40:01 8|index | 'content-type': 'application/json',
03:40:01 8|index | date: 'Fri, 15 Mar 2024 03:40:00 GMT',
03:40:01 8|index | server: 'uvicorn'
03:40:01 8|index | },
03:40:01 8|index | error: undefined,
03:40:01 8|index | code: undefined,
03:40:01 8|index | param: undefined,
03:40:01 8|index | type: undefined
03:40:01 8|index | }
Any idea?
You should set the id in .env.local; the id depends on your vLLM config (it has to match the model name that vLLM is serving).
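For illustration, a minimal sketch (the model path and baseURL below are assumptions taken from the config shared later in this thread): vLLM normally serves the model under the value passed to --model (unless a served model name is configured separately), so the MODELS entry needs an id equal to that name and a baseURL pointing at the vLLM server, e.g.:

MODELS=`[
  {
    "name": "/data/models/Llama-2-70b-chat-hf/",
    "id": "/data/models/Llama-2-70b-chat-hf/",
    "endpoints": [{
      "type": "openai",
      "baseURL": "http://llm:8000/v1"
    }]
  }
]`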
I used chat-ui-db (https://github.com/huggingface/chat-ui/pkgs/container/chat-ui-db) as the front end and vLLM (https://github.com/vllm-project/vllm) as the back end for large language model inference, with the Llama-2-70b-chat-hf model weights. The content of the .env.local file is as follows:
# Use .env.local to change these variables
# DO NOT EDIT THIS FILE WITH SENSITIVE DATA
MONGODB_DB_NAME=chat-ui
MONGODB_DIRECT_CONNECTION=false
COOKIE_NAME=hf-chat
HF_TOKEN=#hf_<token> from from https://huggingface.co/settings/token
HF_API_ROOT=https://api-inference.huggingface.co/models
OPENAI_API_KEY=#your openai api key here
HF_ACCESS_TOKEN=#LEGACY! Use HF_TOKEN instead

# used to activate search with web functionality. disabled if none are defined. choose one of the following:
YDC_API_KEY=#your docs.you.com api key here
SERPER_API_KEY=#your serper.dev api key here
SERPAPI_KEY=#your serpapi key here
SERPSTACK_API_KEY=#your serpstack api key here
USE_LOCAL_WEBSEARCH=#set to true to parse google results yourself, overrides other API keys
SEARXNG_QUERY_URL=# where '<query>' will be replaced with query keywords see https://docs.searxng.org/dev/search_api.html eg https://searxng.yourdomain.com/search?q=<query>&engines=duckduckgo,google&format=json

WEBSEARCH_ALLOWLIST=`[]` # if it's defined, allow websites from only this list.
WEBSEARCH_BLOCKLIST=`[]` # if it's defined, block websites from this list.

# Parameters to enable open id login
OPENID_CONFIG=`{
  "PROVIDER_URL": "",
  "CLIENT_ID": "",
  "CLIENT_SECRET": "",
  "SCOPES": ""
}`

# /!\ legacy openid settings, prefer the config above
#OPENID_CLIENT_ID=
#OPENID_CLIENT_SECRET=
#OPENID_SCOPES="openid profile" # Add "email" for some providers like Google that do not provide preferred_username
#OPENID_PROVIDER_URL=https://huggingface.co # for Google, use https://accounts.google.com
#OPENID_TOLERANCE=
#OPENID_RESOURCE=

# Parameters to enable a global mTLS context for client fetch requests
USE_CLIENT_CERTIFICATE=false
CERT_PATH=#
KEY_PATH=#
CA_PATH=#
CLIENT_KEY_PASSWORD=#
REJECT_UNAUTHORIZED=true

TEXT_EMBEDDING_MODELS = `[
  {
    "name": "Xenova/gte-small",
    "displayName": "Xenova/gte-small",
    "description": "Local embedding model running on the server.",
    "chunkCharLength": 512,
    "endpoints": [
      { "type": "transformersjs" }
    ]
  }
]`

# 'name', 'userMessageToken', 'assistantMessageToken' are required
MODELS=`[
  {
    "name": "/data/models/Llama-2-70b-chat-hf/",
    "id": "/data/models/Llama-2-70b-chat-hf/",
    "endpoints": [{
      "type" : "openai",
      "baseURL": "http://textgen:8000/v1",
    }],
    "preprompt": " ",
    "chatPromptTemplate" : "<s>[INST] <<SYS>>\n{{preprompt}}\n<</SYS>>\n\n{{#each messages}}{{#ifUser}}{{content}} [/INST] {{/ifUser}}{{#ifAssistant}}{{content}} </s><s>[INST] {{/ifAssistant}}{{/each}}",
    "promptExamples": [
      {
        "title": "Write an email from bullet list",
        "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
      },
      {
        "title": "Code a snake game",
        "prompt": "Code a basic snake game in python, give explanations for each step."
      },
      {
        "title": "Assist in a task",
        "prompt": "How do I make a delicious lemon cheesecake?"
      }
    ],
    "parameters": {
      "temperature": 0.1,
      "top_p": 0.95,
      "repetition_penalty": 1.2,
      "top_k": 50,
      "truncate": 1000,
      "max_new_tokens": 1024,
      "stop" : ["</s>", "</s><s>[INST]"]
    }
  }
]`

OLD_MODELS=`[]`# any removed models, `{ name: string, displayName?: string, id?: string }`
TASK_MODEL= # name of the model used for tasks such as summarizing title, creating query, etc.

PUBLIC_ORIGIN=#https://huggingface.co
PUBLIC_SHARE_PREFIX=#https://hf.co/chat
PUBLIC_GOOGLE_ANALYTICS_ID=#G-XXXXXXXX / Leave empty to disable
PUBLIC_PLAUSIBLE_SCRIPT_URL=#/js/script.js / Leave empty to disable
PUBLIC_ANNOUNCEMENT_BANNERS=`[
  {
    "title": "Remember that the results generated by the large language model are not 100% accurate. Please decide for yourself whether you want to take the answers from the large language model."
  }
]`

PARQUET_EXPORT_DATASET=
PARQUET_EXPORT_HF_TOKEN=
ADMIN_API_SECRET=# secret to admin API calls, like computing usage stats or exporting parquet data
PARQUET_EXPORT_SECRET=#DEPRECATED, use ADMIN_API_SECRET instead

RATE_LIMIT= # requests per minute
MESSAGES_BEFORE_LOGIN=# how many messages a user can send in a conversation before having to login. set to 0 to force login right away

APP_BASE="" # base path of the app, e.g. /chat, left blank as default
PUBLIC_APP_NAME=WTAGENT # name used as title throughout the app
PUBLIC_APP_ASSETS=chatui # used to find logos & favicons in static/$PUBLIC_APP_ASSETS
PUBLIC_APP_COLOR=blue # can be any of tailwind colors: https://tailwindcss.com/docs/customizing-colors#default-color-palette
PUBLIC_APP_DESCRIPTION=# description used throughout the app (if not set, a default one will be used)
PUBLIC_APP_DATA_SHARING=#set to 1 to enable options & text regarding data sharing
PUBLIC_APP_DISCLAIMER=#set to 1 to show a disclaimer on login page
PUBLIC_APP_DISCLAIMER_MESSAGE="Disclaimer: AI is an area of active research with known problems such as biased generation and misinformation. Do not use this application for high-stakes decisions or advice."
LLM_SUMMERIZATION=true

EXPOSE_API=true

# PUBLIC_APP_NAME=HuggingChat
# PUBLIC_APP_ASSETS=huggingchat
# PUBLIC_APP_COLOR=yellow
# PUBLIC_APP_DESCRIPTION="Making the community's best AI chat models available to everyone."
# PUBLIC_APP_DATA_SHARING=1
# PUBLIC_APP_DISCLAIMER=1

ENABLE_ASSISTANTS=false #set to true to enable assistants feature
ALTERNATIVE_REDIRECT_URLS=`[]` #valide alternative redirect URL for OAuth
WEBHOOK_URL_REPORT_ASSISTANT=#provide webhook url to get notified when an assistant gets reported
ALLOWED_USER_EMAILS=`[]` # if it's defined, only these emails will be allowed to use the app

MONGODB_URL=mongodb://localhost:27017
The content of the docker-compose.yml file is as follows:
services:
  chat-ui:
    container_name: chat-ui
    image: chat-ui-db:latest
    ports:
      - "3000:3000"
    restart: unless-stopped
  textgen:
    container_name: textgen
    image: vllm/vllm-openai:latest
    ports:
      - "8080:8000"
    ipc: host
    environment:
      - TRANSFORMERS_OFFLINE=1
      - HF_DATASET_OFFLINE=1
    command: --model "/data/models/Llama-2-70b-chat-hf/" --tensor-parallel-size 8
    volumes:
      - /home/mnt/test/llm-test/serving/data/models/:/data/models/
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 8
              capabilities: [gpu]
    restart: unless-stopped
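A quick sanity check for this wiring (a sketch; localhost:8080 assumes the host port mapping above, while chat-ui inside the compose network would use http://textgen:8000/v1 instead) is to list the models the vLLM OpenAI-compatible server actually exposes, since that id is what chat-ui must send:

# Minimal check of the vLLM OpenAI-compatible server: list the served models.
# The "id" returned here is the value chat-ui must use as the model id.
import requests

resp = requests.get("http://localhost:8080/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # expected: /data/models/Llama-2-70b-chat-hf/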
I started the services with the docker compose up command, then opened localhost:3000 in the Chrome browser and typed "what can you do?" into the chat box of chat-ui. The output of Llama-2-70b-chat-hf is as follows:
As a text-based AI assistant, I can help with a variety of tasks. Here are some examples of what I can do: 1. Answer questions: I can answer questions on a wide range of topics, from science and history to entertainment and culture. 2. Provide definitions: If you're unsure of the meaning of a word or phrase, I can provide definitions and explanations. 3. Translate text: I can translate text from one language to another. I currently support translations in dozens of languages. 4. Summarize content: If you have a long piece of text and want to get a quick summary of the main points, I can help with that. 5. Offer suggestions: If you're stuck on a problem or need ideas for something, I can offer suggestions and ideas to help you out. 6. Chat: I can have a conversation with you, answering your questions and engaging in discussion on a wide range of topics. 7. Generate text: I can generate text based on prompts or topics, which can be useful for writing articles, creating content, or even composing emails or messages. 8 . Check grammar and spelling :I can help you catch grammatical errors ,spelling mistakes ,and punctuation errors in your text . 9 . Provide synonyms :If you want to avoid using the same word over and over again ,I cant suggest synonyms that convey the same meaning . 10 . Converse in different languages :I am capable conversing in multiple languages including English ,Spanish ,French among others .Please let me know if there is anything specific way i could assist
As the output shows, the formatting becomes abnormal after the eighth item. I then clicked "Download prompts and parameters"; the content is as follows:
{
  "note": "This is a preview of the prompt that will be sent to the model when retrying the message. It may differ from what was sent in the past if the parameters have been updated since",
  "prompt": "<s>[INST] <<SYS>>\n \n<</SYS>>\n\nwhat can you do? [/INST] ",
  "model": "/data/models/Llama-2-70b-chat-hf/",
  "parameters": {
    "temperature": 0.1,
    "truncate": 1000,
    "max_new_tokens": 1024,
    "stop": [ "</s>", "</s><s>[INST]" ],
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.2,
    "stop_sequences": [ "</s>", "</s><s>[INST]" ],
    "return_full_text": false
  }
}
The vLLM output log is as follows:
textgen | INFO 03-08 17:33:29 async_llm_engine.py:436] Received request cmpl-79987e7082494898af14502cd2e9a2f7: prompt: '<s>[INST] <<SYS>>\n\n<</SYS>>\n\nwhat can you do? [/INST]', prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=1.2, repetition_penalty=1.0, temperature=0.1, top_p=0.95, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>', '</s><s>[INST]'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [1, 1, 518, 25580, 29962, 3532, 14816, 29903, 6778, 13, 13, 29966, 829, 14816, 29903, 6778, 13, 13, 5816, 508, 366, 437, 29973, 518, 29914, 25580, 29962], lora_request: None.
There's a question here: "top_k" in "Download prompts and parameters" is 50, but in the vLLM log it is -1. That's weird.
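One possible explanation, offered as an assumption rather than a confirmed cause: top_k and repetition_penalty are not part of the standard OpenAI API, so an OpenAI-style client may not forward them at all; -1 and 1.0 are simply vLLM's SamplingParams defaults, and the frequency_penalty=1.2 in the log suggests the 1.2 repetition penalty was forwarded as the OpenAI frequency_penalty field instead. A minimal sketch (assuming the host port mapping 8080 -> 8000 from the compose file) to send the same question straight to vLLM with the intended sampling parameters and compare the SamplingParams line it logs:

# Sketch: call vLLM's OpenAI-compatible chat endpoint directly with the same
# parameters shown in "Download prompts and parameters". top_k and
# repetition_penalty are non-standard OpenAI fields; whether vLLM accepts
# them as extras depends on the vLLM version, so check the logged
# SamplingParams for this request.
import requests

payload = {
    "model": "/data/models/Llama-2-70b-chat-hf/",
    "messages": [{"role": "user", "content": "what can you do?"}],
    "temperature": 0.1,
    "top_p": 0.95,
    "max_tokens": 1024,
    "stop": ["</s>", "</s><s>[INST]"],
    "top_k": 50,                # non-standard field, may be ignored
    "repetition_penalty": 1.2,  # non-standard field, may be ignored
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])

If the direct call produces clean output, the degradation more likely comes from how chat-ui builds the request (prompt template plus parameter mapping) than from vLLM itself.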
Then I asked another question, typing "Code a basic snake game in python, give explanations for each step." into the chat box. The output of Llama-2-70b-chat-hf is as follows:
# Import the necessary libraries import pygame import time # Initialize pygame pygame.init() # Set the screen dimensions screen_width = 640 screen_height = 480 screen = pygame.display.set_mode((screen_width, screen_height)) pygame.display.set_caption("Snake Game") # set window title)"]])} # set window title)"]])} # set window title)"]})] # set window title)")} # set window title}")"}]]} ]))) }}}}" />)]]]]]]]) }}}}" />)]]]]]]]) }}}}" />)]]]]]"]) }}}}" />)]]]]]"]) }}}}" />)]]}]]} ]}}}"> ]]"}"> ]]"}"> ]]"}"> ]]"}"> ]]"}> [[[[[[]]])> [[[[[[]]])> [[[[[[]]])> [[
The output of Llama-2-70b-chat-hf is completely wrong. I then clicked "Download prompts and parameters"; the content is as follows:
{
  "note": "This is a preview of the prompt that will be sent to the model when retrying the message. It may differ from what was sent in the past if the parameters have been updated since",
  "prompt": "<s>[INST] <<SYS>>\n \n<</SYS>>\n\nCode a basic snake game in python, give explanations for each step. [/INST] ",
  "model": "/data/models/Llama-2-70b-chat-hf/",
  "parameters": {
    "temperature": 0.1,
    "truncate": 1000,
    "max_new_tokens": 1024,
    "stop": [ "</s>", "</s><s>[INST]" ],
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.2,
    "stop_sequences": [ "</s>", "</s><s>[INST]" ],
    "return_full_text": false
  }
}
The vLLM output log is as follows:
textgen | INFO 03-08 18:06:42 async_llm_engine.py:436] Received request cmpl-aa52bf8ea8684e36846defd5e5a3f7be: prompt: '<s>[INST] <<SYS>>\n\n<</SYS>>\n\nCode a basic snake game in python, give explanations for each step. [/INST]', prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=1.2, repetition_penalty=1.0, temperature=0.1, top_p=0.95, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>', '</s><s>[INST]'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [1, 1, 518, 25580, 29962, 3532, 14816, 29903, 6778, 13, 13, 29966, 829, 14816, 29903, 6778, 13, 13, 3399, 263, 6996, 269, 21040, 3748, 297, 3017, 29892, 2367, 7309, 800, 363, 1269, 4331, 29889, 518, 29914, 25580, 29962], lora_request: None.
The same question arises here: "top_k" in "Download prompts and parameters" is 50, but in the vLLM log it is -1. That's weird. Summarizing the above: when chat-ui and vLLM are used together, the dialogue output of Llama-2-70b-chat-hf is abnormal.
any progress?