Open majestichou opened 7 months ago
I got a 400 Bad Request when using this endpoint configuration to connect chat-ui to vLLM:
"endpoints": [{
"type" : "openai",
"baseURL": "http://llm:8000/v1"
}],
Logs from ChatUI
03:40:01 8|index | BadRequestError: 400 status code (no body)
03:40:01 8|index | at APIError.generate (file:///app/build/server/chunks/index-8c2ab54f.js:88218:20)
03:40:01 8|index | at OpenAI.makeStatusError (file:///app/build/server/chunks/index-8c2ab54f.js:88999:25)
03:40:01 8|index | at OpenAI.makeRequest (file:///app/build/server/chunks/index-8c2ab54f.js:89038:30)
03:40:01 8|index | at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
03:40:01 8|index | at async file:///app/build/server/chunks/models-f09d9a41.js:289:9
03:40:01 8|index | at async generateFromDefaultEndpoint (file:///app/build/server/chunks/_server.ts-29ca7ec1.js:36:23)
03:40:01 8|index | at async summarize (file:///app/build/server/chunks/_server.ts-29ca7ec1.js:307:10)
03:40:01 8|index | at async file:///app/build/server/chunks/_server.ts-29ca7ec1.js:468:26 {
03:40:01 8|index | status: 400,
03:40:01 8|index | headers: {
03:40:01 8|index | 'content-length': '269',
03:40:01 8|index | 'content-type': 'application/json',
03:40:01 8|index | date: 'Fri, 15 Mar 2024 03:40:00 GMT',
03:40:01 8|index | server: 'uvicorn'
03:40:01 8|index | },
03:40:01 8|index | error: undefined,
03:40:01 8|index | code: undefined,
03:40:01 8|index | param: undefined,
03:40:01 8|index | type: undefined
03:40:01 8|index | }
Any idea?
You should set the id in .env.local; the id depends on your vLLM config (it has to match the model name that vLLM is serving).
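For illustration, a minimal sketch (the model path and baseURL below are assumptions taken from the config shared later in this thread): vLLM normally serves the model under the value passed to --model (unless a served model name is configured separately), so the MODELS entry needs an id equal to that name and a baseURL pointing at the vLLM server, e.g.:

MODELS=`[
  {
    "name": "/data/models/Llama-2-70b-chat-hf/",
    "id": "/data/models/Llama-2-70b-chat-hf/",
    "endpoints": [{
      "type": "openai",
      "baseURL": "http://llm:8000/v1"
    }]
  }
]`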
I used chat-ui-db (https://github.com/huggingface/chat-ui/pkgs/container/chat-ui-db) as the front end and vLLM (https://github.com/vllm-project/vllm) as the back end for large language model inference, with the Llama-2-70b-chat-hf model weights. The content of the .env.local file is as follows:
# Use .env.local to change these variables
# DO NOT EDIT THIS FILE WITH SENSITIVE DATA
MONGODB_DB_NAME=chat-ui
MONGODB_DIRECT_CONNECTION=false
COOKIE_NAME=hf-chat
HF_TOKEN=#hf_<token> from from https://huggingface.co/settings/token
HF_API_ROOT=https://api-inference.huggingface.co/models
OPENAI_API_KEY=#your openai api key here
HF_ACCESS_TOKEN=#LEGACY! Use HF_TOKEN instead

# used to activate search with web functionality. disabled if none are defined. choose one of the following:
YDC_API_KEY=#your docs.you.com api key here
SERPER_API_KEY=#your serper.dev api key here
SERPAPI_KEY=#your serpapi key here
SERPSTACK_API_KEY=#your serpstack api key here
USE_LOCAL_WEBSEARCH=#set to true to parse google results yourself, overrides other API keys
SEARXNG_QUERY_URL=# where '<query>' will be replaced with query keywords see https://docs.searxng.org/dev/search_api.html eg https://searxng.yourdomain.com/search?q=<query>&engines=duckduckgo,google&format=json

WEBSEARCH_ALLOWLIST=`[]` # if it's defined, allow websites from only this list.
WEBSEARCH_BLOCKLIST=`[]` # if it's defined, block websites from this list.

# Parameters to enable open id login
OPENID_CONFIG=`{
  "PROVIDER_URL": "",
  "CLIENT_ID": "",
  "CLIENT_SECRET": "",
  "SCOPES": ""
}`

# /!\ legacy openid settings, prefer the config above
#OPENID_CLIENT_ID=
#OPENID_CLIENT_SECRET=
#OPENID_SCOPES="openid profile" # Add "email" for some providers like Google that do not provide preferred_username
#OPENID_PROVIDER_URL=https://huggingface.co # for Google, use https://accounts.google.com
#OPENID_TOLERANCE=
#OPENID_RESOURCE=

# Parameters to enable a global mTLS context for client fetch requests
USE_CLIENT_CERTIFICATE=false
CERT_PATH=#
KEY_PATH=#
CA_PATH=#
CLIENT_KEY_PASSWORD=#
REJECT_UNAUTHORIZED=true

TEXT_EMBEDDING_MODELS = `[
  {
    "name": "Xenova/gte-small",
    "displayName": "Xenova/gte-small",
    "description": "Local embedding model running on the server.",
    "chunkCharLength": 512,
    "endpoints": [
      { "type": "transformersjs" }
    ]
  }
]`

# 'name', 'userMessageToken', 'assistantMessageToken' are required
MODELS=`[
  {
    "name": "/data/models/Llama-2-70b-chat-hf/",
    "id": "/data/models/Llama-2-70b-chat-hf/",
    "endpoints": [{
      "type" : "openai",
      "baseURL": "http://textgen:8000/v1",
    }],
    "preprompt": " ",
    "chatPromptTemplate" : "<s>[INST] <<SYS>>\n{{preprompt}}\n<</SYS>>\n\n{{#each messages}}{{#ifUser}}{{content}} [/INST] {{/ifUser}}{{#ifAssistant}}{{content}} </s><s>[INST] {{/ifAssistant}}{{/each}}",
    "promptExamples": [
      {
        "title": "Write an email from bullet list",
        "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
      },
      {
        "title": "Code a snake game",
        "prompt": "Code a basic snake game in python, give explanations for each step."
      },
      {
        "title": "Assist in a task",
        "prompt": "How do I make a delicious lemon cheesecake?"
      }
    ],
    "parameters": {
      "temperature": 0.1,
      "top_p": 0.95,
      "repetition_penalty": 1.2,
      "top_k": 50,
      "truncate": 1000,
      "max_new_tokens": 1024,
      "stop" : ["</s>", "</s><s>[INST]"]
    }
  }
]`

OLD_MODELS=`[]`# any removed models, `{ name: string, displayName?: string, id?: string }`
TASK_MODEL= # name of the model used for tasks such as summarizing title, creating query, etc.

PUBLIC_ORIGIN=#https://huggingface.co
PUBLIC_SHARE_PREFIX=#https://hf.co/chat
PUBLIC_GOOGLE_ANALYTICS_ID=#G-XXXXXXXX / Leave empty to disable
PUBLIC_PLAUSIBLE_SCRIPT_URL=#/js/script.js / Leave empty to disable
PUBLIC_ANNOUNCEMENT_BANNERS=`[
  {
    "title": "Remember that the results generated by the large language model are not 100% accurate. Please decide for yourself whether you want to take the answers from the large language model."
  }
]`

PARQUET_EXPORT_DATASET=
PARQUET_EXPORT_HF_TOKEN=
ADMIN_API_SECRET=# secret to admin API calls, like computing usage stats or exporting parquet data
PARQUET_EXPORT_SECRET=#DEPRECATED, use ADMIN_API_SECRET instead

RATE_LIMIT= # requests per minute
MESSAGES_BEFORE_LOGIN=# how many messages a user can send in a conversation before having to login. set to 0 to force login right away

APP_BASE="" # base path of the app, e.g. /chat, left blank as default
PUBLIC_APP_NAME=WTAGENT # name used as title throughout the app
PUBLIC_APP_ASSETS=chatui # used to find logos & favicons in static/$PUBLIC_APP_ASSETS
PUBLIC_APP_COLOR=blue # can be any of tailwind colors: https://tailwindcss.com/docs/customizing-colors#default-color-palette
PUBLIC_APP_DESCRIPTION=# description used throughout the app (if not set, a default one will be used)
PUBLIC_APP_DATA_SHARING=#set to 1 to enable options & text regarding data sharing
PUBLIC_APP_DISCLAIMER=#set to 1 to show a disclaimer on login page
PUBLIC_APP_DISCLAIMER_MESSAGE="Disclaimer: AI is an area of active research with known problems such as biased generation and misinformation. Do not use this application for high-stakes decisions or advice."
LLM_SUMMERIZATION=true

EXPOSE_API=true

# PUBLIC_APP_NAME=HuggingChat
# PUBLIC_APP_ASSETS=huggingchat
# PUBLIC_APP_COLOR=yellow
# PUBLIC_APP_DESCRIPTION="Making the community's best AI chat models available to everyone."
# PUBLIC_APP_DATA_SHARING=1
# PUBLIC_APP_DISCLAIMER=1

ENABLE_ASSISTANTS=false #set to true to enable assistants feature
ALTERNATIVE_REDIRECT_URLS=`[]` #valide alternative redirect URL for OAuth
WEBHOOK_URL_REPORT_ASSISTANT=#provide webhook url to get notified when an assistant gets reported
ALLOWED_USER_EMAILS=`[]` # if it's defined, only these emails will be allowed to use the app

MONGODB_URL=mongodb://localhost:27017
The content of the docker-compose.yml file is as follows:
services:
  chat-ui:
    container_name: chat-ui
    image: chat-ui-db:latest
    ports:
      - "3000:3000"
    restart: unless-stopped
  textgen:
    container_name: textgen
    image: vllm/vllm-openai:latest
    ports:
      - "8080:8000"
    ipc: host
    environment:
      - TRANSFORMERS_OFFLINE=1
      - HF_DATASET_OFFLINE=1
    command: --model "/data/models/Llama-2-70b-chat-hf/" --tensor-parallel-size 8
    volumes:
      - /home/mnt/test/llm-test/serving/data/models/:/data/models/
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 8
              capabilities: [gpu]
    restart: unless-stopped
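A quick sanity check for this wiring (a sketch; localhost:8080 assumes the host port mapping above, while chat-ui inside the compose network would use http://textgen:8000/v1 instead) is to list the models the vLLM OpenAI-compatible server actually exposes, since that id is what chat-ui must send:

# Minimal check of the vLLM OpenAI-compatible server: list the served models.
# The "id" returned here is the value chat-ui must use as the model id.
import requests

resp = requests.get("http://localhost:8080/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # expected: /data/models/Llama-2-70b-chat-hf/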
I started the services with the docker compose up command, then opened localhost:3000 in the Chrome browser and typed "what can you do?" into the chat box of chat-ui. The output of Llama-2-70b-chat-hf is as follows:
As a text-based AI assistant, I can help with a variety of tasks. Here are some examples of what I can do: 1. Answer questions: I can answer questions on a wide range of topics, from science and history to entertainment and culture. 2. Provide definitions: If you're unsure of the meaning of a word or phrase, I can provide definitions and explanations. 3. Translate text: I can translate text from one language to another. I currently support translations in dozens of languages. 4. Summarize content: If you have a long piece of text and want to get a quick summary of the main points, I can help with that. 5. Offer suggestions: If you're stuck on a problem or need ideas for something, I can offer suggestions and ideas to help you out. 6. Chat: I can have a conversation with you, answering your questions and engaging in discussion on a wide range of topics. 7. Generate text: I can generate text based on prompts or topics, which can be useful for writing articles, creating content, or even composing emails or messages. 8 . Check grammar and spelling :I can help you catch grammatical errors ,spelling mistakes ,and punctuation errors in your text . 9 . Provide synonyms :If you want to avoid using the same word over and over again ,I cant suggest synonyms that convey the same meaning . 10 . Converse in different languages :I am capable conversing in multiple languages including English ,Spanish ,French among others .Please let me know if there is anything specific way i could assist
As the output shows, the formatting becomes abnormal after the eighth item. I then clicked "Download prompts and parameters"; the content is as follows:
{
  "note": "This is a preview of the prompt that will be sent to the model when retrying the message. It may differ from what was sent in the past if the parameters have been updated since",
  "prompt": "<s>[INST] <<SYS>>\n \n<</SYS>>\n\nwhat can you do? [/INST] ",
  "model": "/data/models/Llama-2-70b-chat-hf/",
  "parameters": {
    "temperature": 0.1,
    "truncate": 1000,
    "max_new_tokens": 1024,
    "stop": [ "</s>", "</s><s>[INST]" ],
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.2,
    "stop_sequences": [ "</s>", "</s><s>[INST]" ],
    "return_full_text": false
  }
}
The vLLM output log is as follows:
textgen | INFO 03-08 17:33:29 async_llm_engine.py:436] Received request cmpl-79987e7082494898af14502cd2e9a2f7: prompt: '<s>[INST] <<SYS>>\n\n<</SYS>>\n\nwhat can you do? [/INST]', prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=1.2, repetition_penalty=1.0, temperature=0.1, top_p=0.95, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>', '</s><s>[INST]'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [1, 1, 518, 25580, 29962, 3532, 14816, 29903, 6778, 13, 13, 29966, 829, 14816, 29903, 6778, 13, 13, 5816, 508, 366, 437, 29973, 518, 29914, 25580, 29962], lora_request: None.
There's a question here: "top_k" in "Download prompts and parameters" is 50, but in the vLLM log it is -1. That's weird.
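One possible explanation, offered as an assumption rather than a confirmed cause: top_k and repetition_penalty are not part of the standard OpenAI API, so an OpenAI-style client may not forward them at all; -1 and 1.0 are simply vLLM's SamplingParams defaults, and the frequency_penalty=1.2 in the log suggests the 1.2 repetition penalty was forwarded as the OpenAI frequency_penalty field instead. A minimal sketch (assuming the host port mapping 8080 -> 8000 from the compose file) to send the same question straight to vLLM with the intended sampling parameters and compare the SamplingParams line it logs:

# Sketch: call vLLM's OpenAI-compatible chat endpoint directly with the same
# parameters shown in "Download prompts and parameters". top_k and
# repetition_penalty are non-standard OpenAI fields; whether vLLM accepts
# them as extras depends on the vLLM version, so check the logged
# SamplingParams for this request.
import requests

payload = {
    "model": "/data/models/Llama-2-70b-chat-hf/",
    "messages": [{"role": "user", "content": "what can you do?"}],
    "temperature": 0.1,
    "top_p": 0.95,
    "max_tokens": 1024,
    "stop": ["</s>", "</s><s>[INST]"],
    "top_k": 50,                # non-standard field, may be ignored
    "repetition_penalty": 1.2,  # non-standard field, may be ignored
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])

If the direct call produces clean output, the degradation more likely comes from how chat-ui builds the request (prompt template plus parameter mapping) than from vLLM itself.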
Then I asked another question, typing "Code a basic snake game in python, give explanations for each step." into the chat box. The output of Llama-2-70b-chat-hf is as follows:
# Import the necessary libraries import pygame import time # Initialize pygame pygame.init() # Set the screen dimensions screen_width = 640 screen_height = 480 screen = pygame.display.set_mode((screen_width, screen_height)) pygame.display.set_caption("Snake Game") # set window title)"]])} # set window title)"]])} # set window title)"]})] # set window title)")} # set window title}")"}]]} ]))) }}}}" />)]]]]]]]) }}}}" />)]]]]]]]) }}}}" />)]]]]]"]) }}}}" />)]]]]]"]) }}}}" />)]]}]]} ]}}}"> ]]"}"> ]]"}"> ]]"}"> ]]"}"> ]]"}> [[[[[[]]])> [[[[[[]]])> [[[[[[]]])> [[
The output of Llama-2-70b-chat-hf is completely wrong. I then clicked "Download prompts and parameters"; the content is as follows:
{
  "note": "This is a preview of the prompt that will be sent to the model when retrying the message. It may differ from what was sent in the past if the parameters have been updated since",
  "prompt": "<s>[INST] <<SYS>>\n \n<</SYS>>\n\nCode a basic snake game in python, give explanations for each step. [/INST] ",
  "model": "/data/models/Llama-2-70b-chat-hf/",
  "parameters": {
    "temperature": 0.1,
    "truncate": 1000,
    "max_new_tokens": 1024,
    "stop": [ "</s>", "</s><s>[INST]" ],
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.2,
    "stop_sequences": [ "</s>", "</s><s>[INST]" ],
    "return_full_text": false
  }
}
The vLLM output log is as follows:
textgen | INFO 03-08 18:06:42 async_llm_engine.py:436] Received request cmpl-aa52bf8ea8684e36846defd5e5a3f7be: prompt: '<s>[INST] <<SYS>>\n\n<</SYS>>\n\nCode a basic snake game in python, give explanations for each step. [/INST]', prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=1.2, repetition_penalty=1.0, temperature=0.1, top_p=0.95, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>', '</s><s>[INST]'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [1, 1, 518, 25580, 29962, 3532, 14816, 29903, 6778, 13, 13, 29966, 829, 14816, 29903, 6778, 13, 13, 3399, 263, 6996, 269, 21040, 3748, 297, 3017, 29892, 2367, 7309, 800, 363, 1269, 4331, 29889, 518, 29914, 25580, 29962], lora_request: None.
The same question arises here: "top_k" in "Download prompts and parameters" is 50, but in the vLLM log it is -1. That's weird. Summarizing the above: when chat-ui and vLLM are used together, the dialogue output of Llama-2-70b-chat-hf is abnormal.
any progress?