majestichou opened this issue 8 months ago
Maybe you could try tweaking MAX_TOTAL_TOKENS in TGI? (see docs) I'm not an expert on TGI though. What command are you using to deploy it?
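A rough sketch of what that could look like as a one-off test run (not from this thread; the image tag, model path, and value 4096 simply mirror the ones that appear later in the discussion):

# Hypothetical standalone TGI run with an explicit token limit;
# --max-total-tokens is the launcher flag (it can likely also be set
# via a MAX_TOTAL_TOKENS environment variable).
docker run --gpus all -p 8080:80 \
  -v /home/test/llm-test/serving/data:/data \
  huggingface/text-generation-inference:1.4 \
  --model-id /data/models/meta-llamaLlama-2-70b-chat-hf \
  --max-total-tokens 4096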
I use docker compose to deploy it. The content of docker-compose.yml is as follows:
services:
  chat-ui:
    image: chat-ui-db:latest
    ports:
      - "3000:3000"
    restart: unless-stopped
  textgen:
    image: huggingface/text-generation-inference:1.4
    ports:
      - "8080:80"
    command: ["--model-id", "/data/models/meta-llamaLlama-2-70b-chat-hf"]
    volumes:
      - /home/test/llm-test/serving/data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 8
              capabilities: [gpu]
    restart: unless-stopped
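Not something from this thread, but a quick way to check which limits the running TGI server actually applied is its /info endpoint, which reports fields such as max_input_length and max_total_tokens:

# Query TGI's /info endpoint on the port published by the compose file above.
curl -s http://localhost:8080/info
# The JSON reply includes (among others) "max_input_length" and "max_total_tokens".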
Can you try adding the max total tokens above in your command section?
According to the config of "meta-llama/Llama-2-70b-chat-hf", "max_position_embeddings" is 4096 (a quick way to check this is sketched after the compose file below), so I tried tweaking MAX_TOTAL_TOKENS to 4096 in TGI. The content of docker-compose.yml is as follows:
services:
  chat-ui:
    container_name: chat-ui
    image: chat-ui-db:latest
    ports:
      - "3000:3000"
    restart: unless-stopped
  textgen:
    container_name: textgen
    image: huggingface/text-generation-inference:1.4
    ports:
      - "8080:80"
    command: ["--model-id", "/data/models/meta-llamaLlama-2-70b-chat-hf", "--max-total-tokens", "4096"]
    volumes:
      - /home/mnt/h00562948/llm-test/serving/data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 8
              capabilities: [gpu]
    restart: unless-stopped
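As an aside, here is the quick check of max_position_embeddings mentioned above (a sketch, assuming the model directory referenced in the compose file contains the usual config.json and that the stack is already running):

# Read the context length straight from the model's config.json inside the textgen container.
docker compose exec textgen grep max_position_embeddings /data/models/meta-llamaLlama-2-70b-chat-hf/config.json
# Expected output for Llama-2-70b-chat-hf:
#   "max_position_embeddings": 4096,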
Then I start the service with docker compose up and type the question in the input box. The same error occurs:
chat-ui | {"t":{"$date":"2024-03-06T08:48:58.458+00:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"Connection accepted","attr":{"remote":"127.0.0.1:38924","uuid":{"uuid":{"$uuid":"e7fa75ea-1470-4ebb-8e22-79ed3dba4dc7"}},"connectionId":25,"connectionCount":8}}
chat-ui | {"t":{"$date":"2024-03-06T08:48:58.459+00:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn25","msg":"client metadata","attr":{"remote":"127.0.0.1:38924","client":"conn25","negotiatedCompressors":[],"doc":{"driver":{"name":"nodejs","version":"5.8.0"},"platform":"Node.js v20.11.1, LE","os":{"name":"linux","architecture":"x64","version":"3.10.0-1160.el7.x86_64","type":"Linux"}}}}
textgen | 2024-03-06T08:49:16.377360Z ERROR compat_generate{default_return_full_text=false compute_type=Extension(ComputeType("8-nvidia-a100-sxm4-40gb"))}:generate_stream{parameters=GenerateParameters { best_of: None, temperature: Some(0.1), repetition_penalty: Some(1.2), frequency_penalty: None, top_k: Some(50), top_p: Some(0.95), typical_p: None, do_sample: false, max_new_tokens: Some(1024), return_full_text: Some(false), stop: ["</s>", "</s><s>[INST]"], truncate: Some(3072), watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream: text_generation_router::infer: router/src/infer.rs:123: `truncate` must be strictly positive and less than 1024. Given: 3072
chat-ui | Error: Input validation error: `truncate` must be strictly positive and less than 1024. Given: 3072
chat-ui | at streamingRequest (file:///app/node_modules/@huggingface/inference/dist/index.mjs:323:19)
chat-ui | at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
chat-ui | at async textGenerationStream (file:///app/node_modules/@huggingface/inference/dist/index.mjs:673:3)
chat-ui | at async generateFromDefaultEndpoint (file:///app/.svelte-kit/output/server/entries/endpoints/conversation/_id_/_server.ts.js:39:20)
chat-ui | at async summarize (file:///app/.svelte-kit/output/server/entries/endpoints/conversation/_id_/_server.ts.js:287:10)
chat-ui | at async file:///app/.svelte-kit/output/server/entries/endpoints/conversation/_id_/_server.ts.js:607:26
textgen | 2024-03-06T08:49:16.405328Z ERROR compat_generate{default_return_full_text=false compute_type=Extension(ComputeType("8-nvidia-a100-sxm4-40gb"))}:generate_stream{parameters=GenerateParameters { best_of: None, temperature: Some(0.1), repetition_penalty: Some(1.2), frequency_penalty: None, top_k: Some(50), top_p: Some(0.95), typical_p: None, do_sample: false, max_new_tokens: Some(1024), return_full_text: Some(false), stop: ["</s>", "</s><s>[INST]"], truncate: Some(3072), watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream: text_generation_router::infer: router/src/infer.rs:123: `truncate` must be strictly positive and less than 1024. Given: 3072
I use the docker image chat-ui-db as the frontend, text-generation-inference as the inference backend, and meta-llamaLlama-2-70b-chat-hf as the model. The model settings in my .env.local file are the same as the settings for Llama-2-70b-chat-hf in the .env.template file in the chat-ui repository. When I type a question in the input box, the error shown above appears in the log.
I set "truncate" to 1000, everything is ok. "truncate" for Llama-2-70b-chat-hf in the .env.template file in the chat-ui repository is 3072. I think the 3072 should work fine. I don't know how webpage https://huggingface.co/chat/ sets this parameter.