majestichou opened this issue 8 months ago
Maybe you could try tweaking MAX_TOTAL_TOKENS in TGI? (see docs) I'm not an expert on TGI though. What command are you using to deploy it?
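A rough sketch of what that could look like as a one-off test run (not from this thread; the image tag, model path, and value 4096 simply mirror the ones that appear later in the discussion):

# Hypothetical standalone TGI run with an explicit token limit;
# --max-total-tokens is the launcher flag (it can likely also be set
# via a MAX_TOTAL_TOKENS environment variable).
docker run --gpus all -p 8080:80 \
  -v /home/test/llm-test/serving/data:/data \
  huggingface/text-generation-inference:1.4 \
  --model-id /data/models/meta-llamaLlama-2-70b-chat-hf \
  --max-total-tokens 4096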
I use docker compose to deploy it. The content of docker-compose.yml is as follows:
services:
  chat-ui:
    image: chat-ui-db:latest
    ports:
      - "3000:3000"
    restart: unless-stopped
  textgen:
    image: huggingface/text-generation-inference:1.4
    ports:
      - "8080:80"
    command: ["--model-id", "/data/models/meta-llamaLlama-2-70b-chat-hf"]
    volumes:
      - /home/test/llm-test/serving/data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 8
              capabilities: [gpu]
    restart: unless-stopped
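Not something from this thread, but a quick way to check which limits the running TGI server actually applied is its /info endpoint, which reports fields such as max_input_length and max_total_tokens:

# Query TGI's /info endpoint on the port published by the compose file above.
curl -s http://localhost:8080/info
# The JSON reply includes (among others) "max_input_length" and "max_total_tokens".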
Can you try adding the max total tokens above in your command section?
According to the config of "meta-llama/Llama-2-70b-chat-hf", "max_position_embeddings" is 4096 (a quick way to check this is sketched after the compose file below), so I tried tweaking MAX_TOTAL_TOKENS to 4096 in TGI. The content of docker-compose.yml is as follows:
services:
  chat-ui:
    container_name: chat-ui
    image: chat-ui-db:latest
    ports:
      - "3000:3000"
    restart: unless-stopped
  textgen:
    container_name: textgen
    image: huggingface/text-generation-inference:1.4
    ports:
      - "8080:80"
    command: ["--model-id", "/data/models/meta-llamaLlama-2-70b-chat-hf", "--max-total-tokens", "4096"]
    volumes:
      - /home/mnt/h00562948/llm-test/serving/data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 8
              capabilities: [gpu]
    restart: unless-stopped
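As an aside, here is the quick check of max_position_embeddings mentioned above (a sketch, assuming the model directory referenced in the compose file contains the usual config.json and that the stack is already running):

# Read the context length straight from the model's config.json inside the textgen container.
docker compose exec textgen grep max_position_embeddings /data/models/meta-llamaLlama-2-70b-chat-hf/config.json
# Expected output for Llama-2-70b-chat-hf:
#   "max_position_embeddings": 4096,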
Then I start the service with docker compose up and type the question in the input box. The same error occurs:
chat-ui | {"t":{"$date":"2024-03-06T08:48:58.458+00:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"Connection accepted","attr":{"remote":"127.0.0.1:38924","uuid":{"uuid":{"$uuid":"e7fa75ea-1470-4ebb-8e22-79ed3dba4dc7"}},"connectionId":25,"connectionCount":8}}
chat-ui | {"t":{"$date":"2024-03-06T08:48:58.459+00:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn25","msg":"client metadata","attr":{"remote":"127.0.0.1:38924","client":"conn25","negotiatedCompressors":[],"doc":{"driver":{"name":"nodejs","version":"5.8.0"},"platform":"Node.js v20.11.1, LE","os":{"name":"linux","architecture":"x64","version":"3.10.0-1160.el7.x86_64","type":"Linux"}}}}
textgen | 2024-03-06T08:49:16.377360Z ERROR compat_generate{default_return_full_text=false compute_type=Extension(ComputeType("8-nvidia-a100-sxm4-40gb"))}:generate_stream{parameters=GenerateParameters { best_of: None, temperature: Some(0.1), repetition_penalty: Some(1.2), frequency_penalty: None, top_k: Some(50), top_p: Some(0.95), typical_p: None, do_sample: false, max_new_tokens: Some(1024), return_full_text: Some(false), stop: ["</s>", "</s><s>[INST]"], truncate: Some(3072), watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream: text_generation_router::infer: router/src/infer.rs:123: `truncate` must be strictly positive and less than 1024. Given: 3072
chat-ui | Error: Input validation error: `truncate` must be strictly positive and less than 1024. Given: 3072
chat-ui | at streamingRequest (file:///app/node_modules/@huggingface/inference/dist/index.mjs:323:19)
chat-ui | at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
chat-ui | at async textGenerationStream (file:///app/node_modules/@huggingface/inference/dist/index.mjs:673:3)
chat-ui | at async generateFromDefaultEndpoint (file:///app/.svelte-kit/output/server/entries/endpoints/conversation/_id_/_server.ts.js:39:20)
chat-ui | at async summarize (file:///app/.svelte-kit/output/server/entries/endpoints/conversation/_id_/_server.ts.js:287:10)
chat-ui | at async file:///app/.svelte-kit/output/server/entries/endpoints/conversation/_id_/_server.ts.js:607:26
textgen | 2024-03-06T08:49:16.405328Z ERROR compat_generate{default_return_full_text=false compute_type=Extension(ComputeType("8-nvidia-a100-sxm4-40gb"))}:generate_stream{parameters=GenerateParameters { best_of: None, temperature: Some(0.1), repetition_penalty: Some(1.2), frequency_penalty: None, top_k: Some(50), top_p: Some(0.95), typical_p: None, do_sample: false, max_new_tokens: Some(1024), return_full_text: Some(false), stop: ["</s>", "</s><s>[INST]"], truncate: Some(3072), watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream: text_generation_router::infer: router/src/infer.rs:123: `truncate` must be strictly positive and less than 1024. Given: 3072
I use the docker image chat-ui-db as the frontend, text-generation-inference as the inference backend, and meta-llamaLlama-2-70b-chat-hf as the model. The model settings in my .env.local file are the same as the settings for Llama-2-70b-chat-hf in the .env.template file in the chat-ui repository. When I type a question in the input box, the error shown above appears in the log.
I set "truncate" to 1000, everything is ok. "truncate" for Llama-2-70b-chat-hf in the .env.template file in the chat-ui repository is 3072. I think the 3072 should work fine. I don't know how webpage https://huggingface.co/chat/ sets this parameter.