Closed by IAINATDBI 5 months ago
Hi @IAINATDBI,
It's a bit messy right now and will be improved over time, but I'll try to answer based on current details.
Can you confirm it's the llama2-7b-hf model that is pulled?
The logs should show the HuggingFace ID being used where applicable; in this case it is looking specifically for meta-llama/Llama-2-7b-hf
from HuggingFace. It will check your local HF cache to see if the model has already been downloaded under that identifier:
triton - INFO - Known model source found for 'llama-2-7b': 'hf:meta-llama/Llama-2-7b-hf'
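For reference, the HF hub cache keys each repo by its ID, so the cache check above boils down to looking for a specific folder on disk. A minimal sketch of that mapping (illustrative only, not the CLI's actual code):

```python
# Hypothetical illustration: how a HuggingFace repo ID maps to a folder in
# the local HF hub cache (huggingface_hub stores repos as "models--<org>--<name>").
from pathlib import Path

def hf_cache_dir_for(repo_id: str, cache_root: str = "~/.cache/huggingface/hub") -> Path:
    folder = "models--" + repo_id.replace("/", "--")
    return Path(cache_root).expanduser() / folder

print(hf_cache_dir_for("meta-llama/Llama-2-7b-hf").name)
# -> models--meta-llama--Llama-2-7b-hf
```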
Can you configure what variant is pulled or is it "hard-coded" right now?
The CLI tool itself is built with some of this extensibility in mind, exposing --source
to help users specify custom or unofficially tested models not in the current list of "known models".
To elaborate a bit:
# Short-hand for "known models":
triton import -m llama-2-7b --backend tensorrtllm
# This is the same internally as running:
triton import -m llama-2-7b --source hf:meta-llama/Llama-2-7b-hf --backend tensorrtllm
# If specifying a --source, the name of the -m/--model arg is arbitrary:
triton import -m my-model --source hf:meta-llama/Llama-2-7b-hf --backend tensorrtllm
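To make the `hf:` prefix in those commands concrete, here is a hypothetical sketch of how a --source value could be split into a scheme and an identifier (the real CLI's parsing may differ):

```python
# Hypothetical sketch: split "hf:meta-llama/Llama-2-7b-hf" into a scheme
# ("hf") and a model identifier ("meta-llama/Llama-2-7b-hf").
def parse_source(source: str) -> tuple[str, str]:
    scheme, sep, ident = source.partition(":")
    if not sep:
        raise ValueError(f"Expected '<scheme>:<id>', got: {source!r}")
    return scheme, ident

print(parse_source("hf:meta-llama/Llama-2-7b-hf"))
# -> ('hf', 'meta-llama/Llama-2-7b-hf')
```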
However, building TRT-LLM models requires more special care than vLLM at this time, so TRT-LLM support through the triton
CLI is currently restricted to a few well-known models, because we need to know how to convert the model weights/checkpoints to a TRT-LLM-compatible format.
For vLLM, you should generally be able to set up any model that vLLM supports. So for your chat example, this should work fine for vLLM:
triton import -m my-llama --source hf:meta-llama/Llama-2-7b-chat-hf --backend vllm
For TRT-LLM, it would currently require a code change to support a new model. I quickly checked whether the chat model would work, since it should share most of its logic with the base llama model. For reference:
diff --git a/src/triton_cli/repository.py b/src/triton_cli/repository.py
index bd120d0..bbf3902 100644
--- a/src/triton_cli/repository.py
+++ b/src/triton_cli/repository.py
@@ -79,6 +79,9 @@ SUPPORTED_TRT_LLM_BUILDERS = {
"meta-llama/Llama-2-7b-hf": {
"hf_allow_patterns": ["*.safetensors", "*.json"],
},
+ "meta-llama/Llama-2-7b-chat-hf": {
+ "hf_allow_patterns": ["*.safetensors", "*.json"],
+ },
"gpt2": {
"hf_allow_patterns": ["*.safetensors", "*.json"],
"hf_ignore_patterns": ["onnx/*"],
diff --git a/src/triton_cli/trt_llm/builder.py b/src/triton_cli/trt_llm/builder.py
index e01913a..074b236 100644
--- a/src/triton_cli/trt_llm/builder.py
+++ b/src/triton_cli/trt_llm/builder.py
@@ -3,6 +3,7 @@ import subprocess
CHECKPOINT_MODULE_MAP = {
"meta-llama/Llama-2-7b-hf": "llama",
+ "meta-llama/Llama-2-7b-chat-hf": "llama",
"facebook/opt-125m": "opt",
}
and then this worked for me:
triton -v import -m my-chat-model --source hf:meta-llama/Llama-2-7b-chat-hf --backend tensorrtllm
Overall, we're still figuring out some details around TRT-LLM and how to make it easy for users to bring custom models, or easily understand how to contribute code changes to add support for new models. If you have any feedback, please let us know!
Hi @rmccorm4, @fpetrini15 - thank you for taking the time for the detailed discussion(s). I've been using Triton for a bit now and it's been performing well, so this CLI addition is really great to hear.
I'm keen to use the chat variant so that I can get responses that make sense. Looking forward to hearing about great things to come with Triton!
Cheers
Thanks @IAINATDBI, we've marked down a feature request (DLIS-6367) to expand the llama model support for TRT-LLM to be a bit more generic/flexible. I'll modify the title of the issue to better reflect that and keep it open.
Hi @rmccorm4 - I tried the vLLM approach as described above and I'm getting an error during triton start. It looks like a CUDA memory issue, and it advises increasing gpu_memory_utilization:
Internal: ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (624)
I increased gpu_memory_utilization to 95% in the model.json, but it still fails with the same error. It mentions the KV cache size too.
I'm running on a dual RTX A6000 workstation (so 2 x 48 GB), and we should be able to squeeze the 7B model in. However, I'm aware that you mentioned quantization during your talk, but I'm not aware of the parameter names etc. for this option. There was a suggestion that this feature might be automatic? With quantization I can normally load the 70B variant without issues. Thank you for your help with this.
cheers
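For context on that error, the KV cache for Llama-2-7B costs roughly 0.5 MiB per token in fp16, so a 624-token cap implies only about 312 MiB was left for the cache after weights and activations. A back-of-the-envelope sketch of the arithmetic (using the published Llama-2-7B config; vLLM's actual memory accounting differs, so treat this as an estimate only):

```python
# Rough KV-cache sizing for Llama-2-7B in fp16.
layers, kv_heads, head_dim = 32, 32, 128          # Llama-2-7B config
dtype_bytes = 2                                    # fp16
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
print(bytes_per_token)                             # 524288 bytes = 0.5 MiB/token

free_for_kv = 312 * 2**20                          # ~312 MiB left for KV cache
print(free_for_kv // bytes_per_token)              # -> 624 tokens, matching the error
```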
As a wild guess I tried "quantization": "gptq" in the model.json, and it's now looking for a config file?
Cheers
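(For context on that config-file error: GPTQ checkpoints produced with tools like AutoGPTQ typically ship a quantize_config.json describing the quantization, and loaders generally expect a checkpoint that has already been quantized. An illustrative example of such a file, with typical but not prescriptive values:)

```json
{
  "bits": 4,
  "group_size": 128,
  "desc_act": false,
  "sym": true
}
```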
Here's a shot of nvidia-smi just prior to the server shutting down.
Hey @IAINATDBI, can you open a separate issue to discuss the memory issues and quantization further if needed?
To summarize a few quick points:
1. model.json is just a representation of vLLM's AsyncEngineArgs, so you probably need to try setting tensor_parallel_size to 2 for 2 GPUs. I haven't had a chance to test this myself yet. You can see how we initialize the vLLM engine from these args here. CC @oandreeva-nv for visibility.
2. For the quantization question, this is again just a vLLM detail that we pass through to the vLLM APIs, so the documentation on the vLLM side should apply here for those details. I believe there are also some pre-quantized models hosted on HuggingFace for popular models like Llama2 that may work directly as well. The quantization described in our GTC talk was based primarily around TensorRT-LLM.
Thanks @rmccorm4, I'll raise any further quant questions separately.
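Since model.json maps to vLLM's AsyncEngineArgs, a sketch of the fields discussed in this thread might look like the following (values are illustrative, and field names follow vLLM's engine args rather than anything triton-cli-specific):

```json
{
  "model": "meta-llama/Llama-2-7b-chat-hf",
  "tensor_parallel_size": 2,
  "gpu_memory_utilization": 0.9,
  "max_model_len": 4096
}
```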
However, for this thread: the only way I can get Triton to start without failing on CUDA memory issues (even with gpt2; see the nvidia-smi screenshot above, showing a very memory-hungry stub process) is to launch the Docker container, install triton-cli, load the model, start Triton, and then docker exec into the container (per #50) to issue an infer command. I've even tried it on a different machine.
Hope this helps.
Hi @IAINATDBI ,
We added support for llama-2-7b-chat, as well as llama-3-8b and llama-3-8b-instruct, for both vLLM and TRT-LLM in the latest release associated with Triton 24.04; please check it out: https://github.com/triton-inference-server/triton_cli/releases/tag/0.0.7.
If you're seeing other issues, please raise a separate issue. Closing this issue based on the title about adding "chat" variant support.
Awesome! Thank you, looking forward to checking this out. Cheers.
Successfully ran inference with llama-2-7b. Can you confirm it's the llama2-7b-hf model that is pulled? From the logs it looks like it pulled that one from my cache.
Would the "chat" model not be better for a conversational inference experience? Can you configure what variant is pulled or is it "hard-coded" right now?
Cheers