Open lyc728 opened 9 months ago
```
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git -b v0.7.0
git lfs install
git submodule update --init --recursive
```

Use the Dockerfile to build the backend in a container. For x86_64:

```
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
```

Now I am using version 0.7.0, and I build the engine with:

```
python build.py --hf_model_dir /data/LLM/Translation/llms/Qwen-7B-Chat/ --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4 --use_inflight_batching --output_dir /data/tensorrtllm_backend13/engine_outputs2_inf/
```

Then I start the Triton server and run:

```
python inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 150 -S
```

GPU memory usage is 33 GB. If I build without `--use_inflight_batching`, GPU memory usage is 12 GB.
Sorry, I am still not able to see what question you want to ask.
This is the new question: when I enable inflight batching, the server uses 33 GB of GPU memory; when I disable it, the server uses 12 GB. Is this kind of GPU memory usage normal?
For inflight batching, we use a paged KV cache and need to allocate a memory pool to hold it. I think that's the reason.
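To put rough numbers on that, here is a back-of-the-envelope sketch in plain Python (not TensorRT-LLM code; the 32-layer / 32-head / 128-dim shapes are the published Qwen-7B config and are assumed here, not read from your engine):

```
# Back-of-the-envelope estimate of paged KV-cache memory, not TensorRT-LLM code.
# The model shapes below are the standard Qwen-7B config (assumed, not read from
# the built engine).
num_layers = 32        # transformer layers in Qwen-7B
num_kv_heads = 32      # Qwen-7B uses plain MHA, so KV heads == attention heads
head_dim = 128         # hidden_size 4096 / 32 heads
bytes_per_elem = 2     # KV cache kept in fp16 even for int4 weight-only engines

# One K vector and one V vector per layer, per head, per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # ~512 KiB

# A pre-allocated pool of ~20 GiB (roughly the 33 GB - 12 GB gap you observed)
# therefore holds on the order of:
pool_bytes = 20 * 1024**3
print(f"Tokens in a 20 GiB pool: {pool_bytes // kv_bytes_per_token:,}")  # ~41k tokens
```

So the extra ~21 GB is mostly the pool itself, which is sized from free GPU memory at server start-up rather than from the requests actually in flight; if your backend version exposes `kv_cache_free_gpu_mem_fraction` in the tensorrt_llm model's `config.pbtxt`, lowering it caps the pool size.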
Hello, I ran a multi-threaded concurrent-request test here, and found that whether Triton is configured for inflight batching or V1, the total time is almost unchanged.
```
import concurrent.futures
import random
import subprocess
import time

start = time.time()

def send_request(command):
    process = subprocess.Popen(command, shell=True)
    process.wait()
    return f"Command '{command}' executed successfully."

if __name__ == "__main__":
    texts = []  # prompt strings to sample from (populate before running)
    commands = [
        "python inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 100 -S --text '{}' --request-id {:d}".format(random.choice(texts), i)
        for i in range(1, 1000)
    ]

    with concurrent.futures.ThreadPoolExecutor() as executor:
        # print(commands)
        results = list(executor.map(send_request, commands))

    for result in results:
        print(result)

    print(time.time() - start)
```
At https://github.com/NVIDIA/TensorRT-LLM/tree/v0.7.0/examples/qwen there is no In-flight Batching entry in the Support Matrix, and v0.7.1 does not list In-flight Batching in the Support Matrix either. However, https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/qwen does list In-flight Batching as supported. I wonder whether it really works.
Inflight batching doesn't always provide an improvement in throughput and/or latency. It really depends on your dataset. Datasets with uniform input and output lengths are less likely to benefit from inflight batching.
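As a toy illustration (a plain-Python simulation, not TensorRT-LLM behaviour; it assumes one token per decoding step, a fixed number of batch slots, and zero scheduling overhead), compare static batching, which holds every slot until the longest request in the batch finishes, with in-flight batching, which refills a slot as soon as a request completes:

```
import random

def static_batching_steps(lengths, batch_size):
    """V1-style static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def inflight_batching_steps(lengths, batch_size):
    """In-flight batching: a slot is refilled as soon as its request finishes."""
    pending = list(lengths)
    active = []
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop())
        steps += 1                                 # one decoding step for all active requests
        active = [n - 1 for n in active if n > 1]  # drop requests that just emitted their last token
    return steps

random.seed(0)
uniform = [100] * 64                                   # every request wants 100 output tokens
varied = [random.randint(10, 190) for _ in range(64)]  # similar mean length, high variance

for name, lengths in (("uniform", uniform), ("varied", varied)):
    print(name,
          "static:", static_batching_steps(lengths, batch_size=8),
          "inflight:", inflight_batching_steps(lengths, batch_size=8))
```

With uniform lengths both schedulers take the same number of steps, while with varied lengths in-flight batching wins because no slot sits idle waiting for the longest request in its batch to finish.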
Hello, thank you for your reply. I have another question: when I use Triton directly with the gpt model type set to V1, does it still provide a speedup compared to running TensorRT-LLM on its own? If so, where does that speedup come from?
Sorry, I don't understand your question. Please follow the issue template to share the reproduction steps and the issue you observed.