triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

the inflight_batcher_llm_client does not reproduce #315

Open lyc728 opened 9 months ago

byshiue commented 9 months ago

Sorry, I cannot understand your question. Please follow the issue template to share the reproduction steps and the issue you observed.

lyc728 commented 9 months ago

git clone https://github.com/triton-inference-server/tensorrtllm_backend.git -b v0.7.0
git lfs install
git submodule update --init --recursive

Use the Dockerfile to build the backend in a container. For x86_64:

DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

I am now using version 0.7.0, and I build the engine with:

python build.py --hf_model_dir /data/LLM/Translation/llms/Qwen-7B-Chat/ --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4 --use_inflight_batching --output_dir /data/tensorrtllm_backend13/engine_outputs2_inf/

Then I start the Triton server and run:

python inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 150 -S

The GPU memory usage is 33 GB; if I build without --use_inflight_batching, it is 12 GB.

byshiue commented 9 months ago

Sorry, I am still not able to see what question you want to ask.

lyc728 commented 9 months ago

(Same reproduction steps as above.)

This is the new question: when I build with inflight batching, the server uses 33 GB of GPU memory; when I build without it, the server uses 12 GB. Is this kind of GPU memory usage normal?

byshiue commented 9 months ago

For inflight batching, we use a paged KV cache and need to allocate a memory pool to hold it. I think that's the reason.
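
For a rough sense of scale, here is a back-of-the-envelope sketch of how large such a KV-cache pool can get. The model dimensions below are assumptions for a 7B-class model (they are not read from the actual Qwen engine), and the runtime decides the real pool size, typically based on the free GPU memory:

```python
# Back-of-the-envelope estimate of paged KV-cache size.
# All model dimensions are assumptions (roughly 7B-class, MHA, fp16 KV cache);
# the real pool size is chosen by the TensorRT-LLM runtime, not by this math.

NUM_LAYERS = 32       # transformer layers (assumed)
NUM_KV_HEADS = 32     # KV heads, no GQA (assumed)
HEAD_DIM = 128        # per-head dimension (assumed)
BYTES_PER_ELEM = 2    # fp16 KV cache

def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to hold keys and values for `num_tokens` tokens."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * num_tokens

print(f"~{kv_cache_bytes(1) / 1024**2:.2f} MiB per token")                   # ~0.50 MiB
print(f"~{kv_cache_bytes(40_000) / 1024**3:.1f} GiB for a 40k-token pool")   # ~19.5 GiB
```

Under these assumptions, a pool sized for tens of thousands of tokens is already on the order of 20 GiB, which is roughly the gap between the 12 GB and 33 GB numbers reported above.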

lyc728 commented 9 months ago

Hello, I ran a multi-threaded concurrent-request test here and found that whether Triton is configured for inflight batching or V1, the total time is almost unchanged.

```python
import concurrent.futures
import random
import subprocess
import time

start = time.time()

def send_request(command):
    # Launch one client process per request and wait for it to finish.
    process = subprocess.Popen(command, shell=True)
    process.wait()
    return f"Command '{command}' executed successfully."

if __name__ == "__main__":
    texts = []  # prompts to send; must be non-empty for random.choice() below
    commands = [
        "python inflight_batcher_llm/client/inflight_batcher_llm_client.py "
        "--request-output-len 100 -S --text '{}' --request-id {:d}".format(
            random.choice(texts), i)
        for i in range(1, 1000)
    ]

    with concurrent.futures.ThreadPoolExecutor() as executor:
        # print(commands)
        results = list(executor.map(send_request, commands))

    for result in results:
        print(result)
    print(time.time() - start)
```
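
One thing to keep in mind about the script above: it launches a separate Python client process per request, so interpreter start-up and connection setup can dominate the measured time and hide any difference between V1 and inflight batching. Below is a minimal in-process alternative, a sketch that assumes the deployment exposes the standard `ensemble` model through Triton's HTTP generate endpoint; the field names follow the backend README and may need adjusting for your config:

```python
# Sketch of an in-process load test against Triton's HTTP generate endpoint.
# Assumes the standard "ensemble" model is deployed; adjust URL, model name,
# and field names to match your actual configuration.
import concurrent.futures
import json
import time
import urllib.request

URL = "http://localhost:8000/v2/models/ensemble/generate"  # adjust host/port

def send_request(text: str) -> str:
    payload = json.dumps({
        "text_input": text,
        "max_tokens": 100,
        "bad_words": "",
        "stop_words": "",
    }).encode("utf-8")
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        # "text_output" is the ensemble's output name; adjust if yours differs.
        return json.loads(resp.read())["text_output"]

if __name__ == "__main__":
    prompts = ["Translate to English: 你好"] * 200  # placeholder prompts
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
        results = list(executor.map(send_request, prompts))
    print(f"{len(results)} requests in {time.time() - start:.1f}s")
```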
lyc728 commented 9 months ago

In https://github.com/NVIDIA/TensorRT-LLM/tree/v0.7.0/examples/qwen, In-flight Batching is not listed in the Support Matrix, and the same is true for v0.7.1. However, https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/qwen does list In-flight Batching as supported. I wonder whether it really works.

pcastonguay commented 9 months ago

Inflight batching doesn't always provide an improvement in throughput and/or latency. It really depends on your dataset. Datasets with uniform input and output lengths are less likely to benefit from inflight batching, since all requests in a static batch then finish at roughly the same time and leave few idle slots for new requests to backfill.

lyc728 commented 9 months ago

Hello, thank you for your reply. I also have a question: when I use Triton directly with the GPT model type set to V1, can it still provide any speedup compared with running TensorRT-LLM on its own, and where would that show up?