triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

infer_request.exec() run slowly #7413

Open CallmeZhangChenchen opened 3 months ago

CallmeZhangChenchen commented 3 months ago

Description: infer_request.exec() runs slowly

Triton Information: nvcr.io/nvidia/tritonserver:24.05-py3

To Reproduce

/usr/src/tensorrt/bin/trtexec --onnx=test_static.onnx --builderOptimizationLevel=5 --useCudaGraph --noDataTransfers --useSpinWait --fp16 --saveEngine=test.engine

GPU latency reported by trtexec:

[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.8825 ms - Host latency: 1.8825 ms (enqueue 0.00217285 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88 ms - Host latency: 1.88 ms (enqueue 0.00224609 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.87954 ms - Host latency: 1.87954 ms (enqueue 0.00212402 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.87886 ms - Host latency: 1.87886 ms (enqueue 0.00209961 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88 ms - Host latency: 1.88 ms (enqueue 0.00227051 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88022 ms - Host latency: 1.88022 ms (enqueue 0.0020752 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.87974 ms - Host latency: 1.87974 ms (enqueue 0.00229492 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88054 ms - Host latency: 1.88054 ms (enqueue 0.00217285 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.87844 ms - Host latency: 1.87844 ms (enqueue 0.00214844 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88047 ms - Host latency: 1.88047 ms (enqueue 0.00212402 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88176 ms - Host latency: 1.88176 ms (enqueue 0.0020752 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88079 ms - Host latency: 1.88079 ms (enqueue 0.00214844 ms)

Call in BLS mode:

import time

import torch
from torch.utils.dlpack import to_dlpack
import triton_python_backend_utils as pb_utils

totaltime = 0.0
for idx in range(1, 545):
    # Build all inputs as CUDA tensors and hand them to BLS via DLPack (zero-copy).
    y = pb_utils.Tensor.from_dlpack("iy", to_dlpack(torch.randint(50, 500, (1, 529), device='cuda')))
    k = pb_utils.Tensor.from_dlpack("ik", to_dlpack(torch.rand(24, 1, 747, 512, device='cuda')))
    v = pb_utils.Tensor.from_dlpack("iv", to_dlpack(torch.rand(24, 1, 747, 512, device='cuda')))
    xy_pos = pb_utils.Tensor.from_dlpack("ixy_pos", to_dlpack(torch.rand(1, 1, 512, device='cuda')))
    y_len = torch.tensor([155], device='cuda')

    infer_request = pb_utils.InferenceRequest(
        model_name="t2s_sdec",
        inputs=[
            y, k, v,
            pb_utils.Tensor.from_dlpack("iidx", to_dlpack(y_len - 154)),
            xy_pos,
            pb_utils.Tensor.from_dlpack("rand_tensor", to_dlpack(torch.rand(1, 1025, device='cuda'))),
        ],
        requested_output_names=["y", "k", "v", "logits", "samples", "xy_pos"],
    )

    # Time only the BLS call itself.
    b = time.time()
    infer_responses = infer_request.exec()
    print(time.time() - b)
    totaltime = totaltime + (time.time() - b)

Output time per call:

0.009646892547607422s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.009752035140991211s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.009808540344238281s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.009853363037109375s

Expected behavior

Normally, both the inputs and the outputs of infer_request.exec() stay on the GPU, so each call should take about 2 ms, but in practice it takes about 9 ms.
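A minimal sketch (reusing infer_responses from the loop above and assuming a non-decoupled t2s_sdec model) of how to confirm that the returned tensors really stay on the GPU:

from torch.utils.dlpack import from_dlpack

if infer_responses.has_error():
    raise pb_utils.TritonModelException(infer_responses.error().message())

# If an output had been silently copied to host memory, is_cpu() would return True.
out_y = pb_utils.get_output_tensor_by_name(infer_responses, "y")
print("output 'y' on CPU?", out_y.is_cpu())
y_torch = from_dlpack(out_y.to_dlpack())  # zero-copy view when the tensor is on the GPU
print(y_torch.shape, y_torch.device)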

CallmeZhangChenchen commented 3 months ago

I'm guessing the data transfer in my code looks something like this:

In that case, I can only:

CallmeZhangChenchen commented 3 months ago

When I move the input data to the CPU, it takes much longer, so I can rule out the assumption that the inputs are being placed on the CPU:

# Same request, but with every input built as a CPU (numpy) tensor.
y = pb_utils.Tensor('iy', torch.randint(50, 500, (1, 529)).numpy())
k = pb_utils.Tensor('ik', torch.rand(24, 1, 747, 512).numpy())
v = pb_utils.Tensor('iv', torch.rand(24, 1, 747, 512).numpy())
xy_pos = pb_utils.Tensor('ixy_pos', torch.rand(1, 1, 512).numpy())
idx = pb_utils.Tensor('iidx', torch.tensor([1]).numpy())
rand_tensor = pb_utils.Tensor('rand_tensor', torch.rand(1, 1025).numpy())

infer_request = pb_utils.InferenceRequest(
    model_name="t2s_sdec",
    inputs=[y, k, v, idx, xy_pos, rand_tensor],
    requested_output_names=["y", "k", "v", "logits", "samples", "xy_pos"],
)

b = time.time()
infer_responses = infer_request.exec()  # decoupled=True is not used here
print(time.time() - b)
totaltime = totaltime + (time.time() - b)

Output time per call:

0.026983022689819336s
0.02678084373474121s
0.02610945701599121s
0.028659343719482422s
0.023781538009643555s
CallmeZhangChenchen commented 3 months ago

Minimizing the amount of data transferred helps: after halving the input and output data types (float32 -> float16), the per-call latency drops.
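A hypothetical sketch of the fp16 variant (only the two large tensors are shown, and it assumes the t2s_sdec engine was rebuilt with matching FP16 input/output dtypes):

k = pb_utils.Tensor.from_dlpack(
    "ik", to_dlpack(torch.rand(24, 1, 747, 512, device='cuda').half()))
v = pb_utils.Tensor.from_dlpack(
    "iv", to_dlpack(torch.rand(24, 1, 747, 512, device='cuda').half()))

With the fp16 inputs and outputs, the output times become: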

0.006856203079223633s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.005406618118286133s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.006296873092651367s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.005514383316040039s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.006400585174560547s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.005585670471191406s
CallmeZhangChenchen commented 3 months ago

So the internal data transfer should look something like this:

BLS (GPU memory) -> copy -> TensorRT (GPU memory) -> TensorRT core -> TensorRT (GPU memory) -> copy -> BLS (GPU memory)

There seems to be some kind of transfer between the BLS GPU memory and the TensorRT GPU memory that raises the latency from 2 ms to the current 6 ms.

What should I do to get all the way down to 2 ms?

oandreeva-nv commented 3 months ago

Hi @CallmeZhangChenchen , could you please verify that FORCE_CPU_ONLY_INPUT_TENSORS is set to "no" for all your models. Reference
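For reference, this parameter is set per model in config.pbtxt, e.g.:

parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: { string_value: "no" }
}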

CallmeZhangChenchen commented 3 months ago

Hi @oandreeva-nv, thank you for your attention.

All models have FORCE_CPU_ONLY_INPUT_TENSORS set to "no".

I have packaged up a minimal reproduction here (the model itself is too large to attach directly). It includes a README.md; you only need to convert the model and start the service:

https://drive.google.com/file/d/17xGB0dEQ4ybvKUpQOlv8gTczfdBZnIKJ/view?usp=sharing

thanks a million!

CallmeZhangChenchen commented 2 months ago

Calling TensorRT directly through the Python API takes about 4 ms, while infer_request.exec() takes about 6 ms, so I will abandon the BLS model and call TensorRT directly:

import time

begin = time.time()
# Run inference. execute_async_func() and cuda_call() are helper functions,
# presumably adapted from the TensorRT Python samples (common.py).
execute_async_func()
cuda_call(cudart.cudaStreamSynchronize(stream))
print(time.time() - begin)

Output time per call:

0.004782199859619141s
0.004822254180908203s
0.004093170166015625s
0.0040912628173828125s
0.004096508026123047s
0.004090070724487305s
Tabrizian commented 1 month ago

Could you please try increasing --cuda-memory-pool-byte-size to see if it helps? If the CUDA memory pool runs out, the cross-memory data transfers can take longer.
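For example, an assumed launch line that raises the pool on GPU 0 to 256 MB (the model repository path is a placeholder; the default pool size is 64 MB per GPU):

tritonserver --model-repository=/models \
    --cuda-memory-pool-byte-size=0:268435456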