triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

infer_request.exec() run slowly #7413

Open CallmeZhangChenchen opened 3 months ago

CallmeZhangChenchen commented 3 months ago

Description: infer_request.exec() runs slowly

Triton Information: nvcr.io/nvidia/tritonserver:24.05-py3

To Reproduce

/usr/src/tensorrt/bin/trtexec --onnx=test_static.onnx --builderOptimizationLevel=5 --useCudaGraph --noDataTransfers --useSpinWait --fp16 --saveEngine=test.engine

GPU latency reported by trtexec:

[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.8825 ms - Host latency: 1.8825 ms (enqueue 0.00217285 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88 ms - Host latency: 1.88 ms (enqueue 0.00224609 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.87954 ms - Host latency: 1.87954 ms (enqueue 0.00212402 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.87886 ms - Host latency: 1.87886 ms (enqueue 0.00209961 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88 ms - Host latency: 1.88 ms (enqueue 0.00227051 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88022 ms - Host latency: 1.88022 ms (enqueue 0.0020752 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.87974 ms - Host latency: 1.87974 ms (enqueue 0.00229492 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88054 ms - Host latency: 1.88054 ms (enqueue 0.00217285 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.87844 ms - Host latency: 1.87844 ms (enqueue 0.00214844 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88047 ms - Host latency: 1.88047 ms (enqueue 0.00212402 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88176 ms - Host latency: 1.88176 ms (enqueue 0.0020752 ms)
[07/04/2024-01:59:40] [I] Average on 10 runs - GPU latency: 1.88079 ms - Host latency: 1.88079 ms (enqueue 0.00214844 ms)

Call in BLS mode:

import time

import torch
from torch.utils.dlpack import to_dlpack
import triton_python_backend_utils as pb_utils

totaltime = 0.0
for idx in range(1, 545):
    # Build all inputs as CUDA tensors and hand them to BLS via DLPack (zero-copy).
    y = pb_utils.Tensor.from_dlpack("iy", to_dlpack(torch.randint(50, 500, (1, 529), device='cuda')))
    k = pb_utils.Tensor.from_dlpack("ik", to_dlpack(torch.rand(24, 1, 747, 512, device='cuda')))
    v = pb_utils.Tensor.from_dlpack("iv", to_dlpack(torch.rand(24, 1, 747, 512, device='cuda')))
    xy_pos = pb_utils.Tensor.from_dlpack("ixy_pos", to_dlpack(torch.rand(1, 1, 512, device='cuda')))
    y_len = torch.tensor([155], device='cuda')

    infer_request = pb_utils.InferenceRequest(
        model_name="t2s_sdec",
        inputs=[
            y, k, v,
            pb_utils.Tensor.from_dlpack("iidx", to_dlpack(y_len - 154)),
            xy_pos,
            pb_utils.Tensor.from_dlpack("rand_tensor", to_dlpack(torch.rand(1, 1025, device='cuda'))),
        ],
        requested_output_names=["y", "k", "v", "logits", "samples", "xy_pos"],
    )

    # Time only the BLS call itself.
    b = time.time()
    infer_responses = infer_request.exec()
    print(time.time() - b)
    totaltime = totaltime + (time.time() - b)

Output time per call:

0.009646892547607422s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.009752035140991211s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.009808540344238281s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.009853363037109375s

Expected behavior

Normally, both the inputs and the outputs of infer_request.exec() stay on the GPU, so each call should take about 2 ms, but in practice it takes about 9 ms.
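A minimal sketch (reusing infer_responses from the loop above and assuming a non-decoupled t2s_sdec model) of how to confirm that the returned tensors really stay on the GPU:

from torch.utils.dlpack import from_dlpack

if infer_responses.has_error():
    raise pb_utils.TritonModelException(infer_responses.error().message())

# If an output had been silently copied to host memory, is_cpu() would return True.
out_y = pb_utils.get_output_tensor_by_name(infer_responses, "y")
print("output 'y' on CPU?", out_y.is_cpu())
y_torch = from_dlpack(out_y.to_dlpack())  # zero-copy view when the tensor is on the GPU
print(y_torch.shape, y_torch.device)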

CallmeZhangChenchen commented 3 months ago

I'm guessing the data transfer in my code looks something like this:

In that case, I can only:

CallmeZhangChenchen commented 3 months ago

When I move the input data to the CPU, it takes much longer, so I can rule out the assumption that the inputs are being placed on the CPU:

# Same request, but with every input built as a CPU (numpy) tensor.
y = pb_utils.Tensor('iy', torch.randint(50, 500, (1, 529)).numpy())
k = pb_utils.Tensor('ik', torch.rand(24, 1, 747, 512).numpy())
v = pb_utils.Tensor('iv', torch.rand(24, 1, 747, 512).numpy())
xy_pos = pb_utils.Tensor('ixy_pos', torch.rand(1, 1, 512).numpy())
idx = pb_utils.Tensor('iidx', torch.tensor([1]).numpy())
rand_tensor = pb_utils.Tensor('rand_tensor', torch.rand(1, 1025).numpy())

infer_request = pb_utils.InferenceRequest(
    model_name="t2s_sdec",
    inputs=[y, k, v, idx, xy_pos, rand_tensor],
    requested_output_names=["y", "k", "v", "logits", "samples", "xy_pos"],
)

b = time.time()
infer_responses = infer_request.exec()  # decoupled=True is not used here
print(time.time() - b)
totaltime = totaltime + (time.time() - b)

Output time per call:

0.026983022689819336s
0.02678084373474121s
0.02610945701599121s
0.028659343719482422s
0.023781538009643555s
CallmeZhangChenchen commented 3 months ago

Minimizing the amount of data transferred helps: after halving the input and output data types (float32 -> float16), the per-call latency drops.
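A hypothetical sketch of the fp16 variant (only the two large tensors are shown, and it assumes the t2s_sdec engine was rebuilt with matching FP16 input/output dtypes):

k = pb_utils.Tensor.from_dlpack(
    "ik", to_dlpack(torch.rand(24, 1, 747, 512, device='cuda').half()))
v = pb_utils.Tensor.from_dlpack(
    "iv", to_dlpack(torch.rand(24, 1, 747, 512, device='cuda').half()))

With the fp16 inputs and outputs, the output times become: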

0.006856203079223633s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.005406618118286133s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.006296873092651367s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.005514383316040039s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.006400585174560547s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.005585670471191406s
CallmeZhangChenchen commented 3 months ago

So the internal data transfer should look something like this:

BLS (GPU memory) -> copy -> TensorRT (GPU memory) -> TensorRT core -> TensorRT (GPU memory) -> copy -> BLS (GPU memory)

There seems to be some kind of transfer between the BLS GPU memory and the TensorRT GPU memory that raises the latency from 2 ms to the current 6 ms.

What should I do to get all the way down to 2 ms?

oandreeva-nv commented 3 months ago

Hi @CallmeZhangChenchen , could you please verify that FORCE_CPU_ONLY_INPUT_TENSORS is set to "no" for all your models. Reference
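For reference, this parameter is set per model in config.pbtxt, e.g.:

parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: { string_value: "no" }
}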

CallmeZhangChenchen commented 3 months ago

Hi @oandreeva-nv, thank you for your attention.

All models have FORCE_CPU_ONLY_INPUT_TENSORS set to "no".

I have packaged up a minimal reproduction here (the model itself is too large to attach directly). It includes a README.md; you only need to convert the model and start the service:

https://drive.google.com/file/d/17xGB0dEQ4ybvKUpQOlv8gTczfdBZnIKJ/view?usp=sharing

thanks a million!

CallmeZhangChenchen commented 2 months ago

Calling TensorRT directly through the Python API takes about 4 ms, while infer_request.exec() takes about 6 ms, so I will abandon the BLS model and call TensorRT directly:

import time

begin = time.time()
# Run inference. execute_async_func() and cuda_call() are helper functions,
# presumably adapted from the TensorRT Python samples (common.py).
execute_async_func()
cuda_call(cudart.cudaStreamSynchronize(stream))
print(time.time() - begin)

Output time per call:

0.004782199859619141s
0.004822254180908203s
0.004093170166015625s
0.0040912628173828125s
0.004096508026123047s
0.004090070724487305s
Tabrizian commented 1 month ago

Could you please try increasing --cuda-memory-pool-byte-size to see if it helps? If the CUDA memory pool runs out, the cross-memory data transfers can take longer.
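For example, an assumed launch line that raises the pool on GPU 0 to 256 MB (the model repository path is a placeholder; the default pool size is 64 MB per GPU):

tritonserver --model-repository=/models \
    --cuda-memory-pool-byte-size=0:268435456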