CallmeZhangChenchen opened this issue 3 months ago
I'm guessing the data transfer in my code looks something like the flow described further down.
In that case, I can only test it indirectly: when I move the input data to the CPU it takes a lot of time, so I rule out the assumption that the input is placed on the CPU.
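For reference, a quick way to confirm where the inputs actually arrive is the is_cpu() flag on the input tensor. A minimal sketch of a Python backend model that only reports the placement ("iy" is assumed to be one of the model's declared inputs):

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # "iy" is assumed to be one of this model's declared inputs
            iy = pb_utils.get_input_tensor_by_name(request, "iy")
            print("input 'iy' is on CPU:", iy.is_cpu())
            # ... the real BLS logic would go here ...
            responses.append(pb_utils.InferenceResponse(output_tensors=[]))
        return responses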
import time
import torch
import triton_python_backend_utils as pb_utils

# Build the BLS request inputs (numpy arrays wrapped in pb_utils.Tensor)
y = pb_utils.Tensor('iy', torch.randint(50, 500, (1, 529)).numpy())
k = pb_utils.Tensor('ik', torch.rand(24, 1, 747, 512).numpy())
v = pb_utils.Tensor('iv', torch.rand(24, 1, 747, 512).numpy())
xy_pos = pb_utils.Tensor('ixy_pos', torch.rand(1, 1, 512).numpy())
idx = pb_utils.Tensor('iidx', torch.tensor([1]).numpy())
rand_tensor = pb_utils.Tensor('rand_tensor', torch.rand(1, 1025).numpy())

infer_request = pb_utils.InferenceRequest(
    model_name="t2s_sdec",
    inputs=[y, k, v, idx, xy_pos, rand_tensor],
    requested_output_names=["y", "k", "v", "logits", "samples", "xy_pos"],
)

# Time a single BLS call (totaltime is accumulated across calls)
b = time.time()
infer_responses = infer_request.exec()  # pass decoupled=True for decoupled models
print(time.time() - b)
totaltime = totaltime + (time.time() - b)
0.026983022689819336s
0.02678084373474121s
0.02610945701599121s
0.028659343719482422s
0.023781538009643555s
Minimizing the amount of data transferred does help. After halving the input and output data types (float32 -> float16):
0.006856203079223633s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.005406618118286133s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.006296873092651367s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.005514383316040039s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.006400585174560547s
torch.Size([1, 529]) torch.Size([24, 1, 747, 512])
0.005585670471191406s
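The change itself is only a cast of the numpy arrays before wrapping them in pb_utils.Tensor. A minimal sketch, assuming the t2s_sdec TensorRT engine was also rebuilt with FP16 inputs and outputs:

import numpy as np
import torch
import triton_python_backend_utils as pb_utils

# Halve the payload by casting to float16 before wrapping it in pb_utils.Tensor
k_np = torch.rand(24, 1, 747, 512).numpy().astype(np.float16)
v_np = torch.rand(24, 1, 747, 512).numpy().astype(np.float16)
k = pb_utils.Tensor('ik', k_np)
v = pb_utils.Tensor('iv', v_np)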
So the internal data transfer should look something like this:
BLS (GPU memory) -> copy to TensorRT (GPU memory) -> TensorRT core -> copy back to BLS (GPU memory)
There must be some kind of transfer between the BLS GPU memory and the TensorRT GPU memory that pushes the latency from 2 ms up to the current 6 ms.
What should I do to get down to the ideal 2 ms?
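For what it's worth, the python_backend API also allows building the request inputs directly from GPU torch tensors via DLPack, which skips the numpy round trip through host memory. A minimal sketch (only the 'ik' input is shown, and whether this removes the remaining gap here is untested):

import torch
import triton_python_backend_utils as pb_utils
from torch.utils.dlpack import to_dlpack, from_dlpack

# Build a request input directly from a GPU torch tensor (no numpy/host copy)
k_gpu = torch.rand(24, 1, 747, 512, device='cuda')
k = pb_utils.Tensor.from_dlpack('ik', to_dlpack(k_gpu))

infer_request = pb_utils.InferenceRequest(
    model_name="t2s_sdec",
    inputs=[k],  # the other inputs would be built the same way
    requested_output_names=["k"],
)
infer_response = infer_request.exec()

# Read the output back as a GPU torch tensor, again without a host copy
out_k = pb_utils.get_output_tensor_by_name(infer_response, "k")
k_out_gpu = from_dlpack(out_k.to_dlpack())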
Hi @CallmeZhangChenchen, could you please verify that FORCE_CPU_ONLY_INPUT_TENSORS is set to "no" for all your models? Reference
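For reference, the setting is a model config parameter, i.e. in each model's config.pbtxt:

parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}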
Hi @oandreeva-nv, thank you for your attention.
All models have it set to "no".
I have packed up a minimal reproduction here (mainly because the model is too large). It also contains a README.md; you just need to convert the model and start the server:
https://drive.google.com/file/d/17xGB0dEQ4ybvKUpQOlv8gTczfdBZnIKJ/view?usp=sharing
Thanks a million!
Calling TensorRT directly through its Python API takes 4 ms, while infer_request.exec() takes 6 ms, so I will abandon the BLS model and call the TensorRT API directly.
import time

# execute_async_func() and cuda_call() are presumably the TensorRT Python
# sample helpers; execute_async_func enqueues the engine on `stream`.
begin = time.time()
execute_async_func()                              # run inference
cuda_call(cudart.cudaStreamSynchronize(stream))   # wait for the stream to finish
print(time.time() - begin)
0.004782199859619141s
0.004822254180908203s
0.004093170166015625s
0.0040912628173828125s
0.004096508026123047s
0.004090070724487305s
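For completeness, the surrounding setup in the direct-TensorRT test looks roughly like this (a sketch based on the TensorRT Python API; the engine path is an assumption and buffer allocation/binding is omitted):

import time
import tensorrt as trt
from cuda import cudart

logger = trt.Logger(trt.Logger.WARNING)
with open("t2s_sdec.plan", "rb") as f:   # engine path is an assumption
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
err, stream = cudart.cudaStreamCreate()

# Device buffers are allocated elsewhere with cudaMalloc and bound once via
# context.set_tensor_address(name, device_ptr) for every input and output.

begin = time.time()
context.execute_async_v3(stream)          # enqueue the whole engine
cudart.cudaStreamSynchronize(stream)      # wait for completion
print(time.time() - begin)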
Could you please try increasing --cuda-memory-pool-byte-size to see if it helps? If the CUDA memory pool runs out, the cross-memory data transfers can take longer.
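For example (the pool size here is only an illustration, 256 MB on GPU 0):

tritonserver --model-repository=/models --cuda-memory-pool-byte-size=0:268435456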
Description
infer_request.exec() runs slowly.
Triton Information
nvcr.io/nvidia/tritonserver:24.05-py3
To Reproduce
Measure the GPU latency when calling in BLS mode and print the output time.
Expected behavior
Normally the input of infer_request.exec() is on the GPU and the output is on the GPU, so the time should be about 2 ms, but it is actually 9 ms.