Hi @FawadAbbas12, assuming you used the same decorator on the model to record the execution duration of the execute() function:

def runtime_monitor(some_function):
    from time import time
    def wrapper(*args, **kwargs):
        t1 = time()
        result = some_function(*args, **kwargs)
        end = time() - t1
        # 1/end is the call rate (calls per second), not the elapsed time
        print(f'{some_function.__name__} Time : {1/end}')
        return result
    return wrapper

The duration on the server side only includes execution time, while the duration on the client side includes both execution time and data-transmission time, so this is not an apples-to-apples comparison. We recommend using the Triton Performance Analyzer to measure throughput, so that such mismatched comparisons can be avoided.
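For context, below is a minimal sketch of how such a decorator might be applied inside a Python backend model. Only the TritonPythonModel.execute(requests) structure and the pb_utils calls are the real Python-backend interface; the input/output tensor names and the body are placeholders, and it assumes runtime_monitor is defined in (or imported into) the model's model.py.

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    @runtime_monitor  # prints the per-batch execute() rate on the server side
    def execute(self, requests):
        responses = []
        for request in requests:
            # Placeholder: echo the first input tensor back as "OUTPUT0"
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses

Note that this only times what happens inside execute(); it cannot see serialization, network transfer, or client-side queuing, which is exactly the gap discussed above.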
Sorry for not mentioning it earlier, but I have also used the perf analyzer on the gRPC endpoint and the results are the same: it says the model can support 50 inferences per second, whereas the client gets 25.
"model can support 50 inferences per second, whereas the client gets 25"
I assume you mean the perf analyzer benchmarked the throughput at 50 infer/sec, while your client only achieved 25 infer/sec?
I think the issue is here:

results = triton_client.infer(model_name='mixformer_conv_mae',
                              inputs=self.inputs,
                              outputs=outputs)

where the next inference will wait until the previous inference is completed and returned before starting, so there will be a gRPC communication gap between inferences. One way to solve this is to use async_infer() instead of infer() to enable overlapping between inferences on the client. You can read more about async_infer() here: https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/grpc/_client.py#L1567
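A minimal sketch of what the async_infer() pattern could look like with the gRPC client is shown below; the model name comes from this thread, but the input/output names, shapes, datatype, and request count are placeholders, and the callback signature follows the tritonclient.grpc API.

# Sketch (input/output names, shapes, and counts are placeholder assumptions).
# async_infer() queues the request and invokes the callback when the response
# arrives, so the client does not block between inferences.
from functools import partial
import queue

import numpy as np
import tritonclient.grpc as grpcclient

results_queue = queue.Queue()

def callback(user_data, result, error):
    # Called by the client for each completed request.
    if error is not None:
        user_data.put(error)
    else:
        user_data.put(result)

triton_client = grpcclient.InferenceServerClient(url="localhost:8001")

num_requests = 100
for _ in range(num_requests):
    # Placeholder input: adjust name, shape, and dtype to the actual model.
    inp = grpcclient.InferInput("INPUT0", [1, 3, 224, 224], "FP32")
    inp.set_data_from_numpy(np.zeros((1, 3, 224, 224), dtype=np.float32))
    out = grpcclient.InferRequestedOutput("OUTPUT0")

    triton_client.async_infer(
        model_name="mixformer_conv_mae",
        inputs=[inp],
        callback=partial(callback, results_queue),
        outputs=[out],
    )

# Collect all responses; requests overlap instead of running strictly one-by-one.
for _ in range(num_requests):
    response = results_queue.get()

Because the requests are in flight concurrently, the gRPC round-trip time no longer sits between consecutive inferences, which is what closes the gap between the perf analyzer number and the client number.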
Thanks, I will try it and report back with the results.
Thanks @kthui for pointing out the issue. When I use async_infer() I get the same FPS :)
Description
There is almost a 50% drop in FPS during transmission on the same system.

Triton Information
What version of Triton are you using? 2.22.0
Are you using the Triton container or did you build it yourself? Triton container: nvcr.io/nvidia/tritonserver:22.05-py3

To Reproduce
I cannot share the complete code here (if required, I can create a separate repo, as the backend code is quite big), but here is the inference part.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

Expected behavior
A clear and concise description of what you expected to happen.
I have also added a runtime monitor wrapper to the TritonPythonModel class and it shows that the model completed inference at 30 FPS, but on the receiver side it shows that the model's performance is around 15 FPS.
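For reference, a minimal sketch of how the receiver-side FPS could be measured around the blocking infer() call is below; the model name comes from the thread, while the input/output names, shapes, datatype, and iteration count are placeholders.

# Sketch (assumption): timing the synchronous infer() call on the client.
# This measurement includes gRPC serialization and transfer time, which is
# why it can report roughly half of the server-side execute() rate.
import time

import numpy as np
import tritonclient.grpc as grpcclient

triton_client = grpcclient.InferenceServerClient(url="localhost:8001")

# Placeholder input/output definitions; adjust to the real model signature.
inp = grpcclient.InferInput("INPUT0", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.zeros((1, 3, 224, 224), dtype=np.float32))
outputs = [grpcclient.InferRequestedOutput("OUTPUT0")]

num_iters = 100
start = time.time()
for _ in range(num_iters):
    results = triton_client.infer(model_name='mixformer_conv_mae',
                                  inputs=[inp],
                                  outputs=outputs)
elapsed = time.time() - start
print(f'client-side FPS: {num_iters / elapsed:.1f}')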