@ArgoHA Looks like you are creating the grpcclient.InferInput with every predict call. This means a new protobuf object is created with every inference run. Can you create a single grpcclient.InferInput object and then call set_data_from_numpy for each inference run?
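For reference, a minimal sketch of that suggestion might look like the following. The model name "yolov5s", the tensor names "images"/"output0", the shape, and the FP32 datatype are placeholders and would need to match the actual config.pbtxt:

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Create the InferInput / InferRequestedOutput once, e.g. in __init__ ...
infer_input = grpcclient.InferInput("images", [1, 3, 640, 640], "FP32")
requested_output = grpcclient.InferRequestedOutput("output0")

def predict(batch: np.ndarray):
    # ... and only refresh the tensor data on each call. The array shape must
    # still match the shape declared above (or be updated via set_shape()).
    infer_input.set_data_from_numpy(batch.astype(np.float32))
    result = client.infer(
        model_name="yolov5s", inputs=[infer_input], outputs=[requested_output]
    )
    return result.as_numpy("output0")
```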
That being said, the requests still go through the gRPC endpoint, which adds message marshalling/unmarshalling and communication overhead, so single-stream performance will not match that of the standalone application.
You can feed multiple streams and Triton will effectively scale the inferences across the available model instances for better throughput.
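As a rough illustration of keeping multiple requests in flight, the async gRPC API can be used so that several model instances stay busy at once. Again, the model/tensor names and shapes below are assumptions:

```python
import queue
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
responses = queue.Queue()

def callback(result, error):
    # Invoked from a worker thread as each response arrives.
    responses.put(error if error is not None else result)

frames = [np.random.rand(1, 3, 640, 640).astype(np.float32) for _ in range(8)]
infer_input = grpcclient.InferInput("images", [1, 3, 640, 640], "FP32")

for frame in frames:
    # The request is serialized inside async_infer, so the same InferInput
    # can be refilled for the next frame while earlier requests are in flight.
    infer_input.set_data_from_numpy(frame)
    client.async_infer("yolov5s", inputs=[infer_input], callback=callback)

for _ in range(len(frames)):
    responses.get()  # each item is an InferResult or an InferenceServerException
```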
You can also query inference statistics from the server using this API: https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/grpc/__init__.py#L712
More information about inference statistics here: https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_statistics.md
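For example, with the Python gRPC client the statistics can be pulled like this (the model name is a placeholder):

```python
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
# Returns per-model counters such as queue, compute_input, compute_infer
# and compute_output times accumulated by the server.
stats = client.get_inference_statistics(model_name="yolov5s", as_json=True)
print(stats)
```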
@tanmayv25 Thanks for the answer. I tried to use time.perf_counter() to measure how long inputs.append(grpcclient.InferInput(self.input_name, [*input_images.shape], self.fp)) takes and got 0.00035297730937600136 seconds, so it doesn't really matter. I also tried creating the object in __init__ and then using set_data_from_numpy, but my fps didn't change.
Hi @ArgoHA, can you share your inference statistics here? If you have a reproducer with your current setup, it would help us reproduce the behavior and confirm whether it's a bug.
@jbkyang-nvi Do you mean something specific by a reproducer? I can try to create an ISO backup of the whole system, or I can give you the entire code of the project, assuming you have the same Triton server version installed.
Any updates? I got the same issue T T, also 2 times slower
@ArgoHA I think @jbkyang-nvi is asking for the output of the get_inference_statistics API call. As explained above, Triton clients spend some time sending tensor bytes across, and the inference statistics will help us understand which parts of the request the time is being spent in. Based on that, we can suggest ways to avoid the extra data copies that the Triton pipeline currently incurs.

One suggestion that might help here is to use shared memory to send data from the client process to the server. More on this here: https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_shared_memory.md You should see a performance improvement using shared memory. We have an example using system shared memory with the gRPC client here: https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_shm_client.py
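A condensed sketch of the system shared-memory flow from that example is below; the model/tensor names, shape, and dtype are assumptions, and the linked example additionally covers output regions and error handling:

```python
import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.shared_memory as shm

client = grpcclient.InferenceServerClient(url="localhost:8001")

input_data = np.zeros((1, 3, 640, 640), dtype=np.float32)
byte_size = input_data.nbytes

# Create a system shared-memory region, copy the tensor into it, and
# register the region with the server.
shm_handle = shm.create_shared_memory_region("input_data", "/input_shm", byte_size)
shm.set_shared_memory_region(shm_handle, [input_data])
client.register_system_shared_memory("input_data", "/input_shm", byte_size)

# Point the input at the region instead of sending the bytes over gRPC.
infer_input = grpcclient.InferInput("images", list(input_data.shape), "FP32")
infer_input.set_shared_memory("input_data", byte_size)

result = client.infer(model_name="yolov5s", inputs=[infer_input])

# Clean up once the region is no longer needed.
client.unregister_system_shared_memory("input_data")
shm.destroy_shared_memory_region(shm_handle)
```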
Closing due to inactivity.
Description

Hi! I get around 2 times fewer fps using Triton server in comparison to non-Triton inference. Here is what I've got:

1) I took the yolov5s pretrained weights from https://github.com/ultralytics/yolov5
2) I exported them to .engine (.plan)
3) I ran detect.py from the yolov5 repo and got 10 fps on a test video (on a Jetson Nano)
4) I created a yolov5_GRPC pipeline, put model.plan into the model repository, and ran Triton server. Then I ran my pipeline and got around 5 fps on the same test video.

Can anyone give me a hint, do I have a bug?

In both inferences I used fp32. I also inspected jtop, and I am pretty sure that with detect.py the GPU stays at 100% usage for longer stretches, while with Triton I more often see 0% usage in between the bursts of 100%.
Triton Information

tritonserver2.19.0-jetpack4.6.1

Are you using the Triton container or did you build it yourself? I built it myself.
To Reproduce

- main.py
- config.pbtxt
- utils.py
- yolov5_grpc.py
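Since config.pbtxt is named above but its contents are not shown, here is a hedged sketch of what a TensorRT yolov5s configuration commonly looks like; the tensor names and dims correspond to a default 640x640 yolov5s export and are assumptions, not the reporter's actual file:

```protobuf
name: "yolov5s"
platform: "tensorrt_plan"
max_batch_size: 0
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 1, 3, 640, 640 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 1, 25200, 85 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
    count: 1
  }
]
```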