triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Benchmarking VQA Model with Large Base64-Encoded Input Using perf_analyzer #7419

Open pigeonsoup opened 5 months ago

pigeonsoup commented 5 months ago

Hello,

I've been deploying my VQA (Visual Question Answering) model with Triton Server and using the perf_analyzer tool for benchmarking. However, feeding the VQA model random data leads to undefined behavior, so it's crucial to use real input data, which is challenging to construct. Below is the command I used with perf_analyzer:

perf_analyzer -m <model_name> --request-rate-range=10 --measurement-interval=30000 --string-data '{"imageBase64Str": "/9j/4AAQS...D//Z", "textPrompt": "\u8bf7\u5e2e\...\u3002"}'

The model expects a JSON-formatted string as input, with two fields: 'imageBase64Str', which contains base64-encoded image data, and 'textPrompt', which is the text input.
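For reference, constructing that JSON string input could be sketched like this (a minimal example; the image bytes and prompt below are placeholders, and in practice the image would be read from a real file):

```python
import base64
import json

# Stand-in for real JPEG bytes read from disk (placeholder, not a valid image).
image_bytes = b"\xff\xd8\xff\xe0fake-jpeg-data"
image_b64 = base64.b64encode(image_bytes).decode("ascii")

# The model expects a single JSON-formatted string with these two fields.
payload = json.dumps({
    "imageBase64Str": image_b64,
    "textPrompt": "Describe this image.",  # hypothetical prompt
})

# Round-trip check: decoding the field recovers the original image bytes.
decoded = json.loads(payload)
assert base64.b64decode(decoded["imageBase64Str"]) == image_bytes
```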

Fortunately, this method works. However, throughput is disappointingly low, with each request averaging 500 ms, even when I set --request-rate-range=10. I encountered the following warning:

[WARNING] Perf Analyzer was not able to keep up with the desired request rate. 100.00% of the requests were delayed.

I'm having difficulty benchmarking my model effectively, since it currently isn't receiving enough requests. I suspect that the large base64 payload passed via the '--string-data' option is contributing to the slowdown. Is there a faster or better way to send requests that would give me a more accurate benchmark?
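One alternative worth trying is perf_analyzer's --input-data option, which reads request payloads from a JSON file instead of the command line. A minimal sketch of generating such a file (the input tensor name "INPUT" and the base64 string are placeholders; substitute the model's actual input name and real image data):

```python
import json

# Build the JSON string the model expects as its single string input.
# Both field values here are placeholders for illustration.
request_json = json.dumps({
    "imageBase64Str": "<base64-encoded image data>",
    "textPrompt": "Describe this image.",
})

# perf_analyzer input-data file: a "data" list of per-request entries,
# keyed by the model's input tensor name ("INPUT" is an assumption).
input_data = {"data": [{"INPUT": [request_json]}]}

with open("real_data.json", "w") as f:
    json.dump(input_data, f)

# Then run, e.g.:
#   perf_analyzer -m <model_name> --request-rate-range=10 \
#       --input-data real_data.json
```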

Best regards,

pigeonsoup commented 5 months ago

Original issue: https://github.com/triton-inference-server/client/issues/736