Hi @ApoorveK, can you share more details on how to replicate the issue?
Hello @kthui, thanks for replying. I have the following async client code for sending inference requests: ensembleModelInference.txt
This async client code takes a string as input and sends inference requests to an ensemble model hosted on Triton Inference Server (the ensemble comprises 3 models in total: 2 Python backend models and 1 ONNX model for inference). For the Python backend models, we used the default template given in the doc here.
This code works for serial requests (taking about 20 ms per inference), but now we have to test whether the model can handle concurrent requests. I would appreciate your help/suggestions on making this code handle parallel requests. I also have a few follow-up doubts (quoted in the replies below).
@debadityamandal (++)
Hi @ApoorveK, for async inference with the asyncio client, you can take a look at this example of how to do it. Both the server and python_backend support parallel requests, so from what I can see, your client code will need some changes.
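As an illustration of that pattern (not the linked example itself), here is a minimal sketch of sending several requests concurrently with the asyncio gRPC client; the model name ensembleModel_bfsi, the input name "input", the BYTES datatype, and the sample sentence are assumptions taken from details that appear later in this thread:

import asyncio

import numpy as np
import tritonclient.grpc.aio as aio_grpcclient
from tritonclient.grpc import InferInput


async def infer_one(client, text):
    # BYTES (string) input named "input" with shape [1]; names/dtype are assumptions.
    data = np.array([text.encode("utf-8")], dtype=np.object_)
    inp = InferInput("input", [1], "BYTES")
    inp.set_data_from_numpy(data)
    return await client.infer(model_name="ensembleModel_bfsi", inputs=[inp])


async def main(n_concurrent=20):
    client = aio_grpcclient.InferenceServerClient(url="localhost:8001")
    try:
        texts = ["Hi, I will not pay you tomorrow"] * n_concurrent
        # asyncio.gather keeps all n requests in flight at the same time.
        results = await asyncio.gather(*(infer_one(client, t) for t in texts))
        print(f"received {len(results)} responses")
    finally:
        await client.close()


asyncio.run(main())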
I am not familiar with python_backend related performance concern, @Tabrizian can you provide some context for point 1?
For point 3, I think it is better to ask on the Model Analyzer's issue page, as they can provide more insight into the procedure.
@ApoorveK
Having two Python models with the same class name and the same methods (initialize and execute) causes performance issues.
Can you elaborate more on what is the performance issue that you are observing?
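For context on point 1: the Python backend looks up a class literally named TritonPythonModel in each model's model.py, so two Python models sharing that class name (with initialize/execute methods) is the expected layout rather than a conflict. A minimal sketch of that standard template follows; the tensor names here are hypothetical:

import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] holds this model's serialized config.pbtxt.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            # Hypothetical tensor names; use the names from your config.pbtxt.
            in_tensor = pb_utils.get_input_tensor_by_name(request, "input")
            out_tensor = pb_utils.Tensor("output", in_tensor.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        pass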
Do we have to make changes only on the client-code side, or do changes also need to be made in the Python models in the model directory, to make them handle parallel requests?
If you just want to send multiple requests to the model for profiling purposes, you can use perf_analyzer with concurrencies of larger than 1. Perf Analyzer can profile ensemble models too.
@Tabrizian thank you for responding. Actually, the issue was an incorrect implementation of the client code for Triton Model Analyzer, as we were trying to send input to the Triton server through a POST request; can you state the correct format for that?
We can run perf_analyzer for the ensemble model, but I was actually looking to use model-analyzer for its extra features.
Closing this issue due to lack of activity. If this issue needs follow-up, please let us know and we can reopen it for you.
@jbkyang-nvi @Tabrizian actually there is an issue with the current perf_analyzer when using it for ensemble models. Even with no additional parameters and only an input file, perf_analyzer shows very high latency (10000-15000 ms locally) compared to the brute-force method (i.e. calculating the time taken by the server to serve a request with Python client code locally, which is around 40-60 ms).
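For reference, the kind of brute-force timing described above could look like the following sketch; the synchronous gRPC client, model/input names, and BYTES datatype are assumptions based on details elsewhere in this thread, and a single serial request measured this way is not directly comparable to perf_analyzer running at high concurrency:

import time

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

data = np.array(["Hi, I will not pay you tomorrow".encode("utf-8")], dtype=np.object_)
inp = grpcclient.InferInput("input", [1], "BYTES")
inp.set_data_from_numpy(data)

start = time.perf_counter()
result = client.infer(model_name="ensembleModel_bfsi", inputs=[inp])
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"single-request round trip: {elapsed_ms:.1f} ms")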
@tgerdesnv @matthewkotila
Hi @ApoorveK , sorry you're running into an issue using Perf Analyzer. We've created a ticket to investigate this.
But to help us with our investigation, could you provide more complete commands for reproducing your issue?
e.g. how to download the model, put it in a proper directory structure, launch the Triton server, and the Perf Analyzer run command.
@matthewkotila I was mostly running a similar example for an ensemble model (the one given in the Triton Inference Server examples: https://github.com/triton-inference-server/python_backend/tree/main/examples/preprocessing). While using perf_analyzer on it (since model-analyzer doesn't support ensemble models for now), the results were very weird, as the latency for the model inference is too high (as I mentioned above),
with the model directory structure as follows:
├── ensembleModel_bfsi
│ ├── 1
│ ├── config.pbtxt
├── inferenceModel_bfsi
│ ├── 1
│ │ ├── model.onnx
│ │ ├── symbolic_shape_infer.py
│ ├── config.pbtxt
├── postProcessModel_bfsi
│ ├── 1
│ │ ├── artifacts
│ │ │ ├── config.json
│ │ │ ├── labels.npy
│ │ │ ├── tokenizer.pickle
│ │ ├── model.py
│ │ ├── model.py.dvc
│ │ ├── triton_python_backend_utils.py
│ ├── config.pbtxt
└── preProcessModel_bfsi
├── 1
│ ├── artifacts
│ │ ├── config.json
│ │ ├── labels.npy
│ │ ├── tokenizer.pickle
│ ├── model.py
│ ├── triton_python_backend_utils.py
├── config.pbtxt
Here the pre-process and post-process models are Python models, and the inference model is an ONNX model. Initially I was trying to run perf_analyzer for the whole ensemble model, but later switched to model-analyzer for only the inference model, since the report generation and other features of model-analyzer were needed. Following the same steps required for model-analyzer, after starting the container with this command in the main directory:
docker run -it \
--name model-analyzer-trial \
-v /var/run/docker.sock:/var/run/docker.sock \
-v $(pwd):/workspace \
--net=host model-analyzer:latest
and then inside the model-analyzer container:
=============================
=== Triton Model Analyzer ===
=============================
NVIDIA Release 22.11 (build 48581223)
Copyright (c) 2020-2021, NVIDIA CORPORATION. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
find: File system loop detected; '/usr/bin/X11' is part of the same file system loop as '/usr/bin'.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use Docker with NVIDIA Container Toolkit to start this container; see
https://github.com/NVIDIA/nvidia-docker.
root@e2e-100-17:/opt/triton-model-analyzer# cd /workspace/
root@e2e-100-17:/workspace# ls -la model_repo_triton/
total 24
drwxr-xr-x 6 root root 4096 Dec 28 06:10 .
drwxr-xr-x 9 root root 4096 Feb 7 07:01 ..
drwxr-xr-x 3 root root 4096 Dec 28 06:10 ensembleModel_bfsi
drwxr-xr-x 3 root root 4096 Dec 28 06:10 inferenceModel_bfsi
drwxr-xr-x 3 root root 4096 Dec 28 06:10 postProcessModel_bfsi
drwxr-xr-x 3 root root 4096 Dec 28 06:10 preProcessModel_bfsi
root@e2e-100-17:/workspace# cd model_repo_triton/
root@e2e-100-17:/workspace/model_repo_triton# perf_analyzer -m model_repo_triton/ensembleModel_bfsi -b 1 -u localhost:8001 -i grpc -f ensembleModel_bfsi_config_default-results.csv --verbose-csv --concurrency-range 400 --percentile 95 --input-data stressTestData.json --shape input:1,24 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0
error: unsupported input data provided stressTestData.json
Usage: perf_analyzer [options]
==== SYNOPSIS ====
--service-kind <"triton"|"tfserving"|"torchserve"|"triton_c_api">
-m <model name>
-x <model version>
--model-signature-name <model signature name>
-v
I. MEASUREMENT PARAMETERS:
--async (-a)
--sync
--measurement-interval (-p) <measurement window (in msec)>
--concurrency-range <start:end:step>
--request-rate-range <start:end:step>
--request-distribution <"poisson"|"constant">
--request-intervals <path to file containing time intervals in microseconds>
--binary-search
--num-of-sequences <number of concurrent sequences>
--latency-threshold (-l) <latency threshold (in msec)>
--max-threads <thread counts>
--stability-percentage (-s) <deviation threshold for stable measurement (in percentage)>
--max-trials (-r) <maximum number of measurements for each profiling>
--percentile <percentile>
DEPRECATED OPTIONS
-t <number of concurrent requests>
-c <maximum concurrency>
-d
II. INPUT DATA OPTIONS:
-b <batch size>
--input-data <"zero"|"random"|<path>>
--shared-memory <"system"|"cuda"|"none">
--output-shared-memory-size <size in bytes>
--shape <name:shape>
--sequence-length <length>
--sequence-id-range <start:end>
--string-length <length>
--string-data <string>
DEPRECATED OPTIONS
-z
--data-directory <path>
III. SERVER DETAILS:
-u <URL for inference service>
-i <Protocol used to communicate with inference service>
--ssl-grpc-use-ssl <bool>
--ssl-grpc-root-certifications-file <path>
--ssl-grpc-private-key-file <path>
--ssl-grpc-certificate-chain-file <path>
--ssl-https-verify-peer <number>
--ssl-https-verify-host <number>
--ssl-https-ca-certificates-file <path>
--ssl-https-client-certificate-file <path>
--ssl-https-client-certificate-type <string>
--ssl-https-private-key-file <path>
--ssl-https-private-key-type <string>
IV. OTHER OPTIONS:
-f <filename for storing report in csv format>
-H <HTTP header>
--streaming
--grpc-compression-algorithm <compression_algorithm>
--trace-file
--trace-level
--trace-rate
--trace-count
--log-frequency
--collect-metrics
--metrics-url
--metrics-interval
==== OPTIONS ====
--service-kind: Describes the kind of service perf_analyzer to generate load
for. The options are "triton", "triton_c_api", "tfserving" and
"torchserve". Default value is "triton". Note in order to use
"torchserve" backend --input-data option must point to a json file holding data
in the following format {"data" : [{"TORCHSERVE_INPUT" :
["<complete path to the content file>"]}, {...}...]}. The type of file here
will depend on the model. In order to use "triton_c_api" you must
specify the Triton server install path and the model repository path via
the --library-name and --model-repo flags
-m: This is a required argument and is used to specify the model against
which to run perf_analyzer.
-x: The version of the above model to be used. If not specified the most
recent version (that is, the highest numbered version) of the model
will be used.
--model-signature-name: The signature name of the saved model to use. Default
value is "serving_default". This option will be ignored if
--service-kind is not "tfserving".
-v: Enables verbose mode.
-v -v: Enables extra verbose mode.
I. MEASUREMENT PARAMETERS:
--async (-a): Enables asynchronous mode in perf_analyzer. By default,
perf_analyzer will use synchronous API to request inference. However, if
the model is sequential then default mode is asynchronous. Specify
--sync to operate sequential models in synchronous mode. In synchronous
mode, perf_analyzer will start threads equal to the concurrency
level. Use asynchronous mode to limit the number of threads, yet
maintain the concurrency.
--sync: Force enables synchronous mode in perf_analyzer. Can be used to
operate perf_analyzer with sequential model in synchronous mode.
--measurement-interval (-p): Indicates the time interval used for each
measurement in milliseconds. The perf analyzer will sample a time interval
specified by -p and take measurement over the requests completed
within that time interval. The default value is 5000 msec.
--measurement-mode <"time_windows"|"count_windows">: Indicates the mode used
for stabilizing measurements. "time_windows" will create windows
such that the length of each window is equal to --measurement-interval.
"count_windows" will create windows such that there are at least
--measurement-request-count requests in each window.
--measurement-request-count: Indicates the minimum number of requests to be
collected in each measurement window when "count_windows" mode is
used. This mode can be enabled using the --measurement-mode flag.
--concurrency-range <start:end:step>: Determines the range of concurrency
levels covered by the perf_analyzer. The perf_analyzer will start from
the concurrency level of 'start' and go till 'end' with a stride of
'step'. The default value of 'end' and 'step' are 1. If 'end' is not
specified then perf_analyzer will run for a single concurrency
level determined by 'start'. If 'end' is set as 0, then the concurrency
limit will be incremented by 'step' till latency threshold is met.
'end' and --latency-threshold can not be both 0 simultaneously. 'end'
can not be 0 for sequence models while using asynchronous mode.
--request-rate-range <start:end:step>: Determines the range of request rates
for load generated by analyzer. This option can take floating-point
values. The search along the request rate range is enabled only when
using this option. If not specified, then analyzer will search
along the concurrency-range. The perf_analyzer will start from the
request rate of 'start' and go till 'end' with a stride of 'step'. The
default values of 'start', 'end' and 'step' are all 1.0. If 'end' is
not specified then perf_analyzer will run for a single request rate
as determined by 'start'. If 'end' is set as 0.0, then the request
rate will be incremented by 'step' till latency threshold is met.
'end' and --latency-threshold can not be both 0 simultaneously.
--request-distribution <"poisson"|"constant">: Specifies the time interval
distribution between dispatching inference requests to the server.
Poisson distribution closely mimics the real-world work load on a
server. This option is ignored if not using --request-rate-range. By
default, this option is set to be constant.
--request-intervals: Specifies a path to a file containing time intervals in
microseconds. Each time interval should be in a new line. The
analyzer will try to maintain time intervals between successive generated
requests to be as close as possible in this file. This option can be
used to apply custom load to server with a certain pattern of
interest. The analyzer will loop around the file if the duration of
execution exceeds to that accounted for by the intervals. This option can
not be used with --request-rate-range or --concurrency-range.
--binary-search: Enables the binary search on the specified search range. This
option requires 'start' and 'end' to be explicitly specified in
the --concurrency-range or --request-rate-range. When using this
option, 'step' is more like the precision. Lower the 'step', more the
number of iterations along the search path to find suitable
convergence. By default, linear search is used.
--num-of-sequences: Sets the number of concurrent sequences for sequence
models. This option is ignored when --request-rate-range is not
specified. By default, its value is 4.
--latency-threshold (-l): Sets the limit on the observed latency. Analyzer
will terminate the concurrency search once the measured latency
exceeds this threshold. By default, latency threshold is set 0 and the
perf_analyzer will run for entire --concurrency-range.
--max-threads: Sets the maximum number of threads that will be created for
providing desired concurrency or request rate. However, when running in
synchronous mode with concurrency-range having explicit 'end'
specification, this value will be ignored. Default is 4 if
--request-rate-range is specified otherwise default is 16.
--stability-percentage (-s): Indicates the allowed variation in latency
measurements when determining if a result is stable. The measurement is
considered as stable if the ratio of max / min from the recent 3
measurements is within (stability percentage)% in terms of both infer
per second and latency. Default is 10(%).
--max-trials (-r): Indicates the maximum number of measurements for each
concurrency level visited during search. The perf analyzer will take
multiple measurements and report the measurement until it is stable.
The perf analyzer will abort if the measurement is still unstable
after the maximum number of measurements. The default value is 10.
--percentile: Indicates the confidence value as a percentile that will be
used to determine if a measurement is stable. For example, a value of
85 indicates that the 85th percentile latency will be used to
determine stability. The percentile will also be reported in the results.
The default is -1 indicating that the average latency is used to
determine stability
II. INPUT DATA OPTIONS:
-b: Batch size for each request sent.
--input-data: Select the type of data that will be used for input in
inference requests. The available options are "zero", "random", path to a
directory or a json file. If the option is path to a directory then
the directory must contain a binary/text file for each
non-string/string input respectively, named the same as the input. Each file must
contain the data required for that input for a batch-1 request. Each
binary file should contain the raw binary representation of the
input in row-major order for non-string inputs. The text file should
contain all strings needed by batch-1, each in a new line, listed in
row-major order. When pointing to a json file, user must adhere to the
format described in the Performance Analyzer documentation. By
specifying json data users can control data used with every request.
Multiple data streams can be specified for a sequence model and the
analyzer will select a data stream in a round-robin fashion for every
new sequence. Multiple json files can also be provided (--input-data
json_file1 --input-data json-file2 and so on) and the analyzer will
append data streams from each file. When using
--service-kind=torchserve make sure this option points to a json file. Default is
"random".
--shared-memory <"system"|"cuda"|"none">: Specifies the type of the shared
memory to use for input and output data. Default is none.
--output-shared-memory-size: The size in bytes of the shared memory region to
allocate per output tensor. Only needed when one or more of the
outputs are of string type and/or variable shape. The value should be
larger than the size of the largest output tensor the model is
expected to return. The analyzer will use the following formula to
calculate the total shared memory to allocate: output_shared_memory_size *
number_of_outputs * batch_size. Defaults to 100KB.
--shape: The shape used for the specified input. The argument must be
specified as 'name:shape' where the shape is a comma-separated list for
dimension sizes, for example '--shape input_name:1,2,3' indicate tensor
shape [ 1, 2, 3 ]. --shape may be specified multiple times to
specify shapes for different inputs.
--sequence-length: Indicates the base length of a sequence used for sequence
models. A sequence with length x will be composed of x requests to
be sent as the elements in the sequence. The length of the actual
sequence will be within +/- 20% of the base length.
--sequence-id-range <start:end>: Determines the range of sequence id used by
the perf_analyzer. The perf_analyzer will start from the sequence id
of 'start' and go till 'end' (excluded). If 'end' is not specified
then perf_analyzer will use new sequence id without bounds. If 'end'
is specified and the concurrency setting may result in maintaining
a number of sequences more than the range of available sequence id,
perf analyzer will exit with error due to possible sequence id
collision. The default setting is start from sequence id 1 and without
bounds
--string-length: Specifies the length of the random strings to be generated
by the analyzer for string input. This option is ignored if
--input-data points to a directory. Default is 128.
--string-data: If provided, analyzer will use this string to initialize
string input buffers. The perf analyzer will replicate the given string
to build tensors of required shape. --string-length will not have any
effect. This option is ignored if --input-data points to a
directory.
III. SERVER DETAILS:
-u: Specify URL to the server. When using triton default is "localhost:8000" if using HTTP and
"localhost:8001" if using gRPC. When using tfserving default is
"localhost:8500".
-i: The communication protocol to use. The available protocols are gRPC and HTTP. Default is HTTP.
--ssl-grpc-use-ssl: Bool (true|false) for whether to use encrypted channel to the server. Default false.
--ssl-grpc-root-certifications-file: Path to file containing the PEM encoding of the server root certificates.
--ssl-grpc-private-key-file: Path to file containing the PEM encoding of the client's private key.
--ssl-grpc-certificate-chain-file: Path to file containing the PEM encoding of the client's certificate chain.
--ssl-https-verify-peer: Number (0|1) to verify the peer's SSL certificate. See
https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html for the meaning of each value. Default is 1.
--ssl-https-verify-host: Number (0|1|2) to verify the certificate's name against host. See
https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYHOST.html for the meaning of each value. Default is 2.
--ssl-https-ca-certificates-file: Path to Certificate Authority (CA) bundle.
--ssl-https-client-certificate-file: Path to the SSL client certificate.
--ssl-https-client-certificate-type: Type (PEM|DER) of the client SSL certificate. Default is PEM.
--ssl-https-private-key-file: Path to the private keyfile for TLS and SSL client cert.
--ssl-https-private-key-type: Type (PEM|DER) of the private key file. Default is PEM.
IV. OTHER OPTIONS:
-f: The latency report will be stored in the file named by this option.
By default, the result is not recorded in a file.
-H: The header will be added to HTTP requests (ignored for GRPC
requests). The header must be specified as 'Header:Value'. -H may be
specified multiple times to add multiple headers.
--streaming: Enables the use of streaming API. This flag is only valid with
gRPC protocol. By default, it is set false.
--grpc-compression-algorithm: The compression algorithm to be used by gRPC
when sending request. Only supported when grpc protocol is being used.
The supported values are none, gzip, and deflate. Default value is
none.
--trace-file: Set the file where trace output will be saved. If
--trace-log-frequency is also specified, this argument value will be the prefix
of the files to save the trace output. See --trace-log-frequency for
details. Only used for service-kind of triton. Default value is
none.
--trace-level: Specify a trace level. OFF to disable tracing, TIMESTAMPS to
trace timestamps, TENSORS to trace tensors. It may be specified
multiple times to trace multiple kinds of information. Default is OFF.
--trace-rate: Set the trace sampling rate. Default is 1000.
--trace-count: Set the number of traces to be sampled. If the value is -1,
the number of traces to be sampled will not be limited. Default is -1.
--log-frequency: Set the trace log frequency. If the value is 0, Triton will
only log the trace output to <trace-file> when shutting down.
Otherwise, Triton will log the trace output to <trace-file>.<idx> when it
collects the specified number of traces. For example, if the log
frequency is 100, when Triton collects the 100-th trace, it logs the
traces to file <trace-file>.0, and when it collects the 200-th trace,
it logs the 101-th to the 200-th traces to file <trace-file>.1.
Default is 0.
--triton-server-directory: The Triton server install path. Required by and
only used when C API is used (--service-kind=triton_c_api).
eg:--triton-server-directory=/opt/tritonserver.
--model-repository: The model repository of which the model is loaded.
Required by and only used when C API is used
(--service-kind=triton_c_api). eg:--model-repository=/tmp/host/docker-data/model_unit_test.
--verbose-csv: The csv files generated by perf analyzer will include
additional information.
--collect-metrics: Enables collection of server-side inference server
metrics. Outputs metrics in the csv file generated with the -f option. Must
enable `--verbose-csv` option to use the `--collect-metrics`.
--metrics-url: The URL to query for server-side inference server metrics.
Default is 'localhost:8002/metrics'.
--metrics-interval: How often in milliseconds, within each measurement
window, to query for server-side inference server metrics. Default is
1000.
and the supplied data here is:
{
  "data": [
    {
      "input": "Hi, I will not pay you tomorrow",
      "shape": [1]
    }
  ]
}
Here the data flows as: preProcessModel -> inferenceModel -> postProcessModel.
Hi @ApoorveK, can you be more specific with the exact commands you ran to download/generate the model repository you've referenced above? How did you get the ensembleModel_bfsi, inferenceModel_bfsi, postProcessModel_bfsi, and preProcessModel_bfsi models?
And can you also show the exact PA output that seemed incorrect?
And can you also give specific commands for how you ran the brute force method (i.e. calculating the time taken by the server to serve a request) mentioned above?
Hello @matthewkotila, thanks for replying. These 3 models (inferenceModel_bfsi, postProcessModel_bfsi, and preProcessModel_bfsi) are specific use-case models that our organisation is using to try out and test deployment on Triton Inference Server. The input to preProcessModel_bfsi (Python model) is a string sentence, which is processed and passed to inferenceModel_bfsi (ONNX model); that gives a dictionary as output, which is then post-processed by postProcessModel_bfsi (Python model) to produce the final result.
The model-analyzer container was started with stressTestData.json containing the input data as a string.
Also, what exactly I did (the commands I ran) is pasted above along with the output. So the only help I need here is the correct format for the supplied JSON file for model-analyzer, plus any other potential issues.
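As a sketch of what a Perf Analyzer-compatible input file might look like (based on the input-data layout described in the Perf Analyzer documentation, with the tensor name "input" and string content taken from the data shown above; the exact layout should be verified against the docs for your version), stressTestData.json could be generated like this:

import json

# Sketch only: one request entry mapping the input tensor name to a content/shape pair,
# following the JSON layout described in the Perf Analyzer input-data docs.
payload = {
    "data": [
        {
            "input": {
                "content": ["Hi, I will not pay you tomorrow"],
                "shape": [1],
            }
        }
    ]
}

with open("stressTestData.json", "w") as f:
    json.dump(payload, f, indent=2)

Also make sure the path given to --input-data resolves from the directory where perf_analyzer is run (above it was launched from /workspace/model_repo_triton).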
Hi @ApoorveK, thanks for bringing up the potential Perf Analyzer issue. After thorough investigation, I've determined that this is expected behavior.
The reason you saw such high latency is that it includes the amount of time requests were waiting in the Triton server queue. Why were there so many requests waiting in the queue? Because you ran Perf Analyzer with a concurrency of 400. That means roughly 399 requests were waiting in a queue while one was being run through the model at a time, so latency will be quite high. Consider the individual latency of the 400th request: it was sent and entered the queue at roughly the same time as the first 399 (since Perf Analyzer used 400 concurrent threads to send their first requests), but had to wait for the first 399 to go through the model, one at a time. The end-to-end latency for that 400th request is therefore quite high, and this is essentially why the average latency is so high.
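As a rough back-of-envelope illustration of that queueing effect (using the ~40 ms serial latency reported earlier in the thread as an assumed per-request cost, and assuming requests are processed one at a time):

per_request_ms = 40      # assumed serial latency from earlier in the thread
concurrency = 400        # --concurrency-range 400

# If one request is processed at a time, the i-th queued request waits for ~(i - 1)
# requests ahead of it, so across the first wave of 400 requests:
worst_case_ms = concurrency * per_request_ms              # ~16,000 ms
average_ms = (concurrency + 1) * per_request_ms / 2       # ~8,000 ms
print(worst_case_ms, average_ms)

This is in the same ballpark as the 10000-15000 ms average latency reported above.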
Let me know if you have any follow-up questions. Closing this issue.
@matthewkotila actually perf_analyzer is not starting due to some syntax issue (the one I have shown in the logs above), so can you suggest the correct command?
It looks like your stressTestData.json wasn't in the current directory.
Description: After deploying our ML model on Triton server, I wanted to do stress testing of the model by sending parallel requests, but I got only one response from the Triton server.
Triton Information: 22.08. I am using the Triton container.
To Reproduce: Write multi-processing code to send multiple parallel requests for inference.
Model framework: TensorFlow. Model type: Ensemble. Number of models: 3. Sequence of models: preprocessing, inference, post-processing.
Expected behaviour: For 'n' parallel requests, I should receive 'n' responses from the server, which I am not getting right now, and I should be able to analyse the model's performance and response time without using Triton's Model Analyzer (since Model Analyzer doesn't support ensemble models for now).
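One way to exercise the "n parallel requests, n responses" behaviour without Model Analyzer is to drive the synchronous client from a thread pool. The sketch below assumes gRPC on localhost:8001 and reuses the ensemble/input/datatype details from earlier in this thread, so the names would need to be adjusted for the TensorFlow ensemble described here:

from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.grpc as grpcclient


def infer_once(text):
    # One client per thread keeps the sketch simple; a shared client also works.
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    data = np.array([text.encode("utf-8")], dtype=np.object_)
    inp = grpcclient.InferInput("input", [1], "BYTES")
    inp.set_data_from_numpy(data)
    return client.infer(model_name="ensembleModel_bfsi", inputs=[inp])


n = 20
with ThreadPoolExecutor(max_workers=n) as pool:
    results = list(pool.map(infer_once, ["Hi, I will not pay you tomorrow"] * n))
print(f"received {len(results)} responses for {n} requests")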