triton-inference-server / model_analyzer

Triton Model Analyzer is a CLI tool that helps you better understand the compute and memory requirements of Triton Inference Server models.
Apache License 2.0

How to load a pytorch model with model analyzer #699

Open benhgm opened 1 year ago

benhgm commented 1 year ago

Hi, I am trying to use Model Analyzer to analyze an ensemble model that contains two Python models and one ONNX model. The Python models use PyTorch to perform some preprocessing and postprocessing functions.

However, when I use the following command, I get a `ModuleNotFoundError: No module named 'torch'` error:

```
model-analyzer profile \
    --model-repository=/model_repository \
    --profile-models=ensemble_model --triton-launch-mode=docker \
    --triton-http-endpoint=localhost:8000 --triton-grpc-endpoint=localhost:8003 --triton-metrics-url=localhost:8002 \
    --output-model-repository-path=/model_analyzer_outputs/ \
    --override-output-model-repository \
    --run-config-search-mode quick \
    --triton-output-path triton_log.txt \
    --triton-docker-image devel
```

How do I make sure that the Docker container spun up by Model Analyzer has PyTorch installed?

tgerdesnv commented 1 year ago

Hi @benhgm, by default when Model Analyzer is run with `--triton-launch-mode=docker`, the container spun up will be the matching xx.yy-py3 Triton Server image from NVIDIA NGC. It looks like you have supplied a custom Docker image called `devel` that will be used instead. What does that container contain? That image needs to have the Triton Server executable plus anything else needed to run the model. If you need something special in the container, the easiest way is to build off of the NGC container. If you don't need anything special, you can omit the `--triton-docker-image` option.
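
Something like this (an untested sketch; the base image tag and the pip packages are just placeholders for whatever your models actually need):

```bash
# Extend the NGC Triton Server image so the Python backend models can
# "import torch"; the resulting tag matches the "devel" name used above.
cat > Dockerfile.devel <<'EOF'
FROM nvcr.io/nvidia/tritonserver:23.02-py3
RUN pip install torch
EOF
docker build -f Dockerfile.devel -t devel .
```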

benhgm commented 1 year ago

Hi @tgerdesnv thanks for the tip!

To clarify, the command that gives the `ModuleNotFoundError` does not include the `--triton-docker-image devel` flag; I had incorrectly included it above.

To provide some context, `devel` is a container image that I built to serve my models with Triton Server. It has all the dependencies that I need for my models to work. However, when I run the command with `--triton-docker-image devel`, I get the error message:

```
docker.errors.ImageNotFound: 404 Client Error for http+docker://localhost/v1.41/images/create?tag=latest&fromImage=devel: Not Found ("pull access denied for devel, repository does not exist or may require 'docker login': denied: requested access to the resource is denied")
```

Maybe my experience with docker is still not so good, but here are some questions I have:

  1. When I pass a value to the `--triton-docker-image` flag, do I provide the image ID, the image repository name, or the image tag?
  2. When I run `docker images` inside the tritonserver container, started with `docker run -it --gpus all -v /var/run/docker.sock:/var/run/docker.sock --net=host nvcr.io/nvidia/tritonserver:23.02-py3-sdk` (including mounting all the volumes I need), I get a `bash: docker: command not found` error. How do I then pass an existing Docker image to the `--triton-docker-image` flag?

Thanks for your time and help, greatly appreciate it.

nv-braf commented 1 year ago

Support for custom local Docker images was not added until the 23.03 release. Can you try running on that (or a newer) version and let me know if you are still seeing an issue? Thanks.
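
For reference, a sketch of that setup (the host paths are placeholders): run Model Analyzer from the 23.03 SDK container with the Docker socket mounted, so that `--triton-launch-mode=docker` can start a locally built image.

```bash
# Start the 23.03 SDK container (which includes model-analyzer) and mount the
# Docker socket so it can launch the Triton Server container itself.
docker run -it --gpus all --net=host \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /path/to/model_repository:/model_repository \
    nvcr.io/nvidia/tritonserver:23.03-py3-sdk
```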

benhgm commented 1 year ago

Hi @nv-braf, I have changed to the 23.03 release. When I start up the local Docker image instance, I get an error in my Triton log file: `tritonserver: unrecognized option '--metrics-interval-ms=1000'`. I did not pass that flag anywhere in my local Docker instance, so I'm not sure how it got there. Note that my local Docker image is 21.08.

```
NVIDIA Release 21.08 (build 26170506)

Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

tritonserver: unrecognized option '--metrics-interval-ms=1000'
Usage: tritonserver [options]
  --help
    Print usage
  --log-verbose <integer>
    Set verbose logging level. Zero (0) disables verbose logging
    and values >= 1 enable verbose logging.
  --log-info <boolean>
    Enable/disable info-level logging.
  --log-warning <boolean>
    Enable/disable warning-level logging.
  --log-error <boolean>
    Enable/disable error-level logging.
  --id <string>
    Identifier for this server.
  --model-store <string>
    Equivalent to --model-repository.
  --model-repository <string>
    Path to model repository directory. It may be specified
    multiple times to add multiple model repositories. Note that if a model
    is not unique across all model repositories at any time, the model
    will not be available.
  --exit-on-error <boolean>
    Exit the inference server if an error occurs during
    initialization.
  --strict-model-config <boolean>
    If true model configuration files must be provided and all
    required configuration settings must be specified. If false the model
    configuration may be absent or only partially specified and the
    server will attempt to derive the missing required configuration.
  --strict-readiness <boolean>
    If true /v2/health/ready endpoint indicates ready if the
    server is responsive and all models are available. If false
    /v2/health/ready endpoint indicates ready if server is responsive even if
    some/all models are unavailable.
  --allow-http <boolean>
    Allow the server to listen for HTTP requests.
  --http-port <integer>
    The port for the server to listen on for HTTP requests.
  --http-thread-count <integer>
    Number of threads handling HTTP requests.
  --allow-grpc <boolean>
    Allow the server to listen for GRPC requests.
  --grpc-port <integer>
    The port for the server to listen on for GRPC requests.
  --grpc-infer-allocation-pool-size <integer>
    The maximum number of inference request/response objects
    that remain allocated for reuse. As long as the number of in-flight
    requests doesn't exceed this value there will be no
    allocation/deallocation of request/response objects.
  --grpc-use-ssl <boolean>
    Use SSL authentication for GRPC requests. Default is false.
  --grpc-use-ssl-mutual <boolean>
    Use mututal SSL authentication for GRPC requests. Default is
    false.
  --grpc-server-cert <string>
    File holding PEM-encoded server certificate. Ignored unless
    --grpc-use-ssl is true.
  --grpc-server-key <string>
    File holding PEM-encoded server key. Ignored unless
    --grpc-use-ssl is true.
  --grpc-root-cert <string>
    File holding PEM-encoded root certificate. Ignore unless
    --grpc-use-ssl is false.
  --grpc-infer-response-compression-level <string>
    The compression level to be used while returning the infer
    response to the peer. Allowed values are none, low, medium and high.
    By default, compression level is selected as none.
  --grpc-keepalive-time <integer>
    The period (in milliseconds) after which a keepalive ping is
    sent on the transport. Default is 7200000 (2 hours).
  --grpc-keepalive-timeout <integer>
    The period (in milliseconds) the sender of the keepalive
    ping waits for an acknowledgement. If it does not receive an
    acknowledgment within this time, it will close the connection. Default is
    20000 (20 seconds).
  --grpc-keepalive-permit-without-calls <boolean>
    Allows keepalive pings to be sent even if there are no calls
    in flight (0 : false; 1 : true). Default is 0 (false).
  --grpc-http2-max-pings-without-data <integer>
    The maximum number of pings that can be sent when there is
    no data/header frame to be sent. gRPC Core will not continue sending
    pings if we run over the limit. Setting it to 0 allows sending pings
    without such a restriction. Default is 2.
  --grpc-http2-min-recv-ping-interval-without-data <integer>
    If there are no data/header frames being sent on the
    transport, this channel argument on the server side controls the minimum
    time (in milliseconds) that gRPC Core would expect between receiving
    successive pings. If the time between successive pings is less than
    this time, then the ping will be considered a bad ping from the peer.
    Such a ping counts as a ‘ping strike’. Default is 300000 (5
    minutes).
  --grpc-http2-max-ping-strikes <integer>
    Maximum number of bad pings that the server will tolerate
    before sending an HTTP2 GOAWAY frame and closing the transport.
    Setting it to 0 allows the server to accept any number of bad pings.
    Default is 2.
  --allow-sagemaker <boolean>
    Allow the server to listen for Sagemaker requests. Default
    is false.
  --sagemaker-port <integer>
    The port for the server to listen on for Sagemaker requests.
    Default is 8080.
  --sagemaker-safe-port-range <<integer>-<integer>>
    Set the allowed port range for endpoints other than the
    SageMaker endpoints.
  --sagemaker-thread-count <integer>
    Number of threads handling Sagemaker requests. Default is 8.
  --allow-metrics <boolean>
    Allow the server to provide prometheus metrics.
  --allow-gpu-metrics <boolean>
    Allow the server to provide GPU metrics. Ignored unless
    --allow-metrics is true.
  --metrics-port <integer>
    The port reporting prometheus metrics.
  --trace-file <string>
    Set the file where trace output will be saved.
  --trace-level <string>
    Set the trace level. OFF to disable tracing, MIN for minimal
    tracing, MAX for maximal tracing. Default is OFF.
  --trace-rate <integer>
    Set the trace sampling rate. Default is 1000.
  --model-control-mode <string>
    Specify the mode for model management. Options are "none",
    "poll" and "explicit". The default is "none". For "none", the server
    will load all models in the model repository(s) at startup and will
    not make any changes to the load models after that. For "poll", the
    server will poll the model repository(s) to detect changes and will
    load/unload models based on those changes. The poll rate is
    controlled by 'repository-poll-secs'. For "explicit", model load and unload
    is initiated by using the model control APIs, and only models
    specified with --load-model will be loaded at startup.
  --repository-poll-secs <integer>
    Interval in seconds between each poll of the model
    repository to check for changes. Valid only when --model-control-mode=poll is
    specified.
  --load-model <string>
    Name of the model to be loaded on server startup. It may be
    specified multiple times to add multiple models. Note that this
    option will only take affect if --model-control-mode=explicit is true.
  --pinned-memory-pool-byte-size <integer>
    The total byte size that can be allocated as pinned system
    memory. If GPU support is enabled, the server will allocate pinned
    system memory to accelerate data transfer between host and devices
    until it exceeds the specified byte size. If 'numa-node' is configured
    via --host-policy, the pinned system memory of the pool size will be
    allocated on each numa node. This option will not affect the
    allocation conducted by the backend frameworks. Default is 256 MB.
  --cuda-memory-pool-byte-size <<integer>:<integer>>
    The total byte size that can be allocated as CUDA memory for
    the GPU device. If GPU support is enabled, the server will allocate
    CUDA memory to minimize data transfer between host and devices
    until it exceeds the specified byte size. This option will not affect
    the allocation conducted by the backend frameworks. The argument
    should be 2 integers separated by colons in the format <GPU device
    ID>:<pool byte size>. This option can be used multiple times, but only
    once per GPU device. Subsequent uses will overwrite previous uses for
    the same GPU device. Default is 64 MB.
```

tgerdesnv commented 1 year ago

@benhgm Are you able to move to a newer version of triton server, ideally 23.03 to match your SDK version (or move both to the latest 23.05 release)? As you have observed, using different versions between the two can cause incompatibilities.
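
Something along these lines (a sketch reusing the flags from your original command; paths are placeholders):

```bash
# Pull a server image from the same release as the SDK container and pass it
# explicitly, so Model Analyzer does not launch an older server that lacks
# newer flags such as --metrics-interval-ms.
docker pull nvcr.io/nvidia/tritonserver:23.03-py3
model-analyzer profile \
    --model-repository=/model_repository \
    --profile-models=ensemble_model \
    --triton-launch-mode=docker \
    --triton-docker-image nvcr.io/nvidia/tritonserver:23.03-py3 \
    --output-model-repository-path=/model_analyzer_outputs/ \
    --override-output-model-repository
```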

benhgm commented 1 year ago

@tgerdesnv thanks for the advice! I tried that and was able to run a full analysis on my ensemble model. I got some very nice results and a report, but there is now one small error/glitch: the model analyzer repeatedly reported `No GPU metric corresponding to tag 'gpu_used_memory' found in the model's measurement. Possibly comparing measurements across devices.`

From the message, I guess this is because I ran the analysis over a multi-GPU instance, and if I set the --gpus flag to a specific GPU UUID, I will be able to get these metrics. I will try it out and update if I face the same error.

Otherwise, how can I enable GPU metrics reporting even on a multi-GPU instance?

nv-braf commented 1 year ago

This warning occurs when a measurement returning from Perf Analyzer does not contain a GPU metric, in this case, the amount of memory used by the GPU, when it was expected. Yes, please try to specify the GPU you want to profile on with the --gpus flag and let me know if this doesn't remove the warning.
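
For reference, something like this (the UUID is a placeholder; `nvidia-smi -L` prints the real ones):

```bash
# List the GPU UUIDs on the machine, then restrict Model Analyzer to one of them.
nvidia-smi -L
model-analyzer profile \
    --model-repository=/model_repository \
    --profile-models=ensemble_model \
    --triton-launch-mode=docker \
    --gpus GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```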

benhgm commented 1 year ago

@nv-braf hello, I tried by setting --gpus to the UUID of the GPU I want to use, and model analysis began with the correct GPU that I specified. However, at the end of the analysis, I got the same error messages. Are there any other workarounds I can try?

nv-braf commented 1 year ago

Are measurements being taken? Are the charts/data being outputted correctly at the end of profile? If so, then it's probably safe to ignore this warning message.

tgerdesnv commented 1 year ago

Have you specified CPU_ONLY anywhere in the original configuration? Do the resulting output model configurations have KIND_CPU or KIND_GPU under instance_group?

My concern if you are getting no GPU metrics is that nothing is actually running on the GPU.
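
One quick way to check (a sketch, assuming the output repository path used earlier in the thread) is to look at the `instance_group` blocks in the config variants Model Analyzer generated:

```bash
# Show the instance_group kind (KIND_CPU vs KIND_GPU) for every generated
# model config variant in the output model repository.
grep -R -A 3 "instance_group" /model_analyzer_outputs/
```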

benhgm commented 1 year ago

Hi @tgerdesnv, you make a good point. I realised that although I had set KIND_GPU for all my models, in my pre- and post-processing models I did not explicitly move the models to the GPU with `.to(torch.device("cuda"))`.

However, my main inference model (a CNN) has always been set to run on the GPU, so I am puzzled as to why no GPU metrics were recorded for it.

tgerdesnv commented 1 year ago

Can you answer @nv-braf 's question?

> Are measurements being taken? Are the charts/data being outputted correctly at the end of profile? If so, then it's probably safe to ignore this warning message.

Those warnings may show up if any of the models are running entirely on the CPU.

benhgm commented 1 year ago

Hi @tgerdesnv @nv-braf, my apologies, I missed the other question.

Yes, I am getting latency and throughput measurements; those are fine. I was just wondering how to make the GPU metrics appear.

@tgerdesnv I understand what you mean. However, in an ensemble model where I have a pipeline of preprocessing model -> CNN -> postprocessing model, and only the CNN is on the GPU, should I expect GPU metrics to be recorded from the CNN even though the pre- and post-processing models are on the CPU?

nv-braf commented 1 year ago

As long as you have not set the `cpu_only` flag, I would expect the composing config to gather GPU metrics, and they should be shown in the summary report. Can you confirm that you are not seeing any GPU metrics (like GPU utilization or GPU memory usage) in the summary report table?

benhgm commented 1 year ago

Hi @nv-braf, yes, I confirm that I did not use the `cpu_only` flag and I do not see any GPU metrics in the summary report.

riyajatar37003 commented 5 months ago

I am running the examples/add_sub model with a local model repository and CPU instances, but I am getting the following error log in the Docker container:

```
root@cfbe7ff7cf1e:/app/ma# model-analyzer profile \
    --model-repository /app/ma/examples/quick-start \
    --profile-models add_sub \
    --output-model-repository-path /app/ma/output11 \
    --export-path profile_results --triton-launch-mode=local
[Model Analyzer] Starting a local Triton Server
[Model Analyzer] Loaded checkpoint from file /app/ma/checkpoints/0.ckpt
[Model Analyzer] GPU devices match checkpoint - skipping server metric acquisition
[Model Analyzer] Starting a local Triton Server
[Model Analyzer] Model add_sub load failed: [StatusCode.INTERNAL] failed to load 'add_sub', failed to poll from model repository
[Model Analyzer] Model readiness failed for model add_sub. Error [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8001: Failed to connect to remote host: Connection refused
[Model Analyzer] Model readiness failed for model add_sub. Error [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: ipv6:%5B::1%5D:8001: Failed to connect to remote host: Connection refused
[Model Analyzer] Saved checkpoint to /app/ma/checkpoints/1.ckpt
Traceback (most recent call last):
  File "/opt/app_venv/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/entrypoint.py", line 278, in main
    analyzer.profile(
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/analyzer.py", line 124, in profile
    self._profile_models()
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/analyzer.py", line 242, in _profile_models
    self._model_manager.run_models(models=[model])
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/model_manager.py", line 118, in run_models
    self._check_for_ensemble_model_incompatibility(models)
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/model_manager.py", line 189, in _check_for_ensemble_model_incompatibility
    model_config = ModelConfig.create_from_profile_spec(
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/triton/model/model_config.py", line 270, in create_from_profile_spec
    model_config_dict = ModelConfig.create_model_config_dict(
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/triton/model/model_config.py", line 92, in create_model_config_dict
    config = ModelConfig._get_default_config_from_server(
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/triton/model/model_config.py", line 149, in _get_default_config_from_server
    config = client.get_model_config(model_name, config.client_max_retries)
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/triton/client/grpc_client.py", line 79, in get_model_config
    model_config_dict = self._client.get_model_config(model_name, as_json=True)
  File "/opt/app_venv/lib/python3.10/site-packages/tritonclient/grpc/_client.py", line 593, in get_model_config
    raise_error_grpc(rpc_error)
  File "/opt/app_venv/lib/python3.10/site-packages/tritonclient/grpc/_utils.py", line 77, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: ipv6:%5B::1%5D:8001: Failed to connect to remote host: Connection refused
```