benhgm opened this issue 1 year ago
Hi @benhgm, by default when Model Analyzer is run with --triton-launch-mode=docker, the docker container spun up will be the matching xx.yy-py3 Triton Server image from NVIDIA NGC. It looks like you have supplied a custom docker image called devel that will be used instead. What does that container contain? That image needs to have the Triton Server executable plus anything else needed to run the model. If you need something special in the container, the easiest way is to build off of the NGC container. If you don't need anything special, then you can omit the --triton-docker-image option.
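For reference, a minimal sketch of building off the NGC container, assuming the 23.03 base image and that torch is the only extra dependency needed (the tag my-tritonserver:23.03-torch is hypothetical):

# Build a custom image on top of the NGC Triton Server base so that
# python-backend models can import torch inside the launched container.
cat > Dockerfile.custom <<'EOF'
FROM nvcr.io/nvidia/tritonserver:23.03-py3
RUN pip install torch
EOF
docker build -f Dockerfile.custom -t my-tritonserver:23.03-torch .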
Hi @tgerdesnv thanks for the tip!
To clarify, the command that gives the 'ModuleNotFoundError' error does not include the --triton-docker-image devel flag. I had incorrectly included it.
To provide some context, devel is a container that is built to serve my models in tritonserver. It has all the dependencies that I need for my models to work. However, when I run the command with --triton-docker-image devel, I get the error message docker.errors.ImageNotFound: 404 Client Error for http+docker://localhost/v1.41/images/create?tag=latest&fromImage=devel: Not Found ("pull access denied for devel, repository does not exist or may require 'docker login': denied: requested access to the resource is denied").
Maybe my experience with docker is still not so good, but here are some questions I have:
1. For --triton-docker-image, do I provide the Image ID, the Image Repository Name, or the Image Tag?
2. When I run docker images inside the tritonserver container (started with docker run -it --gpus all -v /var/run/docker.sock:/var/run/docker.sock --net=host nvcr.io/nvidia/tritonserver:23.02-py3-sdk, including mounting all the volumes I need), I get a bash: docker: command not found error. How do I then pass an existing docker image into the --triton-docker-image flag?
Thanks for your time and help, greatly appreciate it.
Support for custom local docker images was not added until the 23.03 release. Can you try running on that (or a newer version) and let me know if you are still seeing an issue? Thanks.
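With 23.03 or newer, the local image can be referenced the same way you would reference it with docker run, typically repository:tag. A sketch, reusing the hypothetical my-tritonserver:23.03-torch tag from above and the paths from the original command:

model-analyzer profile \
  --model-repository=/model_repository \
  --profile-models=ensemble_model \
  --triton-launch-mode=docker \
  --output-model-repository-path=/model_analyzer_outputs/ \
  --override-output-model-repository \
  --triton-docker-image my-tritonserver:23.03-torch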
Hi @nv-braf I have changed to use the 23.03 release. When I start up the local docker image instance, I get an error in my triton log file: tritonserver: unrecognized option '--metrics-interval-ms=1000'. I did not pass that flag anywhere in my local docker instance, hence I'm not sure how it got there. Note that my local docker instance is 21.08.
NVIDIA Release 21.08 (build 26170506)
Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
tritonserver: unrecognized option '--metrics-interval-ms=1000'
Usage: tritonserver [options]
--help
Print usage
--log-verbose <integer>
Set verbose logging level. Zero (0) disables verbose logging
and values >= 1 enable verbose logging.
--log-info <boolean>
Enable/disable info-level logging.
--log-warning <boolean>
Enable/disable warning-level logging.
--log-error <boolean>
Enable/disable error-level logging.
--id <string>
Identifier for this server.
--model-store <string>
Equivalent to --model-repository.
--model-repository <string>
Path to model repository directory. It may be specified
multiple times to add multiple model repositories. Note that if a model
is not unique across all model repositories at any time, the model
will not be available.
--exit-on-error <boolean>
Exit the inference server if an error occurs during
initialization.
--strict-model-config <boolean>
If true model configuration files must be provided and all
required configuration settings must be specified. If false the model
configuration may be absent or only partially specified and the
server will attempt to derive the missing required configuration.
--strict-readiness <boolean>
If true /v2/health/ready endpoint indicates ready if the
server is responsive and all models are available. If false
/v2/health/ready endpoint indicates ready if server is responsive even if
some/all models are unavailable.
--allow-http <boolean>
Allow the server to listen for HTTP requests.
--http-port <integer>
The port for the server to listen on for HTTP requests.
--http-thread-count <integer>
Number of threads handling HTTP requests.
--allow-grpc <boolean>
Allow the server to listen for GRPC requests.
--grpc-port <integer>
The port for the server to listen on for GRPC requests.
--grpc-infer-allocation-pool-size <integer>
The maximum number of inference request/response objects
that remain allocated for reuse. As long as the number of in-flight
requests doesn't exceed this value there will be no
allocation/deallocation of request/response objects.
--grpc-use-ssl <boolean>
Use SSL authentication for GRPC requests. Default is false.
--grpc-use-ssl-mutual <boolean>
Use mututal SSL authentication for GRPC requests. Default is
false.
--grpc-server-cert <string>
File holding PEM-encoded server certificate. Ignored unless
--grpc-use-ssl is true.
--grpc-server-key <string>
File holding PEM-encoded server key. Ignored unless
--grpc-use-ssl is true.
--grpc-root-cert <string>
File holding PEM-encoded root certificate. Ignore unless
--grpc-use-ssl is false.
--grpc-infer-response-compression-level <string>
The compression level to be used while returning the infer
response to the peer. Allowed values are none, low, medium and high.
By default, compression level is selected as none.
--grpc-keepalive-time <integer>
The period (in milliseconds) after which a keepalive ping is
sent on the transport. Default is 7200000 (2 hours).
--grpc-keepalive-timeout <integer>
The period (in milliseconds) the sender of the keepalive
ping waits for an acknowledgement. If it does not receive an
acknowledgment within this time, it will close the connection. Default is
20000 (20 seconds).
--grpc-keepalive-permit-without-calls <boolean>
Allows keepalive pings to be sent even if there are no calls
in flight (0 : false; 1 : true). Default is 0 (false).
--grpc-http2-max-pings-without-data <integer>
The maximum number of pings that can be sent when there is
no data/header frame to be sent. gRPC Core will not continue sending
pings if we run over the limit. Setting it to 0 allows sending pings
without such a restriction. Default is 2.
--grpc-http2-min-recv-ping-interval-without-data <integer>
If there are no data/header frames being sent on the
transport, this channel argument on the server side controls the minimum
time (in milliseconds) that gRPC Core would expect between receiving
successive pings. If the time between successive pings is less than
this time, then the ping will be considered a bad ping from the peer.
Such a ping counts as a ‘ping strike’. Default is 300000 (5
minutes).
--grpc-http2-max-ping-strikes <integer>
Maximum number of bad pings that the server will tolerate
before sending an HTTP2 GOAWAY frame and closing the transport.
Setting it to 0 allows the server to accept any number of bad pings.
Default is 2.
--allow-sagemaker <boolean>
Allow the server to listen for Sagemaker requests. Default
is false.
--sagemaker-port <integer>
The port for the server to listen on for Sagemaker requests.
Default is 8080.
--sagemaker-safe-port-range <<integer>-<integer>>
Set the allowed port range for endpoints other than the
SageMaker endpoints.
--sagemaker-thread-count <integer>
Number of threads handling Sagemaker requests. Default is 8.
--allow-metrics <boolean>
Allow the server to provide prometheus metrics.
--allow-gpu-metrics <boolean>
Allow the server to provide GPU metrics. Ignored unless
--allow-metrics is true.
--metrics-port <integer>
The port reporting prometheus metrics.
--trace-file <string>
Set the file where trace output will be saved.
--trace-level <string>
Set the trace level. OFF to disable tracing, MIN for minimal
tracing, MAX for maximal tracing. Default is OFF.
--trace-rate <integer>
Set the trace sampling rate. Default is 1000.
--model-control-mode <string>
Specify the mode for model management. Options are "none",
"poll" and "explicit". The default is "none". For "none", the server
will load all models in the model repository(s) at startup and will
not make any changes to the load models after that. For "poll", the
server will poll the model repository(s) to detect changes and will
load/unload models based on those changes. The poll rate is
controlled by 'repository-poll-secs'. For "explicit", model load and unload
is initiated by using the model control APIs, and only models
specified with --load-model will be loaded at startup.
--repository-poll-secs <integer>
Interval in seconds between each poll of the model
repository to check for changes. Valid only when --model-control-mode=poll is
specified.
--load-model <string>
Name of the model to be loaded on server startup. It may be
specified multiple times to add multiple models. Note that this
option will only take affect if --model-control-mode=explicit is true.
--pinned-memory-pool-byte-size <integer>
The total byte size that can be allocated as pinned system
memory. If GPU support is enabled, the server will allocate pinned
system memory to accelerate data transfer between host and devices
until it exceeds the specified byte size. If 'numa-node' is configured
via --host-policy, the pinned system memory of the pool size will be
allocated on each numa node. This option will not affect the
allocation conducted by the backend frameworks. Default is 256 MB.
--cuda-memory-pool-byte-size <<integer>:<integer>>
The total byte size that can be allocated as CUDA memory for
the GPU device. If GPU support is enabled, the server will allocate
CUDA memory to minimize data transfer between host and devices
until it exceeds the specified byte size. This option will not affect
the allocation conducted by the backend frameworks. The argument
should be 2 integers separated by colons in the format <GPU device
ID>:<pool byte size>. This option can be used multiple times, but only
once per GPU device. Subsequent uses will overwrite previous uses for
the same GPU device. Default is 64 MB.
@benhgm Are you able to move to a newer version of triton server, ideally 23.03 to match your SDK version (or move both to the latest 23.05 release)? As you have observed, using different versions between the two can cause incompatibilities.
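For example, matched server and SDK images for a single release can be pulled from NGC (shown for 23.05; any xx.yy pair works the same way):

docker pull nvcr.io/nvidia/tritonserver:23.05-py3
docker pull nvcr.io/nvidia/tritonserver:23.05-py3-sdk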
@tgerdesnv thanks for the advice! I tried that and was able to run a full analysis on my ensemble model. I got some very nice results and reports, but there is now one small error/glitch, where the model analyzer reported No GPU metric corresponding to tag 'gpu_used_memory' found in the model's measurement. Possibly comparing measurements across devices.
From the message, I guess this is because I ran the analysis over a multi-GPU instance, and if I set the --gpus flag to a specific GPU UUID, I will be able to get these metrics. I will try it out and update if I face the same error.
Otherwise, how can I enable GPU metrics reporting even on a multi-GPU instance?
This warning occurs when a measurement returned from Perf Analyzer does not contain a GPU metric, in this case the amount of memory used by the GPU, when one was expected.
Yes, please try specifying the GPU you want to profile on with the --gpus flag and let me know if this doesn't remove the warning.
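For example, the UUID can be read from nvidia-smi and passed to --gpus (the UUID below is a placeholder; the rest of the command mirrors the one earlier in the thread):

nvidia-smi -L
model-analyzer profile \
  --model-repository=/model_repository \
  --profile-models=ensemble_model \
  --triton-launch-mode=docker \
  --gpus GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx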
@nv-braf hello, I tried setting --gpus to the UUID of the GPU I want to use, and the model analysis began with the correct GPU that I specified. However, at the end of the analysis, I got the same error messages. Are there any other workarounds I can try?
Are measurements being taken? Are the charts/data being output correctly at the end of the profile? If so, then it's probably safe to ignore this warning message.
Have you specified CPU_ONLY anywhere in the original configuration? Do the resulting output model configurations have KIND_CPU or KIND_GPU under instance_group?
My concern, if you are getting no GPU metrics, is that nothing is actually running on the GPU.
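One quick way to check, assuming the output model repository path used earlier in the thread, is to grep the generated configs for the instance_group kind:

grep -r -A 3 "instance_group" /model_analyzer_outputs/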
Hi @tgerdesnv you make a good point. I realised that although I had put KIND_GPU for all my models, in my pre- and post-processing models I did not explicitly move the models to GPU using .to(torch.device("cuda")).
However, my main inference model (a CNN) has always been set to run on GPU, so I am puzzled as to why no GPU metrics were recorded for it.
Can you answer @nv-braf's question?
Are measurements being taken? Are the charts/data being output correctly at the end of the profile? If so, then it's probably safe to ignore this warning message.
Those warnings may show up if any of the models are running entirely on the CPU.
Hi @tgerdesnv @nv-braf my apologies, I missed the other question.
Yes, I am getting measurements on latency and throughput; those are fine. I was just wondering how to make the GPU metrics appear.
@tgerdesnv I understand what you mean. However, in an ensemble model, for example, where I have a pipeline of pre-processing model -> CNN -> post-processing model and only the CNN is on GPU, should I expect GPU metrics to be recorded from the CNN even though the pre- and post-processing models are on CPU?
As long as you have not set the cpu_only flag, I would expect the composing config to gather GPU metrics, and they should be shown in the summary report. Can you confirm that you are not seeing any GPU metrics (like GPU utilization or GPU memory usage) in the summary report table?
Hi @nv-braf yes, I confirm that I did not use the cpu_only flag and I did not see any GPU metrics in the summary report.
I am running the examples/add_sub model with a local model repository and CPU instances, but I am getting the following error log in the docker container:
root@cfbe7ff7cf1e:/app/ma# model-analyzer profile \
--model-repository /app/ma/examples/quick-start \
--profile-models add_sub \
--output-model-repository-path /app/ma/output11 \
--export-path profile_results --triton-launch-mode=local
[Model Analyzer] Starting a local Triton Server
[Model Analyzer] Loaded checkpoint from file /app/ma/checkpoints/0.ckpt
[Model Analyzer] GPU devices match checkpoint - skipping server metric acquisition
[Model Analyzer] Starting a local Triton Server
[Model Analyzer] Model add_sub load failed: [StatusCode.INTERNAL] failed to load 'add_sub', failed to poll from model repository
[Model Analyzer] Model readiness failed for model add_sub. Error [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8001: Failed to connect to remote host: Connection refused
[Model Analyzer] Model readiness failed for model add_sub. Error [StatusCode.UNAVAILABLE] failed to connect to all addresses; last error: UNKNOWN: ipv6:%5B::1%5D:8001: Failed to connect to remote host: Connection refused
[Model Analyzer] Saved checkpoint to /app/ma/checkpoints/1.ckpt
Traceback (most recent call last):
File "/opt/app_venv/bin/model-analyzer", line 8, in
Hi, I am trying to use Model Analyzer to analyze an ensemble model that contains two python models and one ONNX model. The python models use pytorch to perform some preprocessing and postprocessing functions.
However, when I use the following command, I get a "ModuleNotFoundError: No module named 'torch'" error.
model-analyzer profile \
  --model-repository=/model_repository \
  --profile-models=ensemble_model --triton-launch-mode=docker \
  --triton-http-endpoint=localhost:8000 --triton-grpc-endpoint=localhost:8003 --triton-metrics-url=localhost:8002 \
  --output-model-repository-path=/model_analyzer_outputs/ \
  --override-output-model-repository \
  --run-config-search-mode quick \
  --triton-output-path triton_log.txt \
  --triton-docker-image devel
How do I make sure that the docker container spun up by Model Analyzer has PyTorch installed?