triton-inference-server / model_analyzer

Triton Model Analyzer is a CLI tool that helps you understand the compute and memory requirements of models served by Triton Inference Server.
Apache License 2.0

DCGM initialization error #910

Open minhhoai1001 opened 2 months ago

minhhoai1001 commented 2 months ago

I run Docker on an A100 server:

```
docker run -it --rm --gpus all --net=host \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ${PWD}:/workspace/ --shm-size 8G \
    nvcr.io/nvidia/tritonserver:22.12-py3-sdk
```

Then I run:

```
model-analyzer profile --model-repository /workspace/model_repository \
    --profile-models feature_extract \
    --triton-launch-mode=docker --triton-docker-shm-size=8G \
    --output-model-repository-path /workspace/model_optimizer/feature_extract \
    --export-path ./report
```

I got this error:

```
[Model Analyzer] Initializing GPUDevice handles
CacheManager Init Failed. Error: -17
Traceback (most recent call last):
  File "/usr/local/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/model_analyzer/entrypoint.py", line 251, in main
    gpus = GPUDeviceFactory().verify_requested_gpus(config.gpus)
  File "/usr/local/lib/python3.8/dist-packages/model_analyzer/device/gpu_device_factory.py", line 36, in __init__
    self.init_all_devices()
  File "/usr/local/lib/python3.8/dist-packages/model_analyzer/device/gpu_device_factory.py", line 55, in init_all_devices
    dcgm_handle = dcgm_agent.dcgmStartEmbedded(
  File "/usr/local/lib/python3.8/dist-packages/model_analyzer/monitor/dcgm/dcgm_agent.py", line 41, in dcgmStartEmbedded
    dcgm_structs._dcgmCheckReturn(ret)
  File "/usr/local/lib/python3.8/dist-packages/model_analyzer/monitor/dcgm/dcgm_structs.py", line 646, in _dcgmCheckReturn
    raise DCGMError(ret)
model_analyzer.monitor.dcgm.dcgm_structs.DCGMError_InitError: DCGM initialization error
```
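Since the failure happens while Model Analyzer is initializing its DCGM handle, one thing worth checking first is whether the container can see the GPUs at all. A minimal sanity-check sketch, assuming `nvidia-smi` is on the PATH inside the SDK image (it normally is):

```shell
#!/bin/sh
# Sanity check: can this container enumerate the GPUs?
# If nvidia-smi is missing or lists no devices, DCGM will not be able to
# initialize either, and model-analyzer will fail as in the traceback above.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L   # prints one line per visible GPU
else
    echo "nvidia-smi not found: was the container started with --gpus all?"
fi
```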

nv-braf commented 1 month ago

I see that you are running an older version of tritonserver (22.12). Can you please update to a more recent version and see if that resolves your issue?
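As a concrete sketch of that suggestion, something like the following should work. The `24.01-py3-sdk` tag is an assumption for illustration, not a tag confirmed in this thread; check the NGC catalog for the most recent SDK tag:

```shell
# Hypothetical example of moving to a newer SDK image; the tag below is an
# assumption -- pick the latest tritonserver *-py3-sdk tag listed on NGC.
docker pull nvcr.io/nvidia/tritonserver:24.01-py3-sdk
docker run -it --rm --gpus all --net=host \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ${PWD}:/workspace/ --shm-size 8G \
    nvcr.io/nvidia/tritonserver:24.01-py3-sdk
```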