Open kyosukegg opened 5 months ago
The version of DCGM you are using is incompatible with the version of MA you are running. You can fix this by using a more recent (within the last 6 months) release of MA.
I tried MA version=1.40.0, but I got the same error.
Does DCGM stand for Data Center GPU Manager? I don't have DCGM installed. Should I install it?
When I printed dcgmPath on line 54 of model_analyzer/device/gpu_device_factory.py, it was None.
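A None path does not by itself show whether the library is present, so it can help to ask the dynamic loader directly. The sketch below is only an illustration using the standard ctypes.util.find_library; the library name "dcgm" (for libdcgm.so.*) is an assumption, and locate_library is a hypothetical helper, not part of Model Analyzer.

```python
import ctypes.util
from typing import Optional


def locate_library(name: str) -> Optional[str]:
    """Return the name under which the loader can find a shared library, or None."""
    return ctypes.util.find_library(name)


# "dcgm" is an assumed library name; adjust it to match your installation.
print(locate_library("dcgm"))
```

If this prints None inside the container, the loader cannot see a library under that name from its default search paths.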
I understand. I will try again after installing DCGM.
@nv-braf Unfortunately, that did not resolve it. DCGM was already installed in the container and the GPU was recognized correctly, so I also tried installing DCGM on the host, but nothing changed. Inside the Docker container:
$ dcgmi discovery -l
1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: NVIDIA GeForce RTX 4070 |
| | PCI Bus ID: 00000000:01:00.0 |
| | Device UUID: GPU-eb2680cc-69ce-69df-3073-7268cefc8776 |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+
Do you have any ideas?
If you still have the problem, check #931.
Thanks for the information. I now have a similar workaround. In gpu_device_factory.py, at line 41:
def init_all_devices(self, dcgmPath=None):
    """
    Create GPUDevice objects for all DCGM visible
    devices.

    Parameters
    ----------
    dcgmPath : str
        Absolute path to dcgm shared library
    """
    if numba.cuda.is_available():
        logger.info("Initializing GPUDevice handles")
        structs._dcgmInit(dcgmPath)
        dcgm_agent.dcgmInit()

        # Start DCGM in the embedded mode to use the shared library
        dcgm_handle = dcgm_agent.dcgmStartEmbedded(
            structs.DCGM_OPERATION_MODE_MANUAL
        )

        # Create a GPU device for every supported DCGM device
        dcgm_device_ids = dcgm_agent.dcgmGetAllSupportedDevices(dcgm_handle)
        for device_id in dcgm_device_ids:
            device_atrributes = dcgm_agent.dcgmGetDeviceAttributes(
                dcgm_handle, device_id
            ).identifiers
            # <My custom change>-----------------------------------------
            try:
                pci_bus_id = device_atrributes.pciBusId
                device_uuid = device_atrributes.uuid
                device_name = device_atrributes.deviceName
            except UnicodeDecodeError:
                # Fall back to parsing the table printed by `dcgmi discovery -l`
                import os
                keys = ['Name', 'PCI Bus ID', 'Device UUID']
                device_atrributes = {}
                stream = os.popen('dcgmi discovery -l')
                output = stream.read().splitlines()
                for i, line in enumerate(output):
                    # Compare the whole "GPU ID" cell rather than one character,
                    # so short lines and multi-digit IDs are handled.
                    cells = line.split('|')
                    if len(cells) > 1 and cells[1].strip() == str(device_id):
                        # The device's rows start here: one row per key
                        attributes = output[i: i + len(keys)]
                        for key in keys:
                            for row in attributes:
                                if row.find(key) != -1:
                                    # The value sits between the first ':' and the last '|'
                                    sindex = row.find(':') + 1
                                    eindex = row.rfind('|')
                                    device_atrributes[key] = row[sindex:eindex].strip()
                pci_bus_id = device_atrributes[keys[1]]
                device_uuid = device_atrributes[keys[2]]
                device_name = device_atrributes[keys[0]]
            # ----------------------------------------------------------------
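The fallback above parses the dcgmi discovery -l table inline. As a rough sketch, the same idea can be factored into a standalone function that is easier to test outside Model Analyzer. The function name parse_dcgmi_discovery and the SAMPLE string (copied from the output earlier in this thread) are illustrative, and the parser assumes the two-column table layout shown there.

```python
from typing import Dict

# Sample output copied from the `dcgmi discovery -l` run earlier in this thread.
SAMPLE = """\
1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: NVIDIA GeForce RTX 4070 |
| | PCI Bus ID: 00000000:01:00.0 |
| | Device UUID: GPU-eb2680cc-69ce-69df-3073-7268cefc8776 |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+
"""


def parse_dcgmi_discovery(output: str) -> Dict[int, Dict[str, str]]:
    """Map each GPU ID to its 'Name', 'PCI Bus ID' and 'Device UUID' fields."""
    devices: Dict[int, Dict[str, str]] = {}
    current_id = None
    for line in output.splitlines():
        # Only rows of the form "| <id or blank> | <key>: <value> |" carry data.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue  # e.g. the single-column NvSwitch table
        id_cell, info = cells
        if id_cell.isdigit():
            # A numeric first cell starts a new device entry.
            current_id = int(id_cell)
            devices[current_id] = {}
        if current_id is None or ":" not in info:
            continue  # header row, or a row before any device was seen
        # Split on the first ':' only, since PCI bus IDs contain colons.
        key, value = info.split(":", 1)
        devices[current_id][key.strip()] = value.strip()
    return devices


gpus = parse_dcgmi_discovery(SAMPLE)
print(gpus[0]["Name"])
```

Keying the result by the integer GPU ID avoids comparing single characters of a row, which would break for IDs of 10 and above.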
When I run model-analyzer, I get "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte". The same problem occurs with the latest tag, 24.05-py3-sdk. Why does this error occur, and how can I get rid of it?
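As a generic illustration of the error message itself (not a diagnosis of what DCGM returns here): 0xf8 is never a valid first byte of a UTF-8 sequence, so strict decoding of any byte string that begins with it fails in exactly this way. The bytes below are made up for the example.

```python
# Hypothetical bytes: a stray 0xf8 followed by ordinary ASCII text.
raw = b"\xf8GeForce"

try:
    raw.decode("utf-8")
    message = ""
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# A tolerant decode replaces the undecodable byte with U+FFFD instead of raising.
tolerant = raw.decode("utf-8", errors="replace")
print(tolerant)
```

Tolerant decoding only hides the symptom. If the field really contains garbage, the resulting string is still not a usable device name.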
[compose.yml]
[command]
[Error]
[value of device_atrributes.deviceName]
[environment]
Best regards,