influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Support Nvidia DCGM #5812

Closed yee379 closed 1 year ago

yee379 commented 5 years ago

Feature Request

There is a more comprehensive monitoring solution for NVIDIA GPUs called DCGM. It would be great if Telegraf could read data from it and export that data.

Exporting of metrics to collectd and Prometheus is already available.

Proposal:

Current behavior:

Monitoring of NVIDIA graphics cards is only possible via nvidia-smi. Other statistics that DCGM provides, such as PCIe/NVLink utilisation, would be very useful.

Desired behavior:

see above.

Use case:

Monitoring GPU resources in large clusters is very important to us.

powersj commented 2 years ago

Hi,

Sorry no one has gotten back to you on this.

I think this might be interesting for others to have. However, I would be curious to see what other stats DCGM could offer over and above nvidia-smi. Other than utilization, are there other stats?

Thanks!

vallerul commented 2 years ago

Hello, I too have a hard requirement for DCGM to work with Telegraf. The Telegraf GPU plugin is broken when I enable MIG. We use NVIDIA A100s with MIG enabled, and there is no way to get metrics from nvidia-smi for GPUs with MIG. The only way I could get metrics is by using DCGM. Also, for GPUs without MIG enabled, DCGM can give process-based metrics instead of GPU-based metrics. This is helpful when multiple processes run on the same GPU.

powersj commented 2 years ago

Telegraf GPU plugin is broken

Did you file an issue with your config and error messages that you are seeing? Adding a comment on a feature request is not the way to get this fixed.

no way to get metrics from nvidia-smi for GPUs with MIG

Is the output wrong? Or does it not work at all?

MIG

Is my internet searching correct and this means multi-instance GPUs?

vallerul commented 2 years ago

Yes, I understand. I have not raised an issue yet; I will raise one soon. There are no errors. It's just that it does not show any metrics, since the processes are running in each GPU instance. Yes, MIG means Multi-Instance GPU. I take back my statement that nvidia-smi does not show MIG info. It does show some information, but not the required info such as utilization metrics:

0 7 0 14 1 0 0 0 0 0 9728 MiB 4332 MiB 5395 MiB 16383 MiB 2 MiB 16381 MiB

carlos-encs commented 1 year ago

It's been a year since the last comment. Any news about DCGM integration into Telegraf?

Regards

powersj commented 1 year ago

@carlos-encs @vallerul,

Could one of you provide the full output of nvidia-smi -q -x with one of these devices? None of the examples we have contain any MIG data.

Additionally, what values are you interested in collecting and reporting?

Thanks
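
For anyone who wants to explore the MIG data before deciding which values to report, here is a minimal Python sketch of reading per-instance framebuffer memory out of the XML. It is not how the Telegraf plugin works; the helper names `mib` and `mig_memory` are made up for illustration, and the element names (`mig_devices`, `mig_device`, `fb_memory_usage`) are assumed to match the `nvidia-smi -q -x` output posted further down in this thread.

```python
import xml.etree.ElementTree as ET

# Trimmed sample shaped like the `nvidia-smi -q -x` output posted in this thread.
SAMPLE = """<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <uuid>GPU-513536b6-7d19-9063-b049-1e69664bb298</uuid>
    <mig_devices>
      <mig_device>
        <index>0</index>
        <gpu_instance_id>3</gpu_instance_id>
        <compute_instance_id>0</compute_instance_id>
        <fb_memory_usage>
          <total>19968 MiB</total>
          <used>12 MiB</used>
          <free>19955 MiB</free>
        </fb_memory_usage>
      </mig_device>
    </mig_devices>
  </gpu>
  <gpu id="00000000:41:00.0">
    <uuid>GPU-d8594bd4-0fc3-8595-e355-7c98138f3b95</uuid>
    <mig_devices>
      None
    </mig_devices>
  </gpu>
</nvidia_smi_log>"""


def mib(text):
    """Convert a value such as '19968 MiB' to an int; return None for 'N/A' or missing."""
    if not text:
        return None
    parts = text.split()
    if not parts or not parts[0].isdigit():
        return None
    return int(parts[0])


def mig_memory(xml_text):
    """Return one dict per <mig_device> with its parent GPU UUID and framebuffer usage."""
    rows = []
    root = ET.fromstring(xml_text)
    for gpu in root.iter("gpu"):
        # A GPU without instances has only the text "None" here, so findall is empty.
        for dev in gpu.findall("./mig_devices/mig_device"):
            rows.append({
                "gpu_uuid": gpu.findtext("uuid"),
                "index": int(dev.findtext("index")),
                "gpu_instance_id": int(dev.findtext("gpu_instance_id")),
                "fb_total_mib": mib(dev.findtext("fb_memory_usage/total")),
                "fb_used_mib": mib(dev.findtext("fb_memory_usage/used")),
            })
    return rows


if __name__ == "__main__":
    for row in mig_memory(SAMPLE):
        print(row)
```

Note that this only covers memory. In the posted output the GPU-level `utilization` block reports `N/A` when MIG is enabled, which is the gap DCGM is being asked to fill.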

carlos-encs commented 1 year ago

@powersj I hope this will help:

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-513536b6-7d19-9063-b049-1e69664bb298)
  MIG 1g.20gb     Device  0: (UUID: MIG-ce861fb0-a248-58ce-a9a8-f9adbc22a66a)
  MIG 1g.20gb     Device  1: (UUID: MIG-a45882f0-9215-5733-a0c1-49286ab31cb5)
  MIG 1g.20gb     Device  2: (UUID: MIG-f00d722c-f4e1-52c9-aa31-750441ae1bfe)
  MIG 1g.20gb     Device  3: (UUID: MIG-930574a9-d43a-5d2b-95da-3e68384c8d6b)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-d8594bd4-0fc3-8595-e355-7c98138f3b95)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-5e375e7e-b41f-155b-dcdc-534524c356fc)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-69ec2344-ebac-b028-f88f-452abb3d7f11)
$ nvidia-smi -q -x
<?xml version="1.0" ?>
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v12.dtd">
<nvidia_smi_log>
        <timestamp>Fri Aug  4 11:44:30 2023</timestamp>
        <driver_version>535.54.03</driver_version>
        <cuda_version>12.2</cuda_version>
        <attached_gpus>4</attached_gpus>
        <gpu id="00000000:01:00.0">
                <product_name>NVIDIA A100-SXM4-80GB</product_name>
                <product_brand>NVIDIA</product_brand>
                <product_architecture>Ampere</product_architecture>
                <display_mode>Enabled</display_mode>
                <display_active>Disabled</display_active>
                <persistence_mode>Disabled</persistence_mode>
                <addressing_mode>None</addressing_mode>
                <mig_mode>
                        <current_mig>Enabled</current_mig>
                        <pending_mig>Enabled</pending_mig>
                </mig_mode>
                <mig_devices>
                <mig_device>
                        <index>0</index>
                        <gpu_instance_id>3</gpu_instance_id>
                        <compute_instance_id>0</compute_instance_id>
                        <device_attributes>
                                <shared>
                                        <multiprocessor_count>14</multiprocessor_count>
                                        <copy_engine_count>1</copy_engine_count>
                                        <encoder_count>0</encoder_count>
                                        <decoder_count>1</decoder_count>
                                        <ofa_count>0</ofa_count>
                                        <jpg_count>0</jpg_count>
                                </shared>
                        </device_attributes>
                        <ecc_error_count>
                                <volatile_count>
                                        <sram_uncorrectable>0</sram_uncorrectable>
                                </volatile_count>
                        </ecc_error_count>
                        <fb_memory_usage>
                                <total>19968 MiB</total>
                                <reserved>0 MiB</reserved>
                                <used>12 MiB</used>
                                <free>19955 MiB</free>
                        </fb_memory_usage>
                        <bar1_memory_usage>
                                <total>32767 MiB</total>
                                <used>0 MiB</used>
                                <free>32767 MiB</free>
                        </bar1_memory_usage>
                </mig_device>
                <mig_device>
                        <index>1</index>
                        <gpu_instance_id>4</gpu_instance_id>
                        <compute_instance_id>0</compute_instance_id>
                        <device_attributes>
                                <shared>
                                        <multiprocessor_count>14</multiprocessor_count>
                                        <copy_engine_count>1</copy_engine_count>
                                        <encoder_count>0</encoder_count>
                                        <decoder_count>1</decoder_count>
                                        <ofa_count>0</ofa_count>
                                        <jpg_count>0</jpg_count>
                                </shared>
                        </device_attributes>
                        <ecc_error_count>
                                <volatile_count>
                                        <sram_uncorrectable>0</sram_uncorrectable>
                                </volatile_count>
                        </ecc_error_count>
                        <fb_memory_usage>
                                <total>19968 MiB</total>
                                <reserved>0 MiB</reserved>
                                <used>12 MiB</used>
                                <free>19955 MiB</free>
                        </fb_memory_usage>
                        <bar1_memory_usage>
                                <total>32767 MiB</total>
                                <used>0 MiB</used>
                                <free>32767 MiB</free>
                        </bar1_memory_usage>
                </mig_device>
                <mig_device>
                        <index>2</index>
                        <gpu_instance_id>5</gpu_instance_id>
                        <compute_instance_id>0</compute_instance_id>
                        <device_attributes>
                                <shared>
                                        <multiprocessor_count>14</multiprocessor_count>
                                        <copy_engine_count>1</copy_engine_count>
                                        <encoder_count>0</encoder_count>
                                        <decoder_count>1</decoder_count>
                                        <ofa_count>0</ofa_count>
                                        <jpg_count>0</jpg_count>
                                </shared>
                        </device_attributes>
                        <ecc_error_count>
                                <volatile_count>
                                        <sram_uncorrectable>0</sram_uncorrectable>
                                </volatile_count>
                        </ecc_error_count>
                        <fb_memory_usage>
                                <total>19968 MiB</total>
                                <reserved>0 MiB</reserved>
                                <used>12 MiB</used>
                                <free>19955 MiB</free>
                        </fb_memory_usage>
                        <bar1_memory_usage>
                                <total>32767 MiB</total>
                                <used>0 MiB</used>
                                <free>32767 MiB</free>
                        </bar1_memory_usage>
                </mig_device>
                <mig_device>
                        <index>3</index>
                        <gpu_instance_id>6</gpu_instance_id>
                        <compute_instance_id>0</compute_instance_id>
                        <device_attributes>
                                <shared>
                                        <multiprocessor_count>14</multiprocessor_count>
                                        <copy_engine_count>1</copy_engine_count>
                                        <encoder_count>0</encoder_count>
                                        <decoder_count>1</decoder_count>
                                        <ofa_count>0</ofa_count>
                                        <jpg_count>0</jpg_count>
                                </shared>
                        </device_attributes>
                        <ecc_error_count>
                                <volatile_count>
                                        <sram_uncorrectable>0</sram_uncorrectable>
                                </volatile_count>
                        </ecc_error_count>
                        <fb_memory_usage>
                                <total>19968 MiB</total>
                                <reserved>0 MiB</reserved>
                                <used>12 MiB</used>
                                <free>19955 MiB</free>
                        </fb_memory_usage>
                        <bar1_memory_usage>
                                <total>32767 MiB</total>
                                <used>0 MiB</used>
                                <free>32767 MiB</free>
                        </bar1_memory_usage>
                </mig_device>
                </mig_devices>
                <accounting_mode>Disabled</accounting_mode>
                <accounting_mode_buffer_size>4000</accounting_mode_buffer_size>
                <driver_model>
                        <current_dm>N/A</current_dm>
                        <pending_dm>N/A</pending_dm>
                </driver_model>
                <serial>1650522003820</serial>
                <uuid>GPU-513536b6-7d19-9063-b049-1e69664bb298</uuid>
                <minor_number>1</minor_number>
                <vbios_version>92.00.36.00.02</vbios_version>
                <multigpu_board>No</multigpu_board>
                <board_id>0x100</board_id>
                <board_part_number>692-2G506-0212-002</board_part_number>
                <gpu_part_number>20B2-895-A1</gpu_part_number>
                <gpu_fru_part_number>N/A</gpu_fru_part_number>
                <gpu_module_id>4</gpu_module_id>
                <inforom_version>
                        <img_version>G506.0212.00.01</img_version>
                        <oem_object>2.0</oem_object>
                        <ecc_object>6.16</ecc_object>
                        <pwr_object>N/A</pwr_object>
                </inforom_version>
                <gpu_operation_mode>
                        <current_gom>N/A</current_gom>
                        <pending_gom>N/A</pending_gom>
                </gpu_operation_mode>
                <gsp_firmware_version>535.54.03</gsp_firmware_version>
                <gpu_virtualization_mode>
                        <virtualization_mode>None</virtualization_mode>
                        <host_vgpu_mode>N/A</host_vgpu_mode>
                </gpu_virtualization_mode>
                <gpu_reset_status>
                        <reset_required>No</reset_required>
                        <drain_and_reset_recommended>No</drain_and_reset_recommended>
                </gpu_reset_status>
                <ibmnpu>
                        <relaxed_ordering_mode>N/A</relaxed_ordering_mode>
                </ibmnpu>
                <pci>
                        <pci_bus>01</pci_bus>
                        <pci_device>00</pci_device>
                        <pci_domain>0000</pci_domain>
                        <pci_device_id>20B210DE</pci_device_id>
                        <pci_bus_id>00000000:01:00.0</pci_bus_id>
                        <pci_sub_system_id>147F10DE</pci_sub_system_id>
                        <pci_gpu_link_info>
                                <pcie_gen>
                                        <max_link_gen>4</max_link_gen>
                                        <current_link_gen>4</current_link_gen>
                                        <device_current_link_gen>4</device_current_link_gen>
                                        <max_device_link_gen>4</max_device_link_gen>
                                        <max_host_link_gen>4</max_host_link_gen>
                                </pcie_gen>
                                <link_widths>
                                        <max_link_width>16x</max_link_width>
                                        <current_link_width>16x</current_link_width>
                                </link_widths>
                        </pci_gpu_link_info>
                        <pci_bridge_chip>
                                <bridge_chip_type>N/A</bridge_chip_type>
                                <bridge_chip_fw>N/A</bridge_chip_fw>
                        </pci_bridge_chip>
                        <replay_counter>0</replay_counter>
                        <replay_rollover_counter>0</replay_rollover_counter>
                        <tx_util>4000 KB/s</tx_util>
                        <rx_util>0 KB/s</rx_util>
                        <atomic_caps_inbound>N/A</atomic_caps_inbound>
                        <atomic_caps_outbound>N/A</atomic_caps_outbound>
                </pci>
                <fan_speed>N/A</fan_speed>
                <performance_state>P0</performance_state>
                <clocks_event_reasons>
                        <clocks_event_reason_gpu_idle>Not Active</clocks_event_reason_gpu_idle>
                        <clocks_event_reason_applications_clocks_setting>Not Active</clocks_event_reason_applications_clocks_setting>
                        <clocks_event_reason_sw_power_cap>Not Active</clocks_event_reason_sw_power_cap>
                        <clocks_event_reason_hw_slowdown>Not Active</clocks_event_reason_hw_slowdown>
                        <clocks_event_reason_hw_thermal_slowdown>Not Active</clocks_event_reason_hw_thermal_slowdown>
                        <clocks_event_reason_hw_power_brake_slowdown>Not Active</clocks_event_reason_hw_power_brake_slowdown>
                        <clocks_event_reason_sync_boost>Not Active</clocks_event_reason_sync_boost>
                        <clocks_event_reason_sw_thermal_slowdown>Not Active</clocks_event_reason_sw_thermal_slowdown>
                        <clocks_event_reason_display_clocks_setting>Not Active</clocks_event_reason_display_clocks_setting>
                </clocks_event_reasons>
                <fb_memory_usage>
                        <total>81920 MiB</total>
                        <reserved>869 MiB</reserved>
                        <used>50 MiB</used>
                        <free>80999 MiB</free>
                </fb_memory_usage>
                <bar1_memory_usage>
                        <total>131072 MiB</total>
                        <used>1 MiB</used>
                        <free>131071 MiB</free>
                </bar1_memory_usage>
                <cc_protected_memory_usage>
                        <total>0 MiB</total>
                        <used>0 MiB</used>
                        <free>0 MiB</free>
                </cc_protected_memory_usage>
                <compute_mode>Default</compute_mode>
                <utilization>
                        <gpu_util>N/A</gpu_util>
                        <memory_util>N/A</memory_util>
                        <encoder_util>N/A</encoder_util>
                        <decoder_util>N/A</decoder_util>
                        <jpeg_util>N/A</jpeg_util>
                        <ofa_util>N/A</ofa_util>
                </utilization>
                <encoder_stats>
                        <session_count>0</session_count>
                        <average_fps>0</average_fps>
                        <average_latency>0</average_latency>
                </encoder_stats>
                <fbc_stats>
                        <session_count>0</session_count>
                        <average_fps>0</average_fps>
                        <average_latency>0</average_latency>
                </fbc_stats>
                <ecc_mode>
                        <current_ecc>Enabled</current_ecc>
                        <pending_ecc>Enabled</pending_ecc>
                </ecc_mode>
                <ecc_errors>
                        <volatile>
                                <sram_correctable>0</sram_correctable>
                                <sram_uncorrectable>0</sram_uncorrectable>
                                <dram_correctable>0</dram_correctable>
                                <dram_uncorrectable>0</dram_uncorrectable>
                        </volatile>
                        <aggregate>
                                <sram_correctable>0</sram_correctable>
                                <sram_uncorrectable>0</sram_uncorrectable>
                                <dram_correctable>0</dram_correctable>
                                <dram_uncorrectable>0</dram_uncorrectable>
                        </aggregate>
                </ecc_errors>
                <retired_pages>
                        <multiple_single_bit_retirement>
                                <retired_count>N/A</retired_count>
                                <retired_pagelist>N/A</retired_pagelist>
                        </multiple_single_bit_retirement>
                        <double_bit_retirement>
                                <retired_count>N/A</retired_count>
                                <retired_pagelist>N/A</retired_pagelist>
                        </double_bit_retirement>
                        <pending_blacklist>N/A</pending_blacklist>
                        <pending_retirement>N/A</pending_retirement>
                </retired_pages>
                <remapped_rows>N/A</remapped_rows>
                <temperature>
                        <gpu_temp>27 C</gpu_temp>
                        <gpu_temp_tlimit>N/A</gpu_temp_tlimit>
                        <gpu_temp_max_threshold>92 C</gpu_temp_max_threshold>
                        <gpu_temp_slow_threshold>89 C</gpu_temp_slow_threshold>
                        <gpu_temp_max_gpu_threshold>85 C</gpu_temp_max_gpu_threshold>
                        <gpu_target_temperature>N/A</gpu_target_temperature>
                        <memory_temp>44 C</memory_temp>
                        <gpu_temp_max_mem_threshold>95 C</gpu_temp_max_mem_threshold>
                </temperature>
                <supported_gpu_target_temp>
                        <gpu_target_temp_min>N/A</gpu_target_temp_min>
                        <gpu_target_temp_max>N/A</gpu_target_temp_max>
                </supported_gpu_target_temp>
                <gpu_power_readings>
                        <power_state>P0</power_state>
                        <power_draw>67.03 W</power_draw>
                        <current_power_limit>500.00 W</current_power_limit>
                        <requested_power_limit>500.00 W</requested_power_limit>
                        <default_power_limit>500.00 W</default_power_limit>
                        <min_power_limit>100.00 W</min_power_limit>
                        <max_power_limit>500.00 W</max_power_limit>
                </gpu_power_readings>
                <module_power_readings>
                        <power_state>P0</power_state>
                        <power_draw>N/A</power_draw>
                        <current_power_limit>N/A</current_power_limit>
                        <requested_power_limit>N/A</requested_power_limit>
                        <default_power_limit>N/A</default_power_limit>
                        <min_power_limit>N/A</min_power_limit>
                        <max_power_limit>N/A</max_power_limit>
                </module_power_readings>
                <clocks>
                        <graphics_clock>1275 MHz</graphics_clock>
                        <sm_clock>1275 MHz</sm_clock>
                        <mem_clock>1593 MHz</mem_clock>
                        <video_clock>1275 MHz</video_clock>
                </clocks>
                <applications_clocks>
                        <graphics_clock>1275 MHz</graphics_clock>
                        <mem_clock>1593 MHz</mem_clock>
                </applications_clocks>
                <default_applications_clocks>
                        <graphics_clock>1275 MHz</graphics_clock>
                        <mem_clock>1593 MHz</mem_clock>
                </default_applications_clocks>
                <deferred_clocks>
                        <mem_clock>N/A</mem_clock>
                </deferred_clocks>
                <max_clocks>
                        <graphics_clock>1410 MHz</graphics_clock>
                        <sm_clock>1410 MHz</sm_clock>
                        <mem_clock>1593 MHz</mem_clock>
                        <video_clock>1290 MHz</video_clock>
                </max_clocks>
                <max_customer_boost_clocks>
                        <graphics_clock>1410 MHz</graphics_clock>
                </max_customer_boost_clocks>
                <clock_policy>
                        <auto_boost>N/A</auto_boost>
                        <auto_boost_default>N/A</auto_boost_default>
                </clock_policy>
                <voltage>
                        <graphics_volt>912.500 mV</graphics_volt>
                </voltage>
                <fabric>
                        <state>N/A</state>
                        <status>N/A</status>
                </fabric>
                <supported_clocks>
                        <supported_mem_clock>
                                <value>1593 MHz</value>
                                <supported_graphics_clock>1410 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1395 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1380 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1365 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1350 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1335 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1320 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1305 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1290 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1275 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1260 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1245 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1230 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1215 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1200 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1185 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1170 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1155 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1140 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1125 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1110 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1095 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1080 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1065 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1050 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1035 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1020 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1005 MHz</supported_graphics_clock>
                                <supported_graphics_clock>990 MHz</supported_graphics_clock>
                                <supported_graphics_clock>975 MHz</supported_graphics_clock>
                                <supported_graphics_clock>960 MHz</supported_graphics_clock>
                                <supported_graphics_clock>945 MHz</supported_graphics_clock>
                                <supported_graphics_clock>930 MHz</supported_graphics_clock>
                                <supported_graphics_clock>915 MHz</supported_graphics_clock>
                                <supported_graphics_clock>900 MHz</supported_graphics_clock>
                                <supported_graphics_clock>885 MHz</supported_graphics_clock>
                                <supported_graphics_clock>870 MHz</supported_graphics_clock>
                                <supported_graphics_clock>855 MHz</supported_graphics_clock>
                                <supported_graphics_clock>840 MHz</supported_graphics_clock>
                                <supported_graphics_clock>825 MHz</supported_graphics_clock>
                                <supported_graphics_clock>810 MHz</supported_graphics_clock>
                                <supported_graphics_clock>795 MHz</supported_graphics_clock>
                                <supported_graphics_clock>780 MHz</supported_graphics_clock>
                                <supported_graphics_clock>765 MHz</supported_graphics_clock>
                                <supported_graphics_clock>750 MHz</supported_graphics_clock>
                                <supported_graphics_clock>735 MHz</supported_graphics_clock>
                                <supported_graphics_clock>720 MHz</supported_graphics_clock>
                                <supported_graphics_clock>705 MHz</supported_graphics_clock>
                                <supported_graphics_clock>690 MHz</supported_graphics_clock>
                                <supported_graphics_clock>675 MHz</supported_graphics_clock>
                                <supported_graphics_clock>660 MHz</supported_graphics_clock>
                                <supported_graphics_clock>645 MHz</supported_graphics_clock>
                                <supported_graphics_clock>630 MHz</supported_graphics_clock>
                                <supported_graphics_clock>615 MHz</supported_graphics_clock>
                                <supported_graphics_clock>600 MHz</supported_graphics_clock>
                                <supported_graphics_clock>585 MHz</supported_graphics_clock>
                                <supported_graphics_clock>570 MHz</supported_graphics_clock>
                                <supported_graphics_clock>555 MHz</supported_graphics_clock>
                                <supported_graphics_clock>540 MHz</supported_graphics_clock>
                                <supported_graphics_clock>525 MHz</supported_graphics_clock>
                                <supported_graphics_clock>510 MHz</supported_graphics_clock>
                                <supported_graphics_clock>495 MHz</supported_graphics_clock>
                                <supported_graphics_clock>480 MHz</supported_graphics_clock>
                                <supported_graphics_clock>465 MHz</supported_graphics_clock>
                                <supported_graphics_clock>450 MHz</supported_graphics_clock>
                                <supported_graphics_clock>435 MHz</supported_graphics_clock>
                                <supported_graphics_clock>420 MHz</supported_graphics_clock>
                                <supported_graphics_clock>405 MHz</supported_graphics_clock>
                                <supported_graphics_clock>390 MHz</supported_graphics_clock>
                                <supported_graphics_clock>375 MHz</supported_graphics_clock>
                                <supported_graphics_clock>360 MHz</supported_graphics_clock>
                                <supported_graphics_clock>345 MHz</supported_graphics_clock>
                                <supported_graphics_clock>330 MHz</supported_graphics_clock>
                                <supported_graphics_clock>315 MHz</supported_graphics_clock>
                                <supported_graphics_clock>300 MHz</supported_graphics_clock>
                                <supported_graphics_clock>285 MHz</supported_graphics_clock>
                                <supported_graphics_clock>270 MHz</supported_graphics_clock>
                                <supported_graphics_clock>255 MHz</supported_graphics_clock>
                                <supported_graphics_clock>240 MHz</supported_graphics_clock>
                                <supported_graphics_clock>225 MHz</supported_graphics_clock>
                                <supported_graphics_clock>210 MHz</supported_graphics_clock>
                        </supported_mem_clock>
                </supported_clocks>
                <processes>
                </processes>
                <accounted_processes>
                </accounted_processes>
        </gpu>

        <gpu id="00000000:41:00.0">
                <product_name>NVIDIA A100-SXM4-80GB</product_name>
                <product_brand>NVIDIA</product_brand>
                <product_architecture>Ampere</product_architecture>
                <display_mode>Enabled</display_mode>
                <display_active>Disabled</display_active>
                <persistence_mode>Disabled</persistence_mode>
                <addressing_mode>None</addressing_mode>
                <mig_mode>
                        <current_mig>Enabled</current_mig>
                        <pending_mig>Enabled</pending_mig>
                </mig_mode>
                <mig_devices>
                        None
                </mig_devices>
                <accounting_mode>Disabled</accounting_mode>
                <accounting_mode_buffer_size>4000</accounting_mode_buffer_size>
                <driver_model>
                        <current_dm>N/A</current_dm>
                        <pending_dm>N/A</pending_dm>
                </driver_model>
                <serial>1650522003974</serial>
                <uuid>GPU-d8594bd4-0fc3-8595-e355-7c98138f3b95</uuid>
                <minor_number>0</minor_number>
                <vbios_version>92.00.36.00.02</vbios_version>
                <multigpu_board>No</multigpu_board>
                <board_id>0x4100</board_id>
                <board_part_number>692-2G506-0212-002</board_part_number>
                <gpu_part_number>20B2-895-A1</gpu_part_number>
                <gpu_fru_part_number>N/A</gpu_fru_part_number>
                <gpu_module_id>3</gpu_module_id>
                <inforom_version>
                        <img_version>G506.0212.00.01</img_version>
                        <oem_object>2.0</oem_object>
                        <ecc_object>6.16</ecc_object>
                        <pwr_object>N/A</pwr_object>
                </inforom_version>
                <gpu_operation_mode>
                        <current_gom>N/A</current_gom>
                        <pending_gom>N/A</pending_gom>
                </gpu_operation_mode>
                <gsp_firmware_version>535.54.03</gsp_firmware_version>
                <gpu_virtualization_mode>
                        <virtualization_mode>None</virtualization_mode>
                        <host_vgpu_mode>N/A</host_vgpu_mode>
                </gpu_virtualization_mode>
                <gpu_reset_status>
                        <reset_required>No</reset_required>
                        <drain_and_reset_recommended>No</drain_and_reset_recommended>
                </gpu_reset_status>
                <ibmnpu>
                        <relaxed_ordering_mode>N/A</relaxed_ordering_mode>
                </ibmnpu>
                <pci>
                        <pci_bus>41</pci_bus>
                        <pci_device>00</pci_device>
                        <pci_domain>0000</pci_domain>
                        <pci_device_id>20B210DE</pci_device_id>
                        <pci_bus_id>00000000:41:00.0</pci_bus_id>
                        <pci_sub_system_id>147F10DE</pci_sub_system_id>
                        <pci_gpu_link_info>
                                <pcie_gen>
                                        <max_link_gen>4</max_link_gen>
                                        <current_link_gen>4</current_link_gen>
                                        <device_current_link_gen>4</device_current_link_gen>
                                        <max_device_link_gen>4</max_device_link_gen>
                                        <max_host_link_gen>4</max_host_link_gen>
                                </pcie_gen>
                                <link_widths>
                                        <max_link_width>16x</max_link_width>
                                        <current_link_width>16x</current_link_width>
                                </link_widths>
                        </pci_gpu_link_info>
                        <pci_bridge_chip>
                                <bridge_chip_type>N/A</bridge_chip_type>
                                <bridge_chip_fw>N/A</bridge_chip_fw>
                        </pci_bridge_chip>
                        <replay_counter>0</replay_counter>
                        <replay_rollover_counter>0</replay_rollover_counter>
                        <tx_util>3000 KB/s</tx_util>
                        <rx_util>0 KB/s</rx_util>
                        <atomic_caps_inbound>N/A</atomic_caps_inbound>
                        <atomic_caps_outbound>N/A</atomic_caps_outbound>
                </pci>
                <fan_speed>N/A</fan_speed>
                <performance_state>P0</performance_state>
                <clocks_event_reasons>
                        <clocks_event_reason_gpu_idle>Not Active</clocks_event_reason_gpu_idle>
                        <clocks_event_reason_applications_clocks_setting>Not Active</clocks_event_reason_applications_clocks_setting>
                        <clocks_event_reason_sw_power_cap>Not Active</clocks_event_reason_sw_power_cap>
                        <clocks_event_reason_hw_slowdown>Not Active</clocks_event_reason_hw_slowdown>
                        <clocks_event_reason_hw_thermal_slowdown>Not Active</clocks_event_reason_hw_thermal_slowdown>
                        <clocks_event_reason_hw_power_brake_slowdown>Not Active</clocks_event_reason_hw_power_brake_slowdown>
                        <clocks_event_reason_sync_boost>Not Active</clocks_event_reason_sync_boost>
                        <clocks_event_reason_sw_thermal_slowdown>Not Active</clocks_event_reason_sw_thermal_slowdown>
                        <clocks_event_reason_display_clocks_setting>Not Active</clocks_event_reason_display_clocks_setting>
                </clocks_event_reasons>
                <fb_memory_usage>
                        <total>81920 MiB</total>
                        <reserved>869 MiB</reserved>
                        <used>0 MiB</used>
                        <free>81050 MiB</free>
                </fb_memory_usage>
                <bar1_memory_usage>
                        <total>131072 MiB</total>
                        <used>1 MiB</used>
                        <free>131071 MiB</free>
                </bar1_memory_usage>
                <cc_protected_memory_usage>
                        <total>0 MiB</total>
                        <used>0 MiB</used>
                        <free>0 MiB</free>
                </cc_protected_memory_usage>
                <compute_mode>Default</compute_mode>
                <utilization>
                        <gpu_util>N/A</gpu_util>
                        <memory_util>N/A</memory_util>
                        <encoder_util>N/A</encoder_util>
                        <decoder_util>N/A</decoder_util>
                        <jpeg_util>N/A</jpeg_util>
                        <ofa_util>N/A</ofa_util>
                </utilization>
                <encoder_stats>
                        <session_count>0</session_count>
                        <average_fps>0</average_fps>
                        <average_latency>0</average_latency>
                </encoder_stats>
                <fbc_stats>
                        <session_count>0</session_count>
                        <average_fps>0</average_fps>
                        <average_latency>0</average_latency>
                </fbc_stats>
                <ecc_mode>
                        <current_ecc>Enabled</current_ecc>
                        <pending_ecc>Enabled</pending_ecc>
                </ecc_mode>
                <ecc_errors>
                        <volatile>
                                <sram_correctable>0</sram_correctable>
                                <sram_uncorrectable>0</sram_uncorrectable>
                                <dram_correctable>0</dram_correctable>
                                <dram_uncorrectable>0</dram_uncorrectable>
                        </volatile>
                        <aggregate>
                                <sram_correctable>0</sram_correctable>
                                <sram_uncorrectable>0</sram_uncorrectable>
                                <dram_correctable>0</dram_correctable>
                                <dram_uncorrectable>0</dram_uncorrectable>
                        </aggregate>
                </ecc_errors>
                <retired_pages>
                        <multiple_single_bit_retirement>
                                <retired_count>N/A</retired_count>
                                <retired_pagelist>N/A</retired_pagelist>
                        </multiple_single_bit_retirement>
                        <double_bit_retirement>
                                <retired_count>N/A</retired_count>
                                <retired_pagelist>N/A</retired_pagelist>
                        </double_bit_retirement>
                        <pending_blacklist>N/A</pending_blacklist>
                        <pending_retirement>N/A</pending_retirement>
                </retired_pages>
                <remapped_rows>
                        <remapped_row_corr>0</remapped_row_corr>
                        <remapped_row_unc>0</remapped_row_unc>
                        <remapped_row_pending>No</remapped_row_pending>
                        <remapped_row_failure>No</remapped_row_failure>
                        <row_remapper_histogram>
                                <row_remapper_histogram_max>640 bank(s)</row_remapper_histogram_max>
                                <row_remapper_histogram_high>0 bank(s)</row_remapper_histogram_high>
                                <row_remapper_histogram_partial>0 bank(s)</row_remapper_histogram_partial>
                                <row_remapper_histogram_low>0 bank(s)</row_remapper_histogram_low>
                                <row_remapper_histogram_none>0 bank(s)</row_remapper_histogram_none>
                        </row_remapper_histogram>
                </remapped_rows>
                <temperature>
                        <gpu_temp>28 C</gpu_temp>
                        <gpu_temp_tlimit>N/A</gpu_temp_tlimit>
                        <gpu_temp_max_threshold>92 C</gpu_temp_max_threshold>
                        <gpu_temp_slow_threshold>89 C</gpu_temp_slow_threshold>
                        <gpu_temp_max_gpu_threshold>85 C</gpu_temp_max_gpu_threshold>
                        <gpu_target_temperature>N/A</gpu_target_temperature>
                        <memory_temp>45 C</memory_temp>
                        <gpu_temp_max_mem_threshold>95 C</gpu_temp_max_mem_threshold>
                </temperature>
                <supported_gpu_target_temp>
                        <gpu_target_temp_min>N/A</gpu_target_temp_min>
                        <gpu_target_temp_max>N/A</gpu_target_temp_max>
                </supported_gpu_target_temp>
                <gpu_power_readings>
                        <power_state>P0</power_state>
                        <power_draw>64.92 W</power_draw>
                        <current_power_limit>500.00 W</current_power_limit>
                        <requested_power_limit>500.00 W</requested_power_limit>
                        <default_power_limit>500.00 W</default_power_limit>
                        <min_power_limit>100.00 W</min_power_limit>
                        <max_power_limit>500.00 W</max_power_limit>
                </gpu_power_readings>
                <module_power_readings>
                        <power_state>P0</power_state>
                        <power_draw>N/A</power_draw>
                        <current_power_limit>N/A</current_power_limit>
                        <requested_power_limit>N/A</requested_power_limit>
                        <default_power_limit>N/A</default_power_limit>
                        <min_power_limit>N/A</min_power_limit>
                        <max_power_limit>N/A</max_power_limit>
                </module_power_readings>
                <clocks>
                        <graphics_clock>1275 MHz</graphics_clock>
                        <sm_clock>1275 MHz</sm_clock>
                        <mem_clock>1593 MHz</mem_clock>
                        <video_clock>1155 MHz</video_clock>
                </clocks>
                <applications_clocks>
                        <graphics_clock>1275 MHz</graphics_clock>
                        <mem_clock>1593 MHz</mem_clock>
                </applications_clocks>
                <default_applications_clocks>
                        <graphics_clock>1275 MHz</graphics_clock>
                        <mem_clock>1593 MHz</mem_clock>
                </default_applications_clocks>
                <deferred_clocks>
                        <mem_clock>N/A</mem_clock>
                </deferred_clocks>
                <max_clocks>
                        <graphics_clock>1410 MHz</graphics_clock>
                        <sm_clock>1410 MHz</sm_clock>
                        <mem_clock>1593 MHz</mem_clock>
                        <video_clock>1290 MHz</video_clock>
                </max_clocks>
                <max_customer_boost_clocks>
                        <graphics_clock>1410 MHz</graphics_clock>
                </max_customer_boost_clocks>
                <clock_policy>
                        <auto_boost>N/A</auto_boost>
                        <auto_boost_default>N/A</auto_boost_default>
                </clock_policy>
                <voltage>
                        <graphics_volt>806.250 mV</graphics_volt>
                </voltage>
                <fabric>
                        <state>N/A</state>
                        <status>N/A</status>
                </fabric>
                <supported_clocks>
                        <supported_mem_clock>
                                <value>1593 MHz</value>
                                <supported_graphics_clock>1410 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1395 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1380 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1365 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1350 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1335 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1320 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1305 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1290 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1275 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1260 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1245 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1230 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1215 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1200 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1185 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1170 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1155 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1140 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1125 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1110 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1095 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1080 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1065 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1050 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1035 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1020 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1005 MHz</supported_graphics_clock>
                                <supported_graphics_clock>990 MHz</supported_graphics_clock>
                                <supported_graphics_clock>975 MHz</supported_graphics_clock>
                                <supported_graphics_clock>960 MHz</supported_graphics_clock>
                                <supported_graphics_clock>945 MHz</supported_graphics_clock>
                                <supported_graphics_clock>930 MHz</supported_graphics_clock>
                                <supported_graphics_clock>915 MHz</supported_graphics_clock>
                                <supported_graphics_clock>900 MHz</supported_graphics_clock>
                                <supported_graphics_clock>885 MHz</supported_graphics_clock>
                                <supported_graphics_clock>870 MHz</supported_graphics_clock>
                                <supported_graphics_clock>855 MHz</supported_graphics_clock>
                                <supported_graphics_clock>840 MHz</supported_graphics_clock>
                                <supported_graphics_clock>825 MHz</supported_graphics_clock>
                                <supported_graphics_clock>810 MHz</supported_graphics_clock>
                                <supported_graphics_clock>795 MHz</supported_graphics_clock>
                                <supported_graphics_clock>780 MHz</supported_graphics_clock>
                                <supported_graphics_clock>765 MHz</supported_graphics_clock>
                                <supported_graphics_clock>750 MHz</supported_graphics_clock>
                                <supported_graphics_clock>735 MHz</supported_graphics_clock>
                                <supported_graphics_clock>720 MHz</supported_graphics_clock>
                                <supported_graphics_clock>705 MHz</supported_graphics_clock>
                                <supported_graphics_clock>690 MHz</supported_graphics_clock>
                                <supported_graphics_clock>675 MHz</supported_graphics_clock>
                                <supported_graphics_clock>660 MHz</supported_graphics_clock>
                                <supported_graphics_clock>645 MHz</supported_graphics_clock>
                                <supported_graphics_clock>630 MHz</supported_graphics_clock>
                                <supported_graphics_clock>615 MHz</supported_graphics_clock>
                                <supported_graphics_clock>600 MHz</supported_graphics_clock>
                                <supported_graphics_clock>585 MHz</supported_graphics_clock>
                                <supported_graphics_clock>570 MHz</supported_graphics_clock>
                                <supported_graphics_clock>555 MHz</supported_graphics_clock>
                                <supported_graphics_clock>540 MHz</supported_graphics_clock>
                                <supported_graphics_clock>525 MHz</supported_graphics_clock>
                                <supported_graphics_clock>510 MHz</supported_graphics_clock>
                                <supported_graphics_clock>495 MHz</supported_graphics_clock>
                                <supported_graphics_clock>480 MHz</supported_graphics_clock>
                                <supported_graphics_clock>465 MHz</supported_graphics_clock>
                                <supported_graphics_clock>450 MHz</supported_graphics_clock>
                                <supported_graphics_clock>435 MHz</supported_graphics_clock>
                                <supported_graphics_clock>420 MHz</supported_graphics_clock>
                                <supported_graphics_clock>405 MHz</supported_graphics_clock>
                                <supported_graphics_clock>390 MHz</supported_graphics_clock>
                                <supported_graphics_clock>375 MHz</supported_graphics_clock>
                                <supported_graphics_clock>360 MHz</supported_graphics_clock>
                                <supported_graphics_clock>345 MHz</supported_graphics_clock>
                                <supported_graphics_clock>330 MHz</supported_graphics_clock>
                                <supported_graphics_clock>315 MHz</supported_graphics_clock>
                                <supported_graphics_clock>300 MHz</supported_graphics_clock>
                                <supported_graphics_clock>285 MHz</supported_graphics_clock>
                                <supported_graphics_clock>270 MHz</supported_graphics_clock>
                                <supported_graphics_clock>255 MHz</supported_graphics_clock>
                                <supported_graphics_clock>240 MHz</supported_graphics_clock>
                                <supported_graphics_clock>225 MHz</supported_graphics_clock>
                                <supported_graphics_clock>210 MHz</supported_graphics_clock>
                        </supported_mem_clock>
                </supported_clocks>
                <processes>
                </processes>
                <accounted_processes>
                </accounted_processes>
        </gpu>

        <gpu id="00000000:81:00.0">
                <product_name>NVIDIA A100-SXM4-80GB</product_name>
                <product_brand>NVIDIA</product_brand>
                <product_architecture>Ampere</product_architecture>
                <display_mode>Enabled</display_mode>
                <display_active>Disabled</display_active>
                <persistence_mode>Disabled</persistence_mode>
                <addressing_mode>None</addressing_mode>
                <mig_mode>
                        <current_mig>Enabled</current_mig>
                        <pending_mig>Enabled</pending_mig>
                </mig_mode>
                <mig_devices>
                        None
                </mig_devices>
                <accounting_mode>Disabled</accounting_mode>
                <accounting_mode_buffer_size>4000</accounting_mode_buffer_size>
                <driver_model>
                        <current_dm>N/A</current_dm>
                        <pending_dm>N/A</pending_dm>
                </driver_model>
                <serial>1650522003966</serial>
                <uuid>GPU-5e375e7e-b41f-155b-dcdc-534524c356fc</uuid>
                <minor_number>3</minor_number>
                <vbios_version>92.00.36.00.02</vbios_version>
                <multigpu_board>No</multigpu_board>
                <board_id>0x8100</board_id>
                <board_part_number>692-2G506-0212-002</board_part_number>
                <gpu_part_number>20B2-895-A1</gpu_part_number>
                <gpu_fru_part_number>N/A</gpu_fru_part_number>
                <gpu_module_id>2</gpu_module_id>
                <inforom_version>
                        <img_version>G506.0212.00.01</img_version>
                        <oem_object>2.0</oem_object>
                        <ecc_object>6.16</ecc_object>
                        <pwr_object>N/A</pwr_object>
                </inforom_version>
                <gpu_operation_mode>
                        <current_gom>N/A</current_gom>
                        <pending_gom>N/A</pending_gom>
                </gpu_operation_mode>
                <gsp_firmware_version>535.54.03</gsp_firmware_version>
                <gpu_virtualization_mode>
                        <virtualization_mode>None</virtualization_mode>
                        <host_vgpu_mode>N/A</host_vgpu_mode>
                </gpu_virtualization_mode>
                <gpu_reset_status>
                        <reset_required>No</reset_required>
                        <drain_and_reset_recommended>No</drain_and_reset_recommended>
                </gpu_reset_status>
                <ibmnpu>
                        <relaxed_ordering_mode>N/A</relaxed_ordering_mode>
                </ibmnpu>
                <pci>
                        <pci_bus>81</pci_bus>
                        <pci_device>00</pci_device>
                        <pci_domain>0000</pci_domain>
                        <pci_device_id>20B210DE</pci_device_id>
                        <pci_bus_id>00000000:81:00.0</pci_bus_id>
                        <pci_sub_system_id>147F10DE</pci_sub_system_id>
                        <pci_gpu_link_info>
                                <pcie_gen>
                                        <max_link_gen>4</max_link_gen>
                                        <current_link_gen>4</current_link_gen>
                                        <device_current_link_gen>4</device_current_link_gen>
                                        <max_device_link_gen>4</max_device_link_gen>
                                        <max_host_link_gen>4</max_host_link_gen>
                                </pcie_gen>
                                <link_widths>
                                        <max_link_width>16x</max_link_width>
                                        <current_link_width>16x</current_link_width>
                                </link_widths>
                        </pci_gpu_link_info>
                        <pci_bridge_chip>
                                <bridge_chip_type>N/A</bridge_chip_type>
                                <bridge_chip_fw>N/A</bridge_chip_fw>
                        </pci_bridge_chip>
                        <replay_counter>0</replay_counter>
                        <replay_rollover_counter>0</replay_rollover_counter>
                        <tx_util>6000 KB/s</tx_util>
                        <rx_util>0 KB/s</rx_util>
                        <atomic_caps_inbound>N/A</atomic_caps_inbound>
                        <atomic_caps_outbound>N/A</atomic_caps_outbound>
                </pci>
                <fan_speed>N/A</fan_speed>
                <performance_state>P0</performance_state>
                <clocks_event_reasons>
                        <clocks_event_reason_gpu_idle>Not Active</clocks_event_reason_gpu_idle>
                        <clocks_event_reason_applications_clocks_setting>Not Active</clocks_event_reason_applications_clocks_setting>
                        <clocks_event_reason_sw_power_cap>Not Active</clocks_event_reason_sw_power_cap>
                        <clocks_event_reason_hw_slowdown>Not Active</clocks_event_reason_hw_slowdown>
                        <clocks_event_reason_hw_thermal_slowdown>Not Active</clocks_event_reason_hw_thermal_slowdown>
                        <clocks_event_reason_hw_power_brake_slowdown>Not Active</clocks_event_reason_hw_power_brake_slowdown>
                        <clocks_event_reason_sync_boost>Not Active</clocks_event_reason_sync_boost>
                        <clocks_event_reason_sw_thermal_slowdown>Not Active</clocks_event_reason_sw_thermal_slowdown>
                        <clocks_event_reason_display_clocks_setting>Not Active</clocks_event_reason_display_clocks_setting>
                </clocks_event_reasons>
                <fb_memory_usage>
                        <total>81920 MiB</total>
                        <reserved>869 MiB</reserved>
                        <used>0 MiB</used>
                        <free>81050 MiB</free>
                </fb_memory_usage>
                <bar1_memory_usage>
                        <total>131072 MiB</total>
                        <used>1 MiB</used>
                        <free>131071 MiB</free>
                </bar1_memory_usage>
                <cc_protected_memory_usage>
                        <total>0 MiB</total>
                        <used>0 MiB</used>
                        <free>0 MiB</free>
                </cc_protected_memory_usage>
                <compute_mode>Default</compute_mode>
                <utilization>
                        <gpu_util>N/A</gpu_util>
                        <memory_util>N/A</memory_util>
                        <encoder_util>N/A</encoder_util>
                        <decoder_util>N/A</decoder_util>
                        <jpeg_util>N/A</jpeg_util>
                        <ofa_util>N/A</ofa_util>
                </utilization>
                <encoder_stats>
                        <session_count>0</session_count>
                        <average_fps>0</average_fps>
                        <average_latency>0</average_latency>
                </encoder_stats>
                <fbc_stats>
                        <session_count>0</session_count>
                        <average_fps>0</average_fps>
                        <average_latency>0</average_latency>
                </fbc_stats>
                <ecc_mode>
                        <current_ecc>Enabled</current_ecc>
                        <pending_ecc>Enabled</pending_ecc>
                </ecc_mode>
                <ecc_errors>
                        <volatile>
                                <sram_correctable>0</sram_correctable>
                                <sram_uncorrectable>0</sram_uncorrectable>
                                <dram_correctable>0</dram_correctable>
                                <dram_uncorrectable>0</dram_uncorrectable>
                        </volatile>
                        <aggregate>
                                <sram_correctable>0</sram_correctable>
                                <sram_uncorrectable>0</sram_uncorrectable>
                                <dram_correctable>0</dram_correctable>
                                <dram_uncorrectable>0</dram_uncorrectable>
                        </aggregate>
                </ecc_errors>
                <retired_pages>
                        <multiple_single_bit_retirement>
                                <retired_count>N/A</retired_count>
                                <retired_pagelist>N/A</retired_pagelist>
                        </multiple_single_bit_retirement>
                        <double_bit_retirement>
                                <retired_count>N/A</retired_count>
                                <retired_pagelist>N/A</retired_pagelist>
                        </double_bit_retirement>
                        <pending_blacklist>N/A</pending_blacklist>
                        <pending_retirement>N/A</pending_retirement>
                </retired_pages>
                <remapped_rows>
                        <remapped_row_corr>0</remapped_row_corr>
                        <remapped_row_unc>0</remapped_row_unc>
                        <remapped_row_pending>No</remapped_row_pending>
                        <remapped_row_failure>No</remapped_row_failure>
                        <row_remapper_histogram>
                                <row_remapper_histogram_max>640 bank(s)</row_remapper_histogram_max>
                                <row_remapper_histogram_high>0 bank(s)</row_remapper_histogram_high>
                                <row_remapper_histogram_partial>0 bank(s)</row_remapper_histogram_partial>
                                <row_remapper_histogram_low>0 bank(s)</row_remapper_histogram_low>
                                <row_remapper_histogram_none>0 bank(s)</row_remapper_histogram_none>
                        </row_remapper_histogram>
                </remapped_rows>
                <temperature>
                        <gpu_temp>25 C</gpu_temp>
                        <gpu_temp_tlimit>N/A</gpu_temp_tlimit>
                        <gpu_temp_max_threshold>92 C</gpu_temp_max_threshold>
                        <gpu_temp_slow_threshold>89 C</gpu_temp_slow_threshold>
                        <gpu_temp_max_gpu_threshold>85 C</gpu_temp_max_gpu_threshold>
                        <gpu_target_temperature>N/A</gpu_target_temperature>
                        <memory_temp>42 C</memory_temp>
                        <gpu_temp_max_mem_threshold>95 C</gpu_temp_max_mem_threshold>
                </temperature>
                <supported_gpu_target_temp>
                        <gpu_target_temp_min>N/A</gpu_target_temp_min>
                        <gpu_target_temp_max>N/A</gpu_target_temp_max>
                </supported_gpu_target_temp>
                <gpu_power_readings>
                        <power_state>P0</power_state>
                        <power_draw>64.60 W</power_draw>
                        <current_power_limit>500.00 W</current_power_limit>
                        <requested_power_limit>500.00 W</requested_power_limit>
                        <default_power_limit>500.00 W</default_power_limit>
                        <min_power_limit>100.00 W</min_power_limit>
                        <max_power_limit>500.00 W</max_power_limit>
                </gpu_power_readings>
                <module_power_readings>
                        <power_state>P0</power_state>
                        <power_draw>N/A</power_draw>
                        <current_power_limit>N/A</current_power_limit>
                        <requested_power_limit>N/A</requested_power_limit>
                        <default_power_limit>N/A</default_power_limit>
                        <min_power_limit>N/A</min_power_limit>
                        <max_power_limit>N/A</max_power_limit>
                </module_power_readings>
                <clocks>
                        <graphics_clock>1275 MHz</graphics_clock>
                        <sm_clock>1275 MHz</sm_clock>
                        <mem_clock>1593 MHz</mem_clock>
                        <video_clock>1155 MHz</video_clock>
                </clocks>
                <applications_clocks>
                        <graphics_clock>1275 MHz</graphics_clock>
                        <mem_clock>1593 MHz</mem_clock>
                </applications_clocks>
                <default_applications_clocks>
                        <graphics_clock>1275 MHz</graphics_clock>
                        <mem_clock>1593 MHz</mem_clock>
                </default_applications_clocks>
                <deferred_clocks>
                        <mem_clock>N/A</mem_clock>
                </deferred_clocks>
                <max_clocks>
                        <graphics_clock>1410 MHz</graphics_clock>
                        <sm_clock>1410 MHz</sm_clock>
                        <mem_clock>1593 MHz</mem_clock>
                        <video_clock>1290 MHz</video_clock>
                </max_clocks>
                <max_customer_boost_clocks>
                        <graphics_clock>1410 MHz</graphics_clock>
                </max_customer_boost_clocks>
                <clock_policy>
                        <auto_boost>N/A</auto_boost>
                        <auto_boost_default>N/A</auto_boost_default>
                </clock_policy>
                <voltage>
                        <graphics_volt>818.750 mV</graphics_volt>
                </voltage>
                <fabric>
                        <state>N/A</state>
                        <status>N/A</status>
                </fabric>
                <supported_clocks>
                        <supported_mem_clock>
                                <value>1593 MHz</value>
                                <supported_graphics_clock>1410 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1395 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1380 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1365 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1350 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1335 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1320 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1305 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1290 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1275 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1260 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1245 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1230 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1215 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1200 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1185 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1170 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1155 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1140 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1125 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1110 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1095 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1080 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1065 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1050 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1035 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1020 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1005 MHz</supported_graphics_clock>
                                <supported_graphics_clock>990 MHz</supported_graphics_clock>
                                <supported_graphics_clock>975 MHz</supported_graphics_clock>
                                <supported_graphics_clock>960 MHz</supported_graphics_clock>
                                <supported_graphics_clock>945 MHz</supported_graphics_clock>
                                <supported_graphics_clock>930 MHz</supported_graphics_clock>
                                <supported_graphics_clock>915 MHz</supported_graphics_clock>
                                <supported_graphics_clock>900 MHz</supported_graphics_clock>
                                <supported_graphics_clock>885 MHz</supported_graphics_clock>
                                <supported_graphics_clock>870 MHz</supported_graphics_clock>
                                <supported_graphics_clock>855 MHz</supported_graphics_clock>
                                <supported_graphics_clock>840 MHz</supported_graphics_clock>
                                <supported_graphics_clock>825 MHz</supported_graphics_clock>
                                <supported_graphics_clock>810 MHz</supported_graphics_clock>
                                <supported_graphics_clock>795 MHz</supported_graphics_clock>
                                <supported_graphics_clock>780 MHz</supported_graphics_clock>
                                <supported_graphics_clock>765 MHz</supported_graphics_clock>
                                <supported_graphics_clock>750 MHz</supported_graphics_clock>
                                <supported_graphics_clock>735 MHz</supported_graphics_clock>
                                <supported_graphics_clock>720 MHz</supported_graphics_clock>
                                <supported_graphics_clock>705 MHz</supported_graphics_clock>
                                <supported_graphics_clock>690 MHz</supported_graphics_clock>
                                <supported_graphics_clock>675 MHz</supported_graphics_clock>
                                <supported_graphics_clock>660 MHz</supported_graphics_clock>
                                <supported_graphics_clock>645 MHz</supported_graphics_clock>
                                <supported_graphics_clock>630 MHz</supported_graphics_clock>
                                <supported_graphics_clock>615 MHz</supported_graphics_clock>
                                <supported_graphics_clock>600 MHz</supported_graphics_clock>
                                <supported_graphics_clock>585 MHz</supported_graphics_clock>
                                <supported_graphics_clock>570 MHz</supported_graphics_clock>
                                <supported_graphics_clock>555 MHz</supported_graphics_clock>
                                <supported_graphics_clock>540 MHz</supported_graphics_clock>
                                <supported_graphics_clock>525 MHz</supported_graphics_clock>
                                <supported_graphics_clock>510 MHz</supported_graphics_clock>
                                <supported_graphics_clock>495 MHz</supported_graphics_clock>
                                <supported_graphics_clock>480 MHz</supported_graphics_clock>
                                <supported_graphics_clock>465 MHz</supported_graphics_clock>
                                <supported_graphics_clock>450 MHz</supported_graphics_clock>
                                <supported_graphics_clock>435 MHz</supported_graphics_clock>
                                <supported_graphics_clock>420 MHz</supported_graphics_clock>
                                <supported_graphics_clock>405 MHz</supported_graphics_clock>
                                <supported_graphics_clock>390 MHz</supported_graphics_clock>
                                <supported_graphics_clock>375 MHz</supported_graphics_clock>
                                <supported_graphics_clock>360 MHz</supported_graphics_clock>
                                <supported_graphics_clock>345 MHz</supported_graphics_clock>
                                <supported_graphics_clock>330 MHz</supported_graphics_clock>
                                <supported_graphics_clock>315 MHz</supported_graphics_clock>
                                <supported_graphics_clock>300 MHz</supported_graphics_clock>
                                <supported_graphics_clock>285 MHz</supported_graphics_clock>
                                <supported_graphics_clock>270 MHz</supported_graphics_clock>
                                <supported_graphics_clock>255 MHz</supported_graphics_clock>
                                <supported_graphics_clock>240 MHz</supported_graphics_clock>
                                <supported_graphics_clock>225 MHz</supported_graphics_clock>
                                <supported_graphics_clock>210 MHz</supported_graphics_clock>
                        </supported_mem_clock>
                </supported_clocks>
                <processes>
                </processes>
                <accounted_processes>
                </accounted_processes>
        </gpu>

        <gpu id="00000000:C1:00.0">
                <product_name>NVIDIA A100-SXM4-80GB</product_name>
                <product_brand>NVIDIA</product_brand>
                <product_architecture>Ampere</product_architecture>
                <display_mode>Enabled</display_mode>
                <display_active>Disabled</display_active>
                <persistence_mode>Disabled</persistence_mode>
                <addressing_mode>None</addressing_mode>
                <mig_mode>
                        <current_mig>Enabled</current_mig>
                        <pending_mig>Enabled</pending_mig>
                </mig_mode>
                <mig_devices>
                        None
                </mig_devices>
                <accounting_mode>Disabled</accounting_mode>
                <accounting_mode_buffer_size>4000</accounting_mode_buffer_size>
                <driver_model>
                        <current_dm>N/A</current_dm>
                        <pending_dm>N/A</pending_dm>
                </driver_model>
                <serial>1650522003668</serial>
                <uuid>GPU-69ec2344-ebac-b028-f88f-452abb3d7f11</uuid>
                <minor_number>2</minor_number>
                <vbios_version>92.00.36.00.02</vbios_version>
                <multigpu_board>No</multigpu_board>
                <board_id>0xc100</board_id>
                <board_part_number>692-2G506-0212-002</board_part_number>
                <gpu_part_number>20B2-895-A1</gpu_part_number>
                <gpu_fru_part_number>N/A</gpu_fru_part_number>
                <gpu_module_id>1</gpu_module_id>
                <inforom_version>
                        <img_version>G506.0212.00.01</img_version>
                        <oem_object>2.0</oem_object>
                        <ecc_object>6.16</ecc_object>
                        <pwr_object>N/A</pwr_object>
                </inforom_version>
                <gpu_operation_mode>
                        <current_gom>N/A</current_gom>
                        <pending_gom>N/A</pending_gom>
                </gpu_operation_mode>
                <gsp_firmware_version>535.54.03</gsp_firmware_version>
                <gpu_virtualization_mode>
                        <virtualization_mode>None</virtualization_mode>
                        <host_vgpu_mode>N/A</host_vgpu_mode>
                </gpu_virtualization_mode>
                <gpu_reset_status>
                        <reset_required>No</reset_required>
                        <drain_and_reset_recommended>No</drain_and_reset_recommended>
                </gpu_reset_status>
                <ibmnpu>
                        <relaxed_ordering_mode>N/A</relaxed_ordering_mode>
                </ibmnpu>
                <pci>
                        <pci_bus>C1</pci_bus>
                        <pci_device>00</pci_device>
                        <pci_domain>0000</pci_domain>
                        <pci_device_id>20B210DE</pci_device_id>
                        <pci_bus_id>00000000:C1:00.0</pci_bus_id>
                        <pci_sub_system_id>147F10DE</pci_sub_system_id>
                        <pci_gpu_link_info>
                                <pcie_gen>
                                        <max_link_gen>4</max_link_gen>
                                        <current_link_gen>4</current_link_gen>
                                        <device_current_link_gen>4</device_current_link_gen>
                                        <max_device_link_gen>4</max_device_link_gen>
                                        <max_host_link_gen>4</max_host_link_gen>
                                </pcie_gen>
                                <link_widths>
                                        <max_link_width>16x</max_link_width>
                                        <current_link_width>16x</current_link_width>
                                </link_widths>
                        </pci_gpu_link_info>
                        <pci_bridge_chip>
                                <bridge_chip_type>N/A</bridge_chip_type>
                                <bridge_chip_fw>N/A</bridge_chip_fw>
                        </pci_bridge_chip>
                        <replay_counter>0</replay_counter>
                        <replay_rollover_counter>0</replay_rollover_counter>
                        <tx_util>7000 KB/s</tx_util>
                        <rx_util>0 KB/s</rx_util>
                        <atomic_caps_inbound>N/A</atomic_caps_inbound>
                        <atomic_caps_outbound>N/A</atomic_caps_outbound>
                </pci>
                <fan_speed>N/A</fan_speed>
                <performance_state>P0</performance_state>
                <clocks_event_reasons>
                        <clocks_event_reason_gpu_idle>Not Active</clocks_event_reason_gpu_idle>
                        <clocks_event_reason_applications_clocks_setting>Not Active</clocks_event_reason_applications_clocks_setting>
                        <clocks_event_reason_sw_power_cap>Not Active</clocks_event_reason_sw_power_cap>
                        <clocks_event_reason_hw_slowdown>Not Active</clocks_event_reason_hw_slowdown>
                        <clocks_event_reason_hw_thermal_slowdown>Not Active</clocks_event_reason_hw_thermal_slowdown>
                        <clocks_event_reason_hw_power_brake_slowdown>Not Active</clocks_event_reason_hw_power_brake_slowdown>
                        <clocks_event_reason_sync_boost>Not Active</clocks_event_reason_sync_boost>
                        <clocks_event_reason_sw_thermal_slowdown>Not Active</clocks_event_reason_sw_thermal_slowdown>
                        <clocks_event_reason_display_clocks_setting>Not Active</clocks_event_reason_display_clocks_setting>
                </clocks_event_reasons>
                <fb_memory_usage>
                        <total>81920 MiB</total>
                        <reserved>869 MiB</reserved>
                        <used>0 MiB</used>
                        <free>81050 MiB</free>
                </fb_memory_usage>
                <bar1_memory_usage>
                        <total>131072 MiB</total>
                        <used>1 MiB</used>
                        <free>131071 MiB</free>
                </bar1_memory_usage>
                <cc_protected_memory_usage>
                        <total>0 MiB</total>
                        <used>0 MiB</used>
                        <free>0 MiB</free>
                </cc_protected_memory_usage>
                <compute_mode>Default</compute_mode>
                <utilization>
                        <gpu_util>N/A</gpu_util>
                        <memory_util>N/A</memory_util>
                        <encoder_util>N/A</encoder_util>
                        <decoder_util>N/A</decoder_util>
                        <jpeg_util>N/A</jpeg_util>
                        <ofa_util>N/A</ofa_util>
                </utilization>
                <encoder_stats>
                        <session_count>0</session_count>
                        <average_fps>0</average_fps>
                        <average_latency>0</average_latency>
                </encoder_stats>
                <fbc_stats>
                        <session_count>0</session_count>
                        <average_fps>0</average_fps>
                        <average_latency>0</average_latency>
                </fbc_stats>
                <ecc_mode>
                        <current_ecc>Enabled</current_ecc>
                        <pending_ecc>Enabled</pending_ecc>
                </ecc_mode>
                <ecc_errors>
                        <volatile>
                                <sram_correctable>0</sram_correctable>
                                <sram_uncorrectable>0</sram_uncorrectable>
                                <dram_correctable>0</dram_correctable>
                                <dram_uncorrectable>0</dram_uncorrectable>
                        </volatile>
                        <aggregate>
                                <sram_correctable>0</sram_correctable>
                                <sram_uncorrectable>0</sram_uncorrectable>
                                <dram_correctable>0</dram_correctable>
                                <dram_uncorrectable>0</dram_uncorrectable>
                        </aggregate>
                </ecc_errors>
                <retired_pages>
                        <multiple_single_bit_retirement>
                                <retired_count>N/A</retired_count>
                                <retired_pagelist>N/A</retired_pagelist>
                        </multiple_single_bit_retirement>
                        <double_bit_retirement>
                                <retired_count>N/A</retired_count>
                                <retired_pagelist>N/A</retired_pagelist>
                        </double_bit_retirement>
                        <pending_blacklist>N/A</pending_blacklist>
                        <pending_retirement>N/A</pending_retirement>
                </retired_pages>
                <remapped_rows>
                        <remapped_row_corr>0</remapped_row_corr>
                        <remapped_row_unc>0</remapped_row_unc>
                        <remapped_row_pending>No</remapped_row_pending>
                        <remapped_row_failure>No</remapped_row_failure>
                        <row_remapper_histogram>
                                <row_remapper_histogram_max>640 bank(s)</row_remapper_histogram_max>
                                <row_remapper_histogram_high>0 bank(s)</row_remapper_histogram_high>
                                <row_remapper_histogram_partial>0 bank(s)</row_remapper_histogram_partial>
                                <row_remapper_histogram_low>0 bank(s)</row_remapper_histogram_low>
                                <row_remapper_histogram_none>0 bank(s)</row_remapper_histogram_none>
                        </row_remapper_histogram>
                </remapped_rows>
                <temperature>
                        <gpu_temp>25 C</gpu_temp>
                        <gpu_temp_tlimit>N/A</gpu_temp_tlimit>
                        <gpu_temp_max_threshold>92 C</gpu_temp_max_threshold>
                        <gpu_temp_slow_threshold>89 C</gpu_temp_slow_threshold>
                        <gpu_temp_max_gpu_threshold>85 C</gpu_temp_max_gpu_threshold>
                        <gpu_target_temperature>N/A</gpu_target_temperature>
                        <memory_temp>43 C</memory_temp>
                        <gpu_temp_max_mem_threshold>95 C</gpu_temp_max_mem_threshold>
                </temperature>
                <supported_gpu_target_temp>
                        <gpu_target_temp_min>N/A</gpu_target_temp_min>
                        <gpu_target_temp_max>N/A</gpu_target_temp_max>
                </supported_gpu_target_temp>
                <gpu_power_readings>
                        <power_state>P0</power_state>
                        <power_draw>60.97 W</power_draw>
                        <current_power_limit>500.00 W</current_power_limit>
                        <requested_power_limit>500.00 W</requested_power_limit>
                        <default_power_limit>500.00 W</default_power_limit>
                        <min_power_limit>100.00 W</min_power_limit>
                        <max_power_limit>500.00 W</max_power_limit>
                </gpu_power_readings>
                <module_power_readings>
                        <power_state>P0</power_state>
                        <power_draw>N/A</power_draw>
                        <current_power_limit>N/A</current_power_limit>
                        <requested_power_limit>N/A</requested_power_limit>
                        <default_power_limit>N/A</default_power_limit>
                        <min_power_limit>N/A</min_power_limit>
                        <max_power_limit>N/A</max_power_limit>
                </module_power_readings>
                <clocks>
                        <graphics_clock>1275 MHz</graphics_clock>
                        <sm_clock>1275 MHz</sm_clock>
                        <mem_clock>1593 MHz</mem_clock>
                        <video_clock>1155 MHz</video_clock>
                </clocks>
                <applications_clocks>
                        <graphics_clock>1275 MHz</graphics_clock>
                        <mem_clock>1593 MHz</mem_clock>
                </applications_clocks>
                <default_applications_clocks>
                        <graphics_clock>1275 MHz</graphics_clock>
                        <mem_clock>1593 MHz</mem_clock>
                </default_applications_clocks>
                <deferred_clocks>
                        <mem_clock>N/A</mem_clock>
                </deferred_clocks>
                <max_clocks>
                        <graphics_clock>1410 MHz</graphics_clock>
                        <sm_clock>1410 MHz</sm_clock>
                        <mem_clock>1593 MHz</mem_clock>
                        <video_clock>1290 MHz</video_clock>
                </max_clocks>
                <max_customer_boost_clocks>
                        <graphics_clock>1410 MHz</graphics_clock>
                </max_customer_boost_clocks>
                <clock_policy>
                        <auto_boost>N/A</auto_boost>
                        <auto_boost_default>N/A</auto_boost_default>
                </clock_policy>
                <voltage>
                        <graphics_volt>843.750 mV</graphics_volt>
                </voltage>
                <fabric>
                        <state>N/A</state>
                        <status>N/A</status>
                </fabric>
                <supported_clocks>
                        <supported_mem_clock>
                                <value>1593 MHz</value>
                                <supported_graphics_clock>1410 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1395 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1380 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1365 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1350 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1335 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1320 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1305 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1290 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1275 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1260 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1245 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1230 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1215 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1200 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1185 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1170 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1155 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1140 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1125 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1110 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1095 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1080 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1065 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1050 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1035 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1020 MHz</supported_graphics_clock>
                                <supported_graphics_clock>1005 MHz</supported_graphics_clock>
                                <supported_graphics_clock>990 MHz</supported_graphics_clock>
                                <supported_graphics_clock>975 MHz</supported_graphics_clock>
                                <supported_graphics_clock>960 MHz</supported_graphics_clock>
                                <supported_graphics_clock>945 MHz</supported_graphics_clock>
                                <supported_graphics_clock>930 MHz</supported_graphics_clock>
                                <supported_graphics_clock>915 MHz</supported_graphics_clock>
                                <supported_graphics_clock>900 MHz</supported_graphics_clock>
                                <supported_graphics_clock>885 MHz</supported_graphics_clock>
                                <supported_graphics_clock>870 MHz</supported_graphics_clock>
                                <supported_graphics_clock>855 MHz</supported_graphics_clock>
                                <supported_graphics_clock>840 MHz</supported_graphics_clock>
                                <supported_graphics_clock>825 MHz</supported_graphics_clock>
                                <supported_graphics_clock>810 MHz</supported_graphics_clock>
                                <supported_graphics_clock>795 MHz</supported_graphics_clock>
                                <supported_graphics_clock>780 MHz</supported_graphics_clock>
                                <supported_graphics_clock>765 MHz</supported_graphics_clock>
                                <supported_graphics_clock>750 MHz</supported_graphics_clock>
                                <supported_graphics_clock>735 MHz</supported_graphics_clock>
                                <supported_graphics_clock>720 MHz</supported_graphics_clock>
                                <supported_graphics_clock>705 MHz</supported_graphics_clock>
                                <supported_graphics_clock>690 MHz</supported_graphics_clock>
                                <supported_graphics_clock>675 MHz</supported_graphics_clock>
                                <supported_graphics_clock>660 MHz</supported_graphics_clock>
                                <supported_graphics_clock>645 MHz</supported_graphics_clock>
                                <supported_graphics_clock>630 MHz</supported_graphics_clock>
                                <supported_graphics_clock>615 MHz</supported_graphics_clock>
                                <supported_graphics_clock>600 MHz</supported_graphics_clock>
                                <supported_graphics_clock>585 MHz</supported_graphics_clock>
                                <supported_graphics_clock>570 MHz</supported_graphics_clock>
                                <supported_graphics_clock>555 MHz</supported_graphics_clock>
                                <supported_graphics_clock>540 MHz</supported_graphics_clock>
                                <supported_graphics_clock>525 MHz</supported_graphics_clock>
                                <supported_graphics_clock>510 MHz</supported_graphics_clock>
                                <supported_graphics_clock>495 MHz</supported_graphics_clock>
                                <supported_graphics_clock>480 MHz</supported_graphics_clock>
                                <supported_graphics_clock>465 MHz</supported_graphics_clock>
                                <supported_graphics_clock>450 MHz</supported_graphics_clock>
                                <supported_graphics_clock>435 MHz</supported_graphics_clock>
                                <supported_graphics_clock>420 MHz</supported_graphics_clock>
                                <supported_graphics_clock>405 MHz</supported_graphics_clock>
                                <supported_graphics_clock>390 MHz</supported_graphics_clock>
                                <supported_graphics_clock>375 MHz</supported_graphics_clock>
                                <supported_graphics_clock>360 MHz</supported_graphics_clock>
                                <supported_graphics_clock>345 MHz</supported_graphics_clock>
                                <supported_graphics_clock>330 MHz</supported_graphics_clock>
                                <supported_graphics_clock>315 MHz</supported_graphics_clock>
                                <supported_graphics_clock>300 MHz</supported_graphics_clock>
                                <supported_graphics_clock>285 MHz</supported_graphics_clock>
                                <supported_graphics_clock>270 MHz</supported_graphics_clock>
                                <supported_graphics_clock>255 MHz</supported_graphics_clock>
                                <supported_graphics_clock>240 MHz</supported_graphics_clock>
                                <supported_graphics_clock>225 MHz</supported_graphics_clock>
                                <supported_graphics_clock>210 MHz</supported_graphics_clock>
                        </supported_mem_clock>
                </supported_clocks>
                <processes>
                </processes>
                <accounted_processes>
                </accounted_processes>
        </gpu>
</nvidia_smi_log>
powersj commented 1 year ago

@carlos-encs yes, thank you. As a follow-up question, which metrics are you interested in collecting from that output? Others in this thread have talked about utilization, but those metrics are not there (the utilization fields all report N/A).

carlos-encs commented 1 year ago

@powersj I'm interested in the MIG usage (memory and BAR1) per GPU, and in the usage of each GPU as a whole. Thanks

powersj commented 1 year ago

@carlos-encs - can you try the artifacts in #13733 please?

Thanks

carlos-encs commented 1 year ago

@powersj I replaced my telegraf executable with yours (telegraf-1.28.0/usr/bin/telegraf) and started the instance. It doesn't show any errors. Metrics are sent to InfluxDB, but they contain only GPU information, no MIG information.

powersj commented 1 year ago

@carlos-encs please provide the logs and output the metrics using [[outputs.file]]

carlos-encs commented 1 year ago

@powersj the output is here: metrics-out.txt

Log :

2023-08-09T15:22:00Z I! Loading config: /etc/telegraf/telegraf.conf
2023-08-09T15:22:00Z I! Starting Telegraf 1.28.0-03a43852
2023-08-09T15:22:00Z I! Available plugins: 238 inputs, 9 aggregators, 28 processors, 24 parsers, 59 outputs, 5 secret-stores
2023-08-09T15:22:00Z I! Loaded inputs: cpu disk diskio kernel mem net netstat nvidia_smi processes system
2023-08-09T15:22:00Z I! Loaded aggregators:
2023-08-09T15:22:00Z I! Loaded processors:
2023-08-09T15:22:00Z I! Loaded secretstores:
2023-08-09T15:22:00Z I! Loaded outputs: file influxdb
2023-08-09T15:22:00Z I! Tags enabled: host=speed-41
2023-08-09T15:22:00Z I! [agent] Config: Interval:20s, Quiet:false, Hostname:"speed-41", Flush Interval:10s
2023-08-09T15:22:00Z W! DeprecationWarning: Value "false" for option "ignore_protocol_stats" of plugin "inputs.net" deprecated since version 1.27.3 and will be removed in 1.36.0: use the 'inputs.nstat' plugin instead

INFO: instance started successfully

powersj commented 1 year ago

@carlos-encs is the output of nvidia-smi -q -x any different on the machine where you are running this? I used your own XML file as a test case.

carlos-encs commented 1 year ago

@powersj all the servers have the same config; the only difference is the UUID of the GPUs. Do you think that can affect the results? Just in case, I will run the test on the original server tomorrow.

powersj commented 1 year ago

No, it should not.

What I see from your logs is that there are no nvidia_smi_mig metrics. That metric is only generated if there are gpu.MigDevices.MigDevice entries listed in the XML, and only if values for FbMemoryUsage, Bar1MemoryUsage, or SramUncorrectable are present. Also note that I only added this to the v12 schema, since that is what you provided me with.

I've added some additional debug info to try to help us out. There will be new artifacts in 20-30mins.
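
For readers following along, the condition described above (a MIG metric is only emitted when a MIG device entry exists and carries at least one usable value) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the XML element names (`mig_devices`, `mig_device`, `fb_memory_usage`, `bar1_memory_usage`) and the helper names `parseMiB` and `migFields` are assumptions modeled on typical `nvidia-smi -q -x` output, and the SRAM error counter is left out for brevity.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strconv"
	"strings"
)

// NOTE: the element names in the struct tags are assumptions based on
// typical `nvidia-smi -q -x` output; they are not taken from the PR.
type smiLog struct {
	GPUs []gpu `xml:"gpu"`
}

type gpu struct {
	UUID       string      `xml:"uuid"`
	MigDevices []migDevice `xml:"mig_devices>mig_device"`
}

type memUsage struct {
	Total string `xml:"total"`
	Used  string `xml:"used"`
	Free  string `xml:"free"`
}

type migDevice struct {
	Index           string   `xml:"index"`
	FbMemoryUsage   memUsage `xml:"fb_memory_usage"`
	Bar1MemoryUsage memUsage `xml:"bar1_memory_usage"`
}

// parseMiB converts a value such as "19968 MiB" to 19968.
// It reports false for empty or non-numeric values.
func parseMiB(s string) (int64, bool) {
	parts := strings.Fields(s)
	if len(parts) == 0 {
		return 0, false
	}
	v, err := strconv.ParseInt(parts[0], 10, 64)
	return v, err == nil
}

// migFields (hypothetical helper) returns one field map per MIG device.
// Devices for which no value could be parsed are skipped, so a GPU
// without MIG devices contributes nothing.
func migFields(data []byte) ([]map[string]int64, error) {
	var log smiLog
	if err := xml.Unmarshal(data, &log); err != nil {
		return nil, err
	}
	var out []map[string]int64
	for _, g := range log.GPUs {
		for _, m := range g.MigDevices {
			fields := map[string]int64{}
			set := func(key, raw string) {
				if v, ok := parseMiB(raw); ok {
					fields[key] = v
				}
			}
			set("memory_fb_total", m.FbMemoryUsage.Total)
			set("memory_fb_used", m.FbMemoryUsage.Used)
			set("memory_fb_free", m.FbMemoryUsage.Free)
			set("memory_bar1_total", m.Bar1MemoryUsage.Total)
			set("memory_bar1_used", m.Bar1MemoryUsage.Used)
			set("memory_bar1_free", m.Bar1MemoryUsage.Free)
			if len(fields) > 0 {
				out = append(out, fields)
			}
		}
	}
	return out, nil
}

func main() {
	// One GPU with a MIG device and one GPU without any.
	sample := []byte(`<nvidia_smi_log>
  <gpu><uuid>GPU-a</uuid>
    <mig_devices><mig_device><index>0</index>
      <fb_memory_usage><total>19968 MiB</total><used>12 MiB</used></fb_memory_usage>
    </mig_device></mig_devices>
  </gpu>
  <gpu><uuid>GPU-b</uuid></gpu>
</nvidia_smi_log>`)
	devices, err := migFields(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(devices), devices)
}
```

If the XML on a given host has no MIG device entries at all, a function like this returns an empty slice, which would match the symptom of seeing only GPU-level metrics.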

powersj commented 1 year ago

@carlos-encs,

Have you had a chance to try the PR on the original server?

carlos-encs commented 1 year ago

@powersj the new output file is here: metrics-out2.txt

powersj commented 1 year ago

What changed? They seem to show up now:

nvidia_smi_mig,arch=Ampere,compute_index=0,compute_mode=Default,gpu_index=3,host=speed-37,index=0,name=NVIDIA\ A100-SXM4-80GB,pstate=P0,uuid=GPU-513536b6-7d19-9063-b049-1e69664bb298 memory_bar1_free=32767i,sram_uncorrectable=0i,memory_fb_total=19968i,memory_fb_reserved=0i,memory_fb_used=12i,memory_fb_free=19955i,memory_bar1_total=32767i,memory_bar1_used=0i 1692211073000000000
nvidia_smi_mig,arch=Ampere,compute_index=0,compute_mode=Default,gpu_index=4,host=speed-37,index=1,name=NVIDIA\ A100-SXM4-80GB,pstate=P0,uuid=GPU-513536b6-7d19-9063-b049-1e69664bb298 memory_fb_total=19968i,memory_fb_reserved=0i,memory_fb_used=12i,memory_fb_free=19955i,memory_bar1_total=32767i,memory_bar1_used=0i,memory_bar1_free=32767i,sram_uncorrectable=0i 1692211073000000000
nvidia_smi_mig,arch=Ampere,compute_index=0,compute_mode=Default,gpu_index=5,host=speed-37,index=2,name=NVIDIA\ A100-SXM4-80GB,pstate=P0,uuid=GPU-513536b6-7d19-9063-b049-1e69664bb298 memory_bar1_used=0i,memory_bar1_free=32767i,sram_uncorrectable=0i,memory_fb_total=19968i,memory_fb_reserved=0i,memory_fb_used=12i,memory_fb_free=19955i,memory_bar1_total=32767i 1692211073000000000
nvidia_smi_mig,arch=Ampere,compute_index=0,compute_mode=Default,gpu_index=6,host=speed-37,index=3,name=NVIDIA\ A100-SXM4-80GB,pstate=P0,uuid=GPU-513536b6-7d19-9063-b049-1e69664bb298 memory_bar1_total=32767i,memory_bar1_used=0i,memory_bar1_free=32767i,sram_uncorrectable=0i,memory_fb_total=19968i,memory_fb_reserved=0i,memory_fb_used=12i,memory_fb_free=19955i 1692211073000000000

Do these metrics look ideal?
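
As a rough way to sanity-check values like the ones above, the field set of one nvidia_smi_mig line can be turned into a per-instance usage percentage. The sketch below is only an illustration: `parseIntFields` and `usedPercent` are hypothetical helpers, and the parser handles just the integer-only field set (the part between the tags and the timestamp), not full line protocol with escaping.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIntFields (hypothetical helper) parses a line-protocol field set
// that contains only integer fields, e.g.
// "memory_fb_used=12i,memory_fb_total=19968i".
func parseIntFields(fieldSet string) (map[string]int64, error) {
	fields := make(map[string]int64)
	for _, pair := range strings.Split(fieldSet, ",") {
		key, raw, ok := strings.Cut(pair, "=")
		if !ok {
			return nil, fmt.Errorf("malformed field %q", pair)
		}
		// Integer values carry a trailing "i" in line protocol.
		v, err := strconv.ParseInt(strings.TrimSuffix(raw, "i"), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("field %q: %w", key, err)
		}
		fields[key] = v
	}
	return fields, nil
}

// usedPercent returns used/total as a percentage. It reports false when
// either key is missing or the total is zero.
func usedPercent(fields map[string]int64, usedKey, totalKey string) (float64, bool) {
	total, ok := fields[totalKey]
	if !ok || total == 0 {
		return 0, false
	}
	used, ok := fields[usedKey]
	if !ok {
		return 0, false
	}
	return 100 * float64(used) / float64(total), true
}

func main() {
	// Field set copied from one of the nvidia_smi_mig lines above.
	fieldSet := "memory_fb_total=19968i,memory_fb_reserved=0i,memory_fb_used=12i," +
		"memory_fb_free=19955i,memory_bar1_total=32767i,memory_bar1_used=0i," +
		"memory_bar1_free=32767i,sram_uncorrectable=0i"
	fields, err := parseIntFields(fieldSet)
	if err != nil {
		panic(err)
	}
	if pct, ok := usedPercent(fields, "memory_fb_used", "memory_fb_total"); ok {
		fmt.Printf("framebuffer used: %.2f%%\n", pct)
	}
	if pct, ok := usedPercent(fields, "memory_bar1_used", "memory_bar1_total"); ok {
		fmt.Printf("BAR1 used: %.2f%%\n", pct)
	}
}
```

In practice the same ratio would more likely be computed in the query layer of the metrics backend; the code is only meant to show which fields pair up.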

carlos-encs commented 1 year ago

@powersj the metrics look fine.

Thanks