gpuopenanalytics / pynvml

Provide Python access to the NVML library for GPU diagnostics
BSD 3-Clause "New" or "Revised" License

What information is valuable? #3

Closed mrocklin closed 3 years ago

mrocklin commented 5 years ago

Here is what the nvidia-smi API in this library produces for a single GPU. Which of this information is useful, and what might we want to build dashboards from?

In [1]: import nvidia_smi

In [2]: nvsmi = nvidia_smi.nvidia_smi.getInstance()

In [3]: nvsmi.DeviceQuery()['gpu'][0]
Out[3]:
{'id': '0000:06:00.0',
 'product_name': 'Tesla V100-SXM2-32GB',
 'product_brand': 'Tesla',
 'display_mode': 'Enabled',
 'display_active': 'Disabled',
 'persistence_mode': 'Enabled',
 'accounting_mode': 'Disabled',
 'accounting_mode_buffer_size': '4000',
 'driver_model': {'current_dm': 'N/A', 'pending_dm': 'N/A'},
 'serial': '0321918171737',
 'uuid': 'GPU-96ab329d-7a1f-73a8-a9b7-18b4b2855f92',
 'minor_number': '0',
 'vbios_version': '88.00.43.00.04',
 'multigpu_board': 'No',
 'board_id': '0x600',
 'inforom_version': {'img_version': 'G503.0204.00.02',
  'oem_object': '1.1',
  'ecc_object': '5.0',
  'pwr_object': 'N/A'},
 'gpu_operation_mode': {'current_gom': 'N/A', 'pending_gom': 'N/A'},
 'pci': {'pci_bus': '06',
  'pci_device': '00',
  'pci_domain': '0000',
  'pci_device_id': '1DB510DE',
  'pci_bus_id': '0000:06:00.0',
  'pci_sub_system_id': '124910DE',
  'pci_gpu_link_info': {'pcie_gen': {'max_link_gen': '3',
    'current_link_gen': '3'},
   'link_widths': {'max_link_width': '16x', 'current_link_width': '16x'}},
  'pci_bridge_chip': {'bridge_chip_type': 'N/A', 'bridge_chip_fw': 'N/A'},
  'replay_counter': '0',
  'tx_util': 0,
  'tx_util_unit': 'KB/s',
  'rx_util': 0,
  'rx_util_unit': 'KB/s'},
 'fan_speed': 'N/A',
 'fan_speed_unit': '%',
 'performance_state': 'P0',
 'clocks_throttle': {'clocks_throttle_reason_gpu_idle': 'Active',
  'clocks_throttle_reason_applications_clocks_setting': 'Not Active',
  'clocks_throttle_reason_sw_power_cap': 'Not Active',
  'clocks_throttle_reason_hw_slowdown': 'Not Active',
  'clocks_throttle_reason_unknown': 'N/A'},
 'fb_memory_usage': {'total': 32510.5,
  'used': 0.0,
  'free': 32510.5,
  'unit': 'MiB'},
 'bar1_memory_usage': {'total': 32768.0,
  'used': 2.50390625,
  'free': 32765.49609375,
  'unit': 'MiB'},
 'compute_mode': 'Default',
 'utilization': {'gpu_util': 0,
  'memory_util': 0,
  'encoder_util': 0,
  'decoder_util': 0,
  'unit': '%'},
 'ecc_mode': {'current_ecc': 'Enabled', 'pending_ecc': 'Enabled'},
 'ecc_errors': {'volatile': {'single_bit': {'device_memory': 0,
    'register_file': 0,
    'l1_cache': 0,
    'l2_cache': 0,
    'texture_memory': 'N/A',
    'total': '0'},
   'double_bit': {'device_memory': 0,
    'register_file': 0,
    'l1_cache': 0,
    'l2_cache': 0,
    'texture_memory': 'N/A',
    'total': '0'}},
  'aggregate': {'single_bit': {'device_memory': 0,
    'register_file': 0,
    'l1_cache': 0,
    'l2_cache': 0,
    'texture_memory': 'N/A',
    'total': '0'},
   'double_bit': {'device_memory': 0,
    'register_file': 0,
    'l1_cache': 0,
    'l2_cache': 0,
    'texture_memory': 'N/A',
    'total': '0'}}},
 'retired_pages': {'multiple_single_bit_retirement': None,
  'double_bit_retirement': None,
  'pending_retirement': 'No'},
 'temperature': {'gpu_temp': 31,
  'gpu_temp_max_threshold': 90,
  'gpu_temp_slow_threshold': 87,
  'unit': 'C'},
 'power_readings': {'power_management': 'Supported',
  'power_draw': 43.235,
  'power_limit': 300.0,
  'default_power_limit': 300.0,
  'enforced_power_limit': 300.0,
  'min_power_limit': 150.0,
  'max_power_limit': 300.0,
  'power_state': 'P0',
  'unit': 'W'},
 'clocks': {'graphics_clock': 135,
  'sm_clock': 135,
  'mem_clock': 877,
  'unit': 'MHz'},
 'applications_clocks': {'graphics_clock': 1290,
  'mem_clock': 877,
  'unit': 'MHz'},
 'default_applications_clocks': {'graphics_clock': 1290,
  'mem_clock': 877,
  'unit': 'MHz'},
 'max_clocks': {'graphics_clock': 1530,
  'sm_clock': 1530,
  'mem_clock': 877,
  'unit': 'MHz'},
 'clock_policy': {'auto_boost': 'N/A', 'auto_boost_default': 'N/A'},
 'supported_clocks': [{'current': 877,
   'unit': 'MHz',
   'supported_graphics_clock': [1530,
    1522,
    1515,
    1507,
    1500,
    1492,
    1485,
    1477,
    1470,
    1462,
    1455,
    1447,
    1440,
    1432,
    1425,
    1417,
    1410,
    1402,
    1395,
    1387,
    1380,
    1372,
    1365,
    1357,
    1350,
    1342,
    1335,
    1327,
    1320,
    1312,
    1305,
    1297,
    1290,
    1282,
    1275,
    1267,
    1260,
    1252,
    1245,
    1237,
    1230,
    1222,
    1215,
    1207,
    1200,
    1192,
    1185,
    1177,
    1170,
    1162,
    1155,
    1147,
    1140,
    1132,
    1125,
    1117,
    1110,
    1102,
    1095,
    1087,
    1080,
    1072,
    1065,
    1057,
    1050,
    1042,
    1035,
    1027,
    1020,
    1012,
    1005,
    997,
    990,
    982,
    975,
    967,
    960,
    952,
    945,
    937,
    930,
    922,
    915,
    907,
    900,
    892,
    885,
    877,
    870,
    862,
    855,
    847,
    840,
    832,
    825,
    817,
    810,
    802,
    795,
    787,
    780,
    772,
    765,
    757,
    750,
    742,
    735,
    727,
    720,
    712,
    705,
    697,
    690,
    682,
    675,
    667,
    660,
    652,
    645,
    637,
    630,
    622,
    615,
    607,
    600,
    592,
    585,
    577,
    570,
    562,
    555,
    547,
    540,
    532,
    525,
    517,
    510,
    502,
    495,
    487,
    480,
    472,
    465,
    457,
    450,
    442,
    435,
    427,
    420,
    412,
    405,
    397,
    390,
    382,
    375,
    367,
    360,
    352,
    345,
    337,
    330,
    322,
    315,
    307,
    300,
    292,
    285,
    277,
    270,
    262,
    255,
    247,
    240,
    232,
    225,
    217,
    210,
    202,
    195,
    187,
    180,
    172,
    165,
    157,
    150,
    142,
    135]}],
 'processes': None,
 'accounted_processes': None}

cc @seibert @kkraus14 @sklam @randerzander

mrocklin commented 5 years ago

Some useful nvidia-smi queries from the web

seibert commented 5 years ago

The main things I want to see dashboarded over time are the utilization % numbers: gpu_util, memory_util, encoder_util, and decoder_util.

If they mean what I think they do, the memory transfer counters (tx_util and rx_util) would also be super useful.

gpu_temp might be interesting for detecting hardware issues if you are working with less well-engineered servers and workstations. Power consumption is entertaining, but not really important for understanding performance.

mrocklin commented 5 years ago

Thanks @seibert! I missed the tx_util and rx_util values entirely in there. Thanks for highlighting them.
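For the "over time" part, one possible shape is a small polling loop that keeps a bounded history of tx_util, rx_util, and gpu_util. This is only a sketch: collect, fake_query, and the tuple layout are hypothetical names chosen for illustration, and the query is passed in as a callable so the loop can be tried without a GPU.

```python
import time
from collections import deque


def collect(query, n_samples, interval=1.0, maxlen=300, sleep=time.sleep):
    """Call query() n_samples times and keep the most recent readings.

    query must return a dict shaped like one DeviceQuery()['gpu'] entry.
    """
    history = deque(maxlen=maxlen)  # old samples fall off the left end
    for i in range(n_samples):
        gpu = query()
        history.append((
            time.time(),
            gpu["pci"]["tx_util"],
            gpu["pci"]["rx_util"],
            gpu["utilization"]["gpu_util"],
        ))
        if i < n_samples - 1:
            sleep(interval)  # no need to wait after the last sample
    return history


def fake_query():
    # Stand-in for a real query, using values from the output in this thread
    return {
        "pci": {"tx_util": 0, "rx_util": 0},
        "utilization": {"gpu_util": 0},
    }


# sleep is replaced with a no-op so the demonstration finishes immediately
history = collect(fake_query, n_samples=3, sleep=lambda seconds: None)
for timestamp, tx, rx, gpu_util in history:
    print(timestamp, tx, rx, gpu_util)
```

With real hardware, the callable could be lambda: nvsmi.DeviceQuery()['gpu'][0], using the nvsmi instance from the first comment. A dashboard would then plot the deque contents rather than print them.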