influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

[input.nvidia_smi] Power monitoring not working as of 1.28.0 #13954

Closed mbentley closed 1 year ago

mbentley commented 1 year ago

Relevant telegraf.conf

[[inputs.nvidia_smi]]
  ## Optional: path to nvidia-smi binary, defaults to $PATH via exec.LookPath
  bin_path = "/usr/bin/nvidia-smi"

  ## Optional: timeout for GPU polling
  timeout = "5s"

Logs from Telegraf

# with version 1.27.4-1
# telegraf --debug --config /etc/telegraf/telegraf.conf --test --input-filter nvidia_smi
2023-09-20T13:59:21Z I! Loading config: /etc/telegraf/telegraf.conf
2023-09-20T13:59:21Z I! Starting Telegraf 1.27.4
2023-09-20T13:59:21Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores
2023-09-20T13:59:21Z I! Loaded inputs: nvidia_smi
2023-09-20T13:59:21Z I! Loaded aggregators:
2023-09-20T13:59:21Z I! Loaded processors:
2023-09-20T13:59:21Z I! Loaded secretstores:
2023-09-20T13:59:21Z W! Outputs are not used in testing mode!
2023-09-20T13:59:21Z I! Tags enabled: host=athena
2023-09-20T13:59:21Z D! [agent] Initializing plugins
2023-09-20T13:59:21Z D! [agent] Starting service inputs
2023-09-20T13:59:21Z D! [agent] Stopping service inputs
2023-09-20T13:59:21Z D! [agent] Input channel closed
2023-09-20T13:59:21Z D! [agent] Stopped Successfully
> nvidia_smi,compute_mode=Default,host=athena,index=0,name=Quadro\ P2000,pstate=P8,uuid=GPU-396caaed-39ca-3199-2e68-717cdb786ec6 clocks_current_graphics=139i,clocks_current_memory=405i,clocks_current_sm=139i,clocks_current_video=544i,cuda_version="12.0",driver_version="525.125.06",encoder_stats_average_fps=0i,encoder_stats_average_latency=0i,encoder_stats_session_count=0i,fan_speed=46i,fbc_stats_average_fps=0i,fbc_stats_average_latency=0i,fbc_stats_session_count=0i,memory_free=5051i,memory_reserved=66i,memory_total=5120i,memory_used=1i,pcie_link_gen_current=1i,pcie_link_width_current=8i,power_draw=4.6,temperature_gpu=33i,utilization_decoder=0i,utilization_encoder=0i,utilization_gpu=0i,utilization_memory=0i 1695218362000000000

# with version 1.28.1-1
# telegraf --debug --config /etc/telegraf/telegraf.conf --test --input-filter nvidia_smi
2023-09-20T14:00:16Z I! Loading config: /etc/telegraf/telegraf.conf
2023-09-20T14:00:16Z I! Starting Telegraf 1.28.1 brought to you by InfluxData the makers of InfluxDB
2023-09-20T14:00:16Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 5 secret-stores
2023-09-20T14:00:16Z I! Loaded inputs: nvidia_smi
2023-09-20T14:00:16Z I! Loaded aggregators:
2023-09-20T14:00:16Z I! Loaded processors:
2023-09-20T14:00:16Z I! Loaded secretstores:
2023-09-20T14:00:16Z W! Outputs are not used in testing mode!
2023-09-20T14:00:16Z I! Tags enabled: host=athena
2023-09-20T14:00:16Z D! [agent] Initializing plugins
2023-09-20T14:00:16Z D! [agent] Starting service inputs
2023-09-20T14:00:16Z D! [inputs.nvidia_smi] Using schema version in v12
2023-09-20T14:00:16Z D! [agent] Stopping service inputs
2023-09-20T14:00:16Z D! [agent] Input channel closed
2023-09-20T14:00:16Z D! [agent] Stopped Successfully
> nvidia_smi,arch=Pascal,compute_mode=Default,host=athena,index=0,name=Quadro\ P2000,pstate=P8,uuid=GPU-396caaed-39ca-3199-2e68-717cdb786ec6 clocks_current_graphics=139i,clocks_current_memory=405i,clocks_current_sm=139i,clocks_current_video=544i,cuda_version="12.0",display_active="Disabled",display_mode="Disabled",driver_version="525.125.06",encoder_stats_average_fps=0i,encoder_stats_average_latency=0i,encoder_stats_session_count=0i,fan_speed=46i,fbc_stats_average_fps=0i,fbc_stats_average_latency=0i,fbc_stats_session_count=0i,memory_free=5051i,memory_reserved=66i,memory_total=5120i,memory_used=1i,pcie_link_gen_current=1i,pcie_link_width_current=8i,serial="0322218049033",temperature_gpu=33i,utilization_decoder=0i,utilization_encoder=0i,utilization_gpu=0i,utilization_memory=0i,vbios_version="86.06.3F.00.30" 1695218416000000000

System info

Telegraf 1.28.1-1, Debian 12 (bookworm), nvidia-driver & nvidia-smi 525.125.06-1~deb12u1

Docker

No response

Steps to reproduce

  1. Install telegraf 1.27.4 (apt install telegraf=1.27.4-1 -y --allow-downgrades)
  2. See if power_draw is present in the output (it should be): telegraf --config /etc/telegraf/telegraf.conf --test --input-filter nvidia_smi | grep power_draw
  3. Upgrade telegraf to 1.28.1 (apt install telegraf=1.28.1-1 -y)
  4. See if power_draw is present in the output (it isn't): telegraf --config /etc/telegraf/telegraf.conf --test --input-filter nvidia_smi | grep power_draw ...
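
The grep check in the steps above can also be scripted. The following is a minimal sketch, not part of Telegraf: `split_unescaped` and `field_keys` are hypothetical helper names, the two sample lines are abbreviated from the outputs in the logs above, and the parser only handles the subset of line protocol that appears in this issue (backslash escapes and double-quoted string values).

```python
def split_unescaped(text, sep):
    """Split on sep, skipping separators that are backslash-escaped
    or that sit inside a double-quoted string value."""
    parts, buf, in_quotes, i = [], [], False, 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Keep the escape sequence together, e.g. "Quadro\ P2000".
            buf.append(text[i:i + 2])
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    parts.append("".join(buf))
    return parts


def field_keys(line):
    """Return the set of field names in one line of `telegraf --test` output."""
    line = line.strip()
    if line.startswith("> "):
        line = line[2:]  # drop the prefix that --test prints before each metric
    # Sections are: measurement plus tags, fields, timestamp.
    sections = split_unescaped(line, " ")
    return {split_unescaped(pair, "=")[0]
            for pair in split_unescaped(sections[1], ",")}


# Abbreviated copies of the 1.27.4 and 1.28.1 metrics shown in the logs above.
LINE_1_27_4 = ('> nvidia_smi,host=athena,index=0,name=Quadro\\ P2000 '
               'fan_speed=46i,power_draw=4.6,temperature_gpu=33i 1695218362000000000')
LINE_1_28_1 = ('> nvidia_smi,arch=Pascal,host=athena,index=0,name=Quadro\\ P2000 '
               'fan_speed=46i,serial="0322218049033",temperature_gpu=33i 1695218416000000000')

print("1.27.4 has power_draw:", "power_draw" in field_keys(LINE_1_27_4))
print("1.28.1 has power_draw:", "power_draw" in field_keys(LINE_1_28_1))
```

Comparing field names rather than grepping the whole line avoids a false match if the string ever shows up in a tag or a string value.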

Expected behavior

I would expect power_draw to be there.

Actual behavior

It's not there.

Additional info

I have an NVIDIA Quadro P2000 in my Linux box running Debian 12 (bookworm) with nvidia-driver & nvidia-smi installed from the Debian repos (package versions are both 525.125.06-1~deb12u1). I have telegraf 1.28.1-1 installed as well, and I am no longer getting power_draw in the telegraf output.

On the 12th, telegraf was updated to 1.28.0-1 from 1.27.4-1. If I roll back to version 1.27.4-1, I get power_draw metrics back.

Here is the output from nvidia-smi -x -q: https://gist.github.com/mbentley/3f0929563e4b4ecf0dde9ff30cd6dd1b

Looks like the doctype shows <!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v12.dtd"> but I have power_readings and not the two new blocks of gpu_power_readings and module_power_readings.

Snippet from the gist above:

<?xml version="1.0" ?>
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v12.dtd">
...
        <power_readings>
            <power_state>P8</power_state>
            <power_management>Supported</power_management>
            <power_draw>4.61 W</power_draw>
            <power_limit>75.00 W</power_limit>
            <default_power_limit>75.00 W</default_power_limit>
            <enforced_power_limit>75.00 W</enforced_power_limit>
            <min_power_limit>75.00 W</min_power_limit>
            <max_power_limit>75.00 W</max_power_limit>
        </power_readings>
...

I am not sure if this is an issue with nvidia-smi or telegraf so my apologies if it's not a telegraf issue. May be relevant to https://github.com/influxdata/telegraf/issues/13653 / https://github.com/influxdata/telegraf/pull/13678

powersj commented 1 year ago

> I am not sure if this is an issue with nvidia-smi or telegraf so my apologies if it's not a telegraf issue.

100% our issue. The schema version PR changed how your data was parsed. Even though you are using the v12 schema, it appears that the power_readings section can still exist and was not entirely replaced by the newer gpu_power_readings and module_power_readings sections.
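
To illustrate the idea of accepting either section name, here is a minimal Python sketch. It is not the code from the fix and not how the Go plugin is written. The `<nvidia_smi_log>`/`<gpu>` wrapper elements in the sample are an assumption about the parts of the gist elided above, `power_draw_watts` is a hypothetical helper name, and it is also assumed that the newer gpu_power_readings block carries a power_draw element in the same "value W" format.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modelled on the gist snippet; the wrapper elements are assumed.
SAMPLE = """<?xml version="1.0" ?>
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v12.dtd">
<nvidia_smi_log>
  <gpu>
    <power_readings>
      <power_state>P8</power_state>
      <power_draw>4.61 W</power_draw>
    </power_readings>
  </gpu>
</nvidia_smi_log>
"""

# Newer section name first, then the older one that v12 output can still contain.
POWER_SECTIONS = ("gpu_power_readings", "power_readings")


def power_draw_watts(xml_text):
    """Return the first parseable power_draw value in watts, or None."""
    root = ET.fromstring(xml_text)
    for section_name in POWER_SECTIONS:
        for section in root.iter(section_name):
            raw = section.findtext("power_draw")
            if raw is None:
                continue
            parts = raw.split()  # "4.61 W" -> ["4.61", "W"]
            if not parts:
                continue
            try:
                return float(parts[0])
            except ValueError:
                continue  # non-numeric value such as "N/A"
    return None


print(power_draw_watts(SAMPLE))
```

The sketch returns only the first value it finds, so a multi-GPU host would need the lookup done per `<gpu>` element instead.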

I have put up #13962, which, once tests pass, will have artifacts attached to it via a comment from the "Telegraf Tiger Bot" (or similar). Could you download one of the artifacts and verify that the power draw field returns?

Thanks!

mbentley commented 1 year ago

On 1.27.4:

# telegraf --config /etc/telegraf/telegraf.conf --test --input-filter nvidia_smi | grep power_draw
2023-09-20T20:29:13Z I! Loading config: /etc/telegraf/telegraf.conf
2023-09-20T20:29:13Z I! Starting Telegraf 1.27.4
2023-09-20T20:29:13Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores
2023-09-20T20:29:13Z I! Loaded inputs: nvidia_smi
2023-09-20T20:29:13Z I! Loaded aggregators:
2023-09-20T20:29:13Z I! Loaded processors:
2023-09-20T20:29:13Z I! Loaded secretstores:
2023-09-20T20:29:13Z W! Outputs are not used in testing mode!
2023-09-20T20:29:13Z I! Tags enabled: host=athena
> nvidia_smi,compute_mode=Default,host=athena,index=0,name=Quadro\ P2000,pstate=P8,uuid=GPU-396caaed-39ca-3199-2e68-717cdb786ec6 clocks_current_graphics=139i,clocks_current_memory=405i,clocks_current_sm=139i,clocks_current_video=544i,cuda_version="12.0",driver_version="525.125.06",encoder_stats_average_fps=0i,encoder_stats_average_latency=0i,encoder_stats_session_count=0i,fan_speed=45i,fbc_stats_average_fps=0i,fbc_stats_average_latency=0i,fbc_stats_session_count=0i,memory_free=5049i,memory_reserved=66i,memory_total=5120i,memory_used=4i,pcie_link_gen_current=1i,pcie_link_width_current=8i,power_draw=4.69,temperature_gpu=32i,utilization_decoder=0i,utilization_encoder=0i,utilization_gpu=0i,utilization_memory=0i 1695241754000000000

With the PR artifact:

# telegraf --config /etc/telegraf/telegraf.conf --test --input-filter nvidia_smi | grep power_draw
2023-09-20T20:29:30Z I! Loading config: /etc/telegraf/telegraf.conf
2023-09-20T20:29:30Z I! Starting Telegraf 1.29.0-fbba2931 brought to you by InfluxData the makers of InfluxDB
2023-09-20T20:29:30Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 5 secret-stores
2023-09-20T20:29:30Z I! Loaded inputs: nvidia_smi
2023-09-20T20:29:30Z I! Loaded aggregators:
2023-09-20T20:29:30Z I! Loaded processors:
2023-09-20T20:29:30Z I! Loaded secretstores:
2023-09-20T20:29:30Z W! Outputs are not used in testing mode!
2023-09-20T20:29:30Z I! Tags enabled: host=athena
> nvidia_smi,arch=Pascal,compute_mode=Default,host=athena,index=0,name=Quadro\ P2000,pstate=P8,uuid=GPU-396caaed-39ca-3199-2e68-717cdb786ec6 clocks_current_graphics=139i,clocks_current_memory=405i,clocks_current_sm=139i,clocks_current_video=544i,cuda_version="12.0",display_active="Disabled",display_mode="Disabled",driver_version="525.125.06",encoder_stats_average_fps=0i,encoder_stats_average_latency=0i,encoder_stats_session_count=0i,fan_speed=45i,fbc_stats_average_fps=0i,fbc_stats_average_latency=0i,fbc_stats_session_count=0i,memory_free=5049i,memory_reserved=66i,memory_total=5120i,memory_used=4i,pcie_link_gen_current=1i,pcie_link_width_current=8i,power_draw=4.59,serial="0322218049033",temperature_gpu=32i,utilization_decoder=0i,utilization_encoder=0i,utilization_gpu=0i,utilization_memory=0i,vbios_version="86.06.3F.00.30" 1695241770000000000

Looks like it's returning as expected in the PR - thanks!