Closed mbentley closed 1 year ago
I am not sure if this is an issue with nvidia-smi or telegraf so my apologies if it's not a telegraf issue.
100% our issue. The schema version PR changed how your data was parsed. Even though you are using the v12 schema, it appears that the power_readings section can still exist and was not entirely replaced with the newer module_power_readings section.
I have put up #13962, which, once tests pass, will have artifacts attached to it via a comment from the "Telegraf Tiger Bot" (or similar). Could you download one of the artifacts and verify that the power draw field returns?
Thanks!
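As a rough illustration of the fallback described above, the sketch below reads power_draw from a newer power section when one exists and otherwise from the legacy power_readings section. It is written in Python for brevity and is not Telegraf's actual parser. The function name find_power_draw and the trimmed XML samples are hypothetical, and it is an assumption that gpu_power_readings is the newer block carrying power_draw.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-ins for `nvidia-smi -x -q` output, trimmed to one GPU.
LEGACY_XML = """<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <power_readings>
      <power_draw>4.69 W</power_draw>
    </power_readings>
  </gpu>
</nvidia_smi_log>"""

NEWER_XML = """<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <gpu_power_readings>
      <power_draw>5.12 W</power_draw>
    </gpu_power_readings>
  </gpu>
</nvidia_smi_log>"""


def find_power_draw(xml_text):
    """Return the first GPU's power draw in watts, or None if unavailable."""
    root = ET.fromstring(xml_text)
    gpu = root.find("gpu")
    if gpu is None:
        return None
    # Prefer the newer section (assumed name), then fall back to the legacy one.
    for section in ("gpu_power_readings", "power_readings"):
        node = gpu.find(f"{section}/power_draw")
        if node is None or not node.text:
            continue
        parts = node.text.split()
        if not parts:
            continue
        try:
            return float(parts[0])  # "4.69 W" -> 4.69
        except ValueError:
            continue  # e.g. "N/A": try the next section
    return None


if __name__ == "__main__":
    print(find_power_draw(LEGACY_XML))
    print(find_power_draw(NEWER_XML))
```

Checking both sections, rather than choosing one from the DTD version alone, covers the case reported here, where a v12 document still carries only power_readings.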
On 1.27.4:
# telegraf --config /etc/telegraf/telegraf.conf --test --input-filter nvidia_smi | grep power_draw
2023-09-20T20:29:13Z I! Loading config: /etc/telegraf/telegraf.conf
2023-09-20T20:29:13Z I! Starting Telegraf 1.27.4
2023-09-20T20:29:13Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores
2023-09-20T20:29:13Z I! Loaded inputs: nvidia_smi
2023-09-20T20:29:13Z I! Loaded aggregators:
2023-09-20T20:29:13Z I! Loaded processors:
2023-09-20T20:29:13Z I! Loaded secretstores:
2023-09-20T20:29:13Z W! Outputs are not used in testing mode!
2023-09-20T20:29:13Z I! Tags enabled: host=athena
> nvidia_smi,compute_mode=Default,host=athena,index=0,name=Quadro\ P2000,pstate=P8,uuid=GPU-396caaed-39ca-3199-2e68-717cdb786ec6 clocks_current_graphics=139i,clocks_current_memory=405i,clocks_current_sm=139i,clocks_current_video=544i,cuda_version="12.0",driver_version="525.125.06",encoder_stats_average_fps=0i,encoder_stats_average_latency=0i,encoder_stats_session_count=0i,fan_speed=45i,fbc_stats_average_fps=0i,fbc_stats_average_latency=0i,fbc_stats_session_count=0i,memory_free=5049i,memory_reserved=66i,memory_total=5120i,memory_used=4i,pcie_link_gen_current=1i,pcie_link_width_current=8i,power_draw=4.69,temperature_gpu=32i,utilization_decoder=0i,utilization_encoder=0i,utilization_gpu=0i,utilization_memory=0i 1695241754000000000
With the PR artifact:
# telegraf --config /etc/telegraf/telegraf.conf --test --input-filter nvidia_smi | grep power_draw
2023-09-20T20:29:30Z I! Loading config: /etc/telegraf/telegraf.conf
2023-09-20T20:29:30Z I! Starting Telegraf 1.29.0-fbba2931 brought to you by InfluxData the makers of InfluxDB
2023-09-20T20:29:30Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 5 secret-stores
2023-09-20T20:29:30Z I! Loaded inputs: nvidia_smi
2023-09-20T20:29:30Z I! Loaded aggregators:
2023-09-20T20:29:30Z I! Loaded processors:
2023-09-20T20:29:30Z I! Loaded secretstores:
2023-09-20T20:29:30Z W! Outputs are not used in testing mode!
2023-09-20T20:29:30Z I! Tags enabled: host=athena
> nvidia_smi,arch=Pascal,compute_mode=Default,host=athena,index=0,name=Quadro\ P2000,pstate=P8,uuid=GPU-396caaed-39ca-3199-2e68-717cdb786ec6 clocks_current_graphics=139i,clocks_current_memory=405i,clocks_current_sm=139i,clocks_current_video=544i,cuda_version="12.0",display_active="Disabled",display_mode="Disabled",driver_version="525.125.06",encoder_stats_average_fps=0i,encoder_stats_average_latency=0i,encoder_stats_session_count=0i,fan_speed=45i,fbc_stats_average_fps=0i,fbc_stats_average_latency=0i,fbc_stats_session_count=0i,memory_free=5049i,memory_reserved=66i,memory_total=5120i,memory_used=4i,pcie_link_gen_current=1i,pcie_link_width_current=8i,power_draw=4.59,serial="0322218049033",temperature_gpu=32i,utilization_decoder=0i,utilization_encoder=0i,utilization_gpu=0i,utilization_memory=0i,vbios_version="86.06.3F.00.30" 1695241770000000000
Looks like it's returning as expected in the PR - thanks!
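For anyone who wants to script the same check instead of piping through grep, here is a minimal sketch that lists the field keys in one line of the --test output. The helper name field_keys is hypothetical, and the parsing is deliberately simplified: it assumes no spaces or commas inside quoted string field values, which holds for the output shown above but not for line protocol in general.

```python
import re


def field_keys(line):
    """Return the set of field keys in one line-protocol line (simplified)."""
    # Drop the "> " prefix that `telegraf --test` puts before each metric.
    line = line.lstrip("> ").strip()
    # Split on spaces that are not backslash-escaped:
    # [measurement+tags, fields, timestamp]
    parts = re.split(r"(?<!\\) ", line)
    if len(parts) < 2:
        return set()
    keys = set()
    for pair in parts[1].split(","):
        key, sep, _value = pair.partition("=")
        if sep:
            keys.add(key)
    return keys


if __name__ == "__main__":
    sample = (
        r"> nvidia_smi,host=athena,name=Quadro\ P2000 "
        r"fan_speed=45i,power_draw=4.69 1695241754000000000"
    )
    print("power_draw" in field_keys(sample))
```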
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.28.1-1, Debian 12 (bookworm), nvidia-driver & nvidia-smi 525.125.06-1~deb12u1
Docker
No response
Steps to reproduce
1. Install 1.27.4-1 (apt install telegraf=1.27.4-1 -y --allow-downgrades)
2. Check whether power_draw is present in the output (it should be): telegraf --config /etc/telegraf/telegraf.conf --test --input-filter nvidia_smi | grep power_draw
3. Install 1.28.1-1 (apt install telegraf=1.28.1-1 -y)
4. Check whether power_draw is present in the output (it isn't): telegraf --config /etc/telegraf/telegraf.conf --test --input-filter nvidia_smi | grep power_draw
Expected behavior
I would expect power_draw to be there.
Actual behavior
It's not there.
Additional info
I have an NVIDIA Quadro P2000 in my Linux box running Debian 12 (bookworm) with nvidia-driver & nvidia-smi installed from the Debian repos (package versions are both 525.125.06-1~deb12u1). I have telegraf 1.28.1-1 installed as well and I am no longer getting power_draw from the telegraf output.

On the 12th, telegraf was updated to 1.28.0-1 from 1.27.4-1. If I roll back to version 1.27.4-1, I get power_draw metrics back.

Here is the output from nvidia-smi -x -q: https://gist.github.com/mbentley/3f0929563e4b4ecf0dde9ff30cd6dd1b

Looks like the doctype shows <!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v12.dtd"> but I have power_readings and not the two new blocks of gpu_power_readings and module_power_readings.

Snippet from the gist above:

I am not sure if this is an issue with nvidia-smi or telegraf so my apologies if it's not a telegraf issue. May be relevant to https://github.com/influxdata/telegraf/issues/13653 / https://github.com/influxdata/telegraf/pull/13678