munin-monitoring / munin

Main repository for munin master / node / plugins
http://munin-monitoring.org

sensors_ plugin reports invalid values only for certain days #1626

Closed baranyaib90 closed 4 months ago

baranyaib90 commented 4 months ago

Describe the bug I'm using the sensors_ plugin to monitor CPU temperature. I have noticed that the plugin sometimes reports higher temperatures, seemingly at random, but only for a one-day period that starts and ends at midnight (00:00). See the attached pictures for what I mean. I thought the issue had vanished, but I saw it again today.

To Reproduce No idea, it just happens.

Expected behavior This kind of false measurement should not happen. Room temperature is almost the same throughout the week.

Screenshots & Logs Older: temp1 (Drop between June 24-25 was a few hour shutdown.)

Today: temp2

Additional context I tried to debug it, and this is how far I got: I ran the following three commands at almost the same time. It seems to me that the issue happens when the plugin is run by munin. (The temp3-6 values differ when sensors_ is run with munin-run: 53 instead of 44.)

$ sensors
acpitz-acpi-0
Adapter: ACPI interface
temp1:        +27.8°C

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +46.0°C  (high = +105.0°C, crit = +105.0°C)
Core 0:        +44.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:        +44.0°C  (high = +105.0°C, crit = +105.0°C)
Core 2:        +46.0°C  (high = +105.0°C, crit = +105.0°C)
Core 3:        +46.0°C  (high = +105.0°C, crit = +105.0°C)

nct6798-isa-0290
Adapter: ISA adapter
in0:                      624.00 mV (min =  +0.00 V, max =  +1.74 V)
in1:                        1.01 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in2:                        3.39 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in3:                        3.34 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in4:                        1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:                      768.00 mV (min =  +0.00 V, max =  +0.00 V)
in6:                        1.01 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in7:                        3.39 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in8:                        3.15 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in9:                        1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in10:                       1.20 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in11:                       1.06 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in12:                       1.01 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in13:                       1.01 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in14:                       1.01 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
fan1:                     1240 RPM  (min =    0 RPM)
fan2:                        0 RPM  (min =    0 RPM)
fan7:                        0 RPM  (min =    0 RPM)
SYSTIN:                    +34.0°C  (high = +80.0°C, hyst = +75.0°C)
                                    (crit = +125.0°C)  sensor = thermistor
CPUTIN:                    +45.5°C  (high = +80.0°C, hyst = +75.0°C)
                                    (crit = +125.0°C)  sensor = thermistor
AUXTIN0:                   +26.0°C  (high = +80.0°C, hyst = +75.0°C)
                                    (crit = +100.0°C)  sensor = thermistor
AUXTIN1:                   +15.0°C  (high = +80.0°C, hyst = +75.0°C)
                                    (crit = +100.0°C)  sensor = thermistor
AUXTIN2:                   +23.0°C  (high = +80.0°C, hyst = +75.0°C)
                                    (crit = +100.0°C)  sensor = thermistor
AUXTIN3:                   +26.0°C  (high = +80.0°C, hyst = +75.0°C)
                                    (crit = +100.0°C)  sensor = thermistor
AUXTIN4:                   +39.0°C  (high = +118.0°C, hyst = +115.0°C)
                                    (crit = +100.0°C)
PECI Agent 0 Calibration:  +45.0°C  (high = +80.0°C, hyst = +75.0°C)
PCH_CHIP_CPU_MAX_TEMP:      +0.0°C
PCH_CHIP_TEMP:              +0.0°C
PCH_CPU_TEMP:               +0.0°C
PCH_MCH_TEMP:               +0.0°C
intrusion0:               ALARM
intrusion1:               OK
beep_enable:              disabled

/etc/munin/plugins$ sudo munin-run sensors_temp --debug
# Running 'munin-run' via 'systemd-run' with systemd properties based on 'munin-node.service'.
# Command invocation: systemd-run --collect --pipe --quiet --wait --property EnvironmentFile=/tmp/p4tK3jLfIR --property UMask=0022 --property LimitCPU=infinity --property LimitFSIZE=infinity --property LimitDATA=infinity --property LimitSTACK=infinity --property LimitCORE=infinity --property LimitRSS=infinity --property LimitNOFILE=524288 --property LimitAS=infinity --property LimitNPROC=62665 --property LimitMEMLOCK=8388608 --property LimitLOCKS=infinity --property LimitSIGPENDING=62665 --property LimitMSGQUEUE=819200 --property LimitNICE=0 --property LimitRTPRIO=0 --property LimitRTTIME=infinity --property SecureBits=0 --property 'CapabilityBoundingSet=cap_chown cap_dac_override cap_dac_read_search cap_fowner cap_fsetid cap_kill cap_setgid cap_setuid cap_setpcap cap_linux_immutable cap_net_bind_service cap_net_broadcast cap_net_admin cap_net_raw cap_ipc_lock cap_ipc_owner cap_sys_module cap_sys_rawio cap_sys_chroot cap_sys_ptrace cap_sys_pacct cap_sys_admin cap_sys_boot cap_sys_nice cap_sys_resource cap_sys_time cap_sys_tty_config cap_mknod cap_lease cap_audit_write cap_audit_control cap_setfcap cap_mac_override cap_mac_admin cap_syslog cap_wake_alarm cap_block_suspend cap_audit_read cap_perfmon cap_bpf cap_checkpoint_restore' --property DynamicUser=no --property PrivateTmp=yes --property PrivateDevices=no --property ProtectClock=no --property ProtectKernelTunables=no --property ProtectKernelModules=no --property ProtectKernelLogs=no --property ProtectControlGroups=no --property PrivateNetwork=no --property PrivateUsers=no --property PrivateMounts=no --property PrivateIPC=no --property ProtectHome=yes --property ProtectSystem=full --property NoNewPrivileges=no --property LockPersonality=no --property MemoryDenyWriteExecute=no --property RestrictRealtime=no --property RestrictSUIDSGID=no --property RestrictNamespaces=no --property ProtectProc=default --property ProtectHostname=no -- /usr/sbin/munin-run --ignore-systemd-properties sensors_temp --debug
# Processing plugin configuration from /etc/munin/plugin-conf.d/README
# Processing plugin configuration from /etc/munin/plugin-conf.d/dhcpd3
# Processing plugin configuration from /etc/munin/plugin-conf.d/https_dns_proxy
# Processing plugin configuration from /etc/munin/plugin-conf.d/munin-node
# Processing plugin configuration from /etc/munin/plugin-conf.d/spamstats
# Setting /rgid/ruid/ to /117/65534/
# Setting /egid/euid/ to /117 117/65534/
# Setting up environment
# Environment ignore_temp17 = yes
# Environment ignore_temp11 = yes
# Environment ignore_temp10 = yes
# Environment ignore_temp14 = yes
# Environment ignore_temp12 = yes
# Environment ignore_temp1 = yes
# Environment ignore_temp16 = yes
# Environment ignore_temp18 = yes
# Environment ignore_temp13 = yes
# Environment ignore_temp15 = yes
# Environment ignore_temp8 = yes
# Environment ignore_temp7 = yes
# Environment ignore_temp9 = yes
# About to run '/etc/munin/plugins/sensors_temp'
temp1.value 27.8
temp2.value 53.0
temp3.value 53.0
temp4.value 53.0
temp5.value 53.0
temp6.value 53.0
temp7.value 34.0
temp8.value 45.5
temp9.value 26.0
temp10.value 15.0
temp11.value 23.0
temp12.value 26.0
temp13.value 39.0
temp14.value 45.0
temp15.value 0.0
temp16.value 0.0
temp17.value 0.0
temp18.value 0.0
/etc/munin/plugins$ /etc/munin/plugins/sensors_temp
temp1.value 27.8
temp2.value 46.0
temp3.value 44.0
temp4.value 44.0
temp5.value 44.0
temp6.value 44.0
temp7.value 34.0
temp8.value 45.5
temp9.value 25.0
temp10.value 15.0
temp11.value 23.0
temp12.value 26.0
temp13.value 39.0
temp14.value 45.0
temp15.value 0.0
temp16.value 0.0
temp17.value 0.0
temp18.value 0.0
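
The two plugin outputs above can also be compared mechanically rather than by eye. The following is only a minimal sketch: the helper names `read_values` and `changed_fields` and the 1.0 degree threshold are illustrative assumptions, not part of munin, and the sample values are copied from the two runs above.

```python
def read_values(output):
    """Map 'tempN' -> float for each 'tempN.value X' line; ignore other lines."""
    values = {}
    for line in output.splitlines():
        name, _, raw = line.strip().partition(".value ")
        if not raw:
            continue  # debug lines, prompts, blank lines
        try:
            values[name] = float(raw)
        except ValueError:
            pass  # non-numeric readings are skipped
    return values


def changed_fields(first, second, threshold=1.0):
    """Return {field: (first_value, second_value)} where the runs differ by more than threshold."""
    a, b = read_values(first), read_values(second)
    return {k: (a[k], b[k]) for k in a if k in b and abs(a[k] - b[k]) > threshold}


# Excerpts of the two runs shown above (munin-run first, direct call second).
VIA_MUNIN_RUN = "temp1.value 27.8\ntemp2.value 53.0\ntemp3.value 53.0\ntemp9.value 26.0\n"
DIRECT = "temp1.value 27.8\ntemp2.value 46.0\ntemp3.value 44.0\ntemp9.value 25.0\n"

for field, (munin, direct) in changed_fields(VIA_MUNIN_RUN, DIRECT).items():
    print(f"{field}: munin-run={munin} direct={direct}")
```

With this threshold only the CPU package and core fields are flagged; the one-degree difference on temp9 is treated as ordinary fluctuation.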
kenyon commented 4 months ago

Not sure we can do anything about that. The plugin is just reporting the values it's given.

baranyaib90 commented 4 months ago

I'm sorry, but I can prove you are wrong:

  1. I added the following to /etc/munin/plugin-conf.d/munin-node:

    [sensors_temp]
    env.sensors /bin/mysensors.sh

  2. /bin/mysensors.sh contained:

    #!/bin/bash

    sensors | tee /tmp/sensors.txt

  3. The temp2-6 values in /tmp/sensors.txt do not match the plugin output at all:

    
    /etc/munin/plugins$ sudo munin-run sensors_temp --debug
    # Running 'munin-run' via 'systemd-run' with systemd properties based on 'munin-node.service'.
    # Command invocation: systemd-run --collect --pipe --quiet --wait --property EnvironmentFile=/tmp/w9vfLxXQ0j --property UMask=0022 --property LimitCPU=infinity --property LimitFSIZE=infinity --property LimitDATA=infinity --property LimitSTACK=infinity --property LimitCORE=infinity --property LimitRSS=infinity --property LimitNOFILE=524288 --property LimitAS=infinity --property LimitNPROC=62665 --property LimitMEMLOCK=8388608 --property LimitLOCKS=infinity --property LimitSIGPENDING=62665 --property LimitMSGQUEUE=819200 --property LimitNICE=0 --property LimitRTPRIO=0 --property LimitRTTIME=infinity --property SecureBits=0 --property 'CapabilityBoundingSet=cap_chown cap_dac_override cap_dac_read_search cap_fowner cap_fsetid cap_kill cap_setgid cap_setuid cap_setpcap cap_linux_immutable cap_net_bind_service cap_net_broadcast cap_net_admin cap_net_raw cap_ipc_lock cap_ipc_owner cap_sys_module cap_sys_rawio cap_sys_chroot cap_sys_ptrace cap_sys_pacct cap_sys_admin cap_sys_boot cap_sys_nice cap_sys_resource cap_sys_time cap_sys_tty_config cap_mknod cap_lease cap_audit_write cap_audit_control cap_setfcap cap_mac_override cap_mac_admin cap_syslog cap_wake_alarm cap_block_suspend cap_audit_read cap_perfmon cap_bpf cap_checkpoint_restore' --property DynamicUser=no --property PrivateTmp=yes --property PrivateDevices=no --property ProtectClock=no --property ProtectKernelTunables=no --property ProtectKernelModules=no --property ProtectKernelLogs=no --property ProtectControlGroups=no --property PrivateNetwork=no --property PrivateUsers=no --property PrivateMounts=no --property PrivateIPC=no --property ProtectHome=yes --property ProtectSystem=full --property NoNewPrivileges=no --property LockPersonality=no --property MemoryDenyWriteExecute=no --property RestrictRealtime=no --property RestrictSUIDSGID=no --property RestrictNamespaces=no --property ProtectProc=default --property ProtectHostname=no -- /usr/sbin/munin-run --ignore-systemd-properties sensors_temp --debug
    # Processing plugin configuration from /etc/munin/plugin-conf.d/README
    # Processing plugin configuration from /etc/munin/plugin-conf.d/dhcpd3
    # Processing plugin configuration from /etc/munin/plugin-conf.d/https_dns_proxy
    # Processing plugin configuration from /etc/munin/plugin-conf.d/munin-node
    # Processing plugin configuration from /etc/munin/plugin-conf.d/spamstats
    # Setting /rgid/ruid/ to /117/65534/
    # Setting /egid/euid/ to /117 117/65534/
    # Setting up environment
    # Environment ignore_temp10 = yes
    # Environment ignore_temp11 = yes
    # Environment ignore_temp12 = yes
    # Environment ignore_temp16 = yes
    # Environment ignore_temp9 = yes
    # Environment ignore_temp1 = yes
    # Environment ignore_temp7 = yes
    # Environment sensors = /bin/mysensors.sh
    # Environment ignore_temp13 = yes
    # Environment ignore_temp15 = yes
    # Environment ignore_temp8 = yes
    # Environment ignore_temp17 = yes
    # Environment ignore_temp14 = yes
    # Environment ignore_temp18 = yes
    # About to run '/etc/munin/plugins/sensors_temp'
    temp1.value 27.8
    temp2.value 52.0
    temp3.value 52.0
    temp4.value 52.0
    temp5.value 52.0
    temp6.value 52.0
    temp7.value 34.0
    temp8.value 46.0
    temp9.value 26.0
    temp10.value 15.0
    temp11.value 23.0
    temp12.value 26.0
    temp13.value 39.0
    temp14.value 46.0
    temp15.value 0.0
    temp16.value 0.0
    temp17.value 0.0
    temp18.value 0.0

    /etc/munin/plugins$ cat /tmp/sensors.txt
    acpitz-acpi-0
    Adapter: ACPI interface
    temp1:        +27.8 C

    coretemp-isa-0000
    Adapter: ISA adapter
    Package id 0:  +46.0 C  (high = +105.0 C, crit = +105.0 C)
    Core 0:        +46.0 C  (high = +105.0 C, crit = +105.0 C)
    Core 1:        +46.0 C  (high = +105.0 C, crit = +105.0 C)
    Core 2:        +46.0 C  (high = +105.0 C, crit = +105.0 C)
    Core 3:        +46.0 C  (high = +105.0 C, crit = +105.0 C)

    nct6798-isa-0290
    Adapter: ISA adapter
    in0:                      616.00 mV (min =  +0.00 V, max =  +1.74 V)
    in1:                        1.01 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
    in2:                        3.39 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
    in3:                        3.34 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
    in4:                        1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
    in5:                      768.00 mV (min =  +0.00 V, max =  +0.00 V)
    in6:                        1.01 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
    in7:                        3.39 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
    in8:                        3.15 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
    in9:                        1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
    in10:                       1.20 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
    in11:                       1.06 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
    in12:                       1.01 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
    in13:                       1.01 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
    in14:                       1.01 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
    fan1:                     1262 RPM  (min =    0 RPM)
    fan2:                        0 RPM  (min =    0 RPM)
    fan7:                        0 RPM  (min =    0 RPM)
    SYSTIN:                    +34.0 C  (high = +80.0 C, hyst = +75.0 C)
                                        (crit = +125.0 C)  sensor = thermistor
    CPUTIN:                    +46.0 C  (high = +80.0 C, hyst = +75.0 C)
                                        (crit = +125.0 C)  sensor = thermistor
    AUXTIN0:                   +26.0 C  (high = +80.0 C, hyst = +75.0 C)
                                        (crit = +100.0 C)  sensor = thermistor
    AUXTIN1:                   +15.0 C  (high = +80.0 C, hyst = +75.0 C)
                                        (crit = +100.0 C)  sensor = thermistor
    AUXTIN2:                   +23.0 C  (high = +80.0 C, hyst = +75.0 C)
                                        (crit = +100.0 C)  sensor = thermistor
    AUXTIN3:                   +26.0 C  (high = +80.0 C, hyst = +75.0 C)
                                        (crit = +100.0 C)  sensor = thermistor
    AUXTIN4:                   +39.0 C  (high = +118.0 C, hyst = +115.0 C)
                                        (crit = +100.0 C)
    PECI Agent 0 Calibration:  +46.0 C  (high = +80.0 C, hyst = +75.0 C)
    PCH_CHIP_CPU_MAX_TEMP:      +0.0 C
    PCH_CHIP_TEMP:              +0.0 C
    PCH_CPU_TEMP:               +0.0 C
    PCH_MCH_TEMP:               +0.0 C
    intrusion0:               ALARM
    intrusion1:               OK
    beep_enable:              disabled
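
The comparison in step 3 can be automated by parsing the captured sensors text and checking it against the plugin's tempN.value lines. This is a rough sketch under two assumptions: the function names are hypothetical, and tempN is taken to be the N-th temperature line in the sensors output, which matches the labels shown elsewhere in this thread but is not verified against the plugin source.

```python
import re

# Matches a reading such as "Core 0:  +46.0 C  (high = ...)" or "Core 0: +46.0°C".
# Only the first temperature on the line (the reading itself) is captured.
TEMP_LINE = re.compile(r"^([^:]+):\s+([+-]?\d+(?:\.\d+)?)\s*°?\s*C\b")
VALUE_LINE = re.compile(r"^(temp\d+)\.value\s+([+-]?\d+(?:\.\d+)?)$")


def parse_sensors(text):
    """Return [(label, value)] for every temperature line, in output order."""
    readings = []
    for line in text.splitlines():
        m = TEMP_LINE.match(line.strip())
        if m:
            readings.append((m.group(1).strip(), float(m.group(2))))
    return readings


def parse_plugin(text):
    """Return {'tempN': value} for every numeric tempN.value line."""
    matches = (VALUE_LINE.match(line.strip()) for line in text.splitlines())
    return {m.group(1): float(m.group(2)) for m in matches if m}


def mismatches(sensors_text, plugin_text, tolerance=0.0):
    """List (field, label, sensors_value, plugin_value) where the two disagree."""
    plugin = parse_plugin(plugin_text)
    diffs = []
    for index, (label, expected) in enumerate(parse_sensors(sensors_text), start=1):
        got = plugin.get(f"temp{index}")  # assumed mapping: N-th reading -> tempN
        if got is None or abs(got - expected) > tolerance:
            diffs.append((f"temp{index}", label, expected, got))
    return diffs


# Excerpt of the captured file and the plugin output from step 3.
SENSORS = """acpitz-acpi-0
Adapter: ACPI interface
temp1:        +27.8 C

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +46.0 C  (high = +105.0 C, crit = +105.0 C)
Core 0:        +46.0 C  (high = +105.0 C, crit = +105.0 C)
"""
PLUGIN = "temp1.value 27.8\ntemp2.value 52.0\ntemp3.value 52.0\n"

for row in mismatches(SENSORS, PLUGIN):
    print(row)
```

A check like this only shows that the file and the plugin output disagree. It does not show why: the wrapper rewrites /tmp/sensors.txt on every call, so the file reflects only the most recent invocation.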

kenyon commented 4 months ago

You'll have to do some debugging to figure out where those values are coming from. Also, showing the output of sudo munin-run sensors_temp config should help, because without that it's hard to tell which value corresponds to which sensor.

baranyaib90 commented 4 months ago

The values cannot be anything other than the CPU temperatures.

The day has passed, values back to normal (less fluctuation, peaks are reasonable according to CPU load): image

The output of the requested command:

$ sudo munin-run sensors_temp config
graph_title Temperatures
graph_vlabel degrees Celsius
graph_args --base 1000
graph_category sensors
temp1.label temp1
temp1.graph no
temp2.label Package id 0
temp2.warning 105.0
temp2.critical 105.0
temp3.label Core 0
temp3.warning 105.0
temp3.critical 105.0
temp4.label Core 1
temp4.warning 105.0
temp4.critical 105.0
temp5.label Core 2
temp5.warning 105.0
temp5.critical 105.0
temp6.label Core 3
temp6.warning 105.0
temp6.critical 105.0
temp7.label SYSTIN
temp7.warning 75.0
temp7.critical 80.0
temp7.graph no
temp8.label CPUTIN
temp8.warning 75.0
temp8.critical 80.0
temp8.graph no
temp9.label AUXTIN0
temp9.warning 75.0
temp9.critical 80.0
temp9.graph no
temp10.label AUXTIN1
temp10.warning 75.0
temp10.critical 80.0
temp10.graph no
temp11.label AUXTIN2
temp11.warning 75.0
temp11.critical 80.0
temp11.graph no
temp12.label AUXTIN3
temp12.warning 75.0
temp12.critical 80.0
temp12.graph no
temp13.label AUXTIN4
temp13.warning 115.0
temp13.critical 118.0
temp13.graph no
temp14.label PECI Agent 0 Calibration
temp14.warning 75.0
temp14.critical 80.0
temp14.graph no
temp15.label PCH_CHIP_CPU_MAX_TEMP
temp15.graph no
temp16.label PCH_CHIP_TEMP
temp16.graph no
temp17.label PCH_CPU_TEMP
temp17.graph no
temp18.label PCH_MCH_TEMP
temp18.graph no
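
To read value output alongside this config output, the field names can be mapped to their labels. The sketch below is illustrative only: `field_labels` and `graphed_fields` are hypothetical helpers, and the sample is a shortened copy of the config output above.

```python
def field_labels(config_text):
    """Return ({field: label}, {fields marked 'graph no'}) from plugin config output."""
    labels, hidden = {}, set()
    for line in config_text.splitlines():
        key, _, value = line.strip().partition(" ")
        field, dot, attr = key.partition(".")
        if not dot:
            continue  # graph-level settings such as graph_title
        if attr == "label":
            labels[field] = value
        elif attr == "graph" and value == "no":
            hidden.add(field)
    return labels, hidden


def graphed_fields(config_text):
    """Return {field: label} for the fields that are actually drawn on the graph."""
    labels, hidden = field_labels(config_text)
    return {f: label for f, label in labels.items() if f not in hidden}


# Shortened copy of the config output above.
CONFIG = """graph_title Temperatures
temp1.label temp1
temp1.graph no
temp2.label Package id 0
temp2.warning 105.0
temp3.label Core 0
temp7.label SYSTIN
temp7.graph no
"""

print(graphed_fields(CONFIG))
```

Applied to the full config output, this leaves temp2 through temp6 (the CPU package and the four cores) as the only graphed fields.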
kenyon commented 4 months ago

OK, so is there a bug?

baranyaib90 commented 4 months ago

There is a bug, but I have no idea what causes it. Please read the ticket carefully. My point is: sometimes, for a whole day (starting and ending at 00:00), the plugin reports wrong temperature values for the CPU package and cores. I have not found any reasonable explanation. In my previous comment, where I created the /bin/mysensors.sh sensors wrapper, you can see that the plugin did not report the values from the sensors output. Do you get my point?

kenyon commented 4 months ago

The temperature can be different any time you observe the sensor.

Without some more information, I don't see a bug here.

baranyaib90 commented 4 months ago

You did not understand how I proved you wrong... There was no temperature change at all. This was a waste of time. Thanks for nothing.

kenyon commented 4 months ago

All you've shown is that the temperature changes sometimes. If there is a bug, you'll have to show exactly where the bug lives.