prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Handle thermal_zone errors gracefully #2980

Open scotts-tp opened 6 months ago

scotts-tp commented 6 months ago

Host operating system:

Linux 5.10.104-tegra #18 SMP PREEMPT aarch64 aarch64 aarch64 GNU/Linux

node_exporter version:

1.7.0

node_exporter command line flags:

--path.rootfs=/host

node_exporter log output:

...
caller=collector.go:169 level=error msg="collector failed" name=thermal_zone duration_seconds=0.01870677 err="read /sys/class/thermal/thermal_zone10/temp: invalid argument"
caller=collector.go:169 level=error msg="collector failed" name=thermal_zone duration_seconds=0.001411717 err="read /sys/class/thermal/thermal_zone10/temp: invalid argument"
...

Are you running node_exporter in Docker?

Yes

What did you do that produced an error?

Ran node_exporter in a Docker container on a custom embedded device.

What did you expect to see?

Disabled thermal zones being either ignored or, optionally, filtered out.

What did you see instead?

The entire thermal_zone collector fails for all thermal_zones.

When a thermal zone is disabled, which can be determined via /sys/class/thermal/thermal_zone10/mode, it would be nice for node_exporter to handle it gracefully, whether natively or via a flag. Alternatively, it could allow specific files/devices to be filtered out manually, instead of only being able to disable the entire class of devices.

My temporary workaround has been to use the Pushgateway with a curl container in my Docker Compose file, like so:

  pushgateway:
    image: prom/pushgateway
    container_name: pushgateway
    restart: unless-stopped
    networks:
      - metrics
  curl_thermals:
    image: curlimages/curl
    container_name: curl_thermals
    command: '/bin/sh /pushgateway-thermal-zones.sh'
    pid: host
    restart: unless-stopped
    volumes:
      - /:/host:ro,rslave
      - ./pushgateway-thermal-zones.sh:/pushgateway-thermal-zones.sh:ro,rslave
    networks:
      - metrics

With this pushgateway-thermal-zones.sh script:

while true
do
    output="# TYPE thermal_zone gauge\n# HELP thermal_zone Thermal zone temperatures in Celsius\n"

    # Loop through each thermal zone directory in /host/sys/class/thermal
    for zone in /host/sys/class/thermal/thermal_zone*; do
        # Check if the thermal zone is enabled by reading the mode file
        mode=$(cat "${zone}/mode")
        if [ "${mode}" = "enabled" ]; then
            zone_number=$(basename "${zone}" | sed 's/thermal_zone//')
            zone_type=$(cat "${zone}/type")
            # The temp file reports millidegrees Celsius; scale to degrees
            zone_temp=$(cat "${zone}/temp")
            zone_temp_scaled=$(echo "scale=2; ${zone_temp} / 1000.0" | bc)

            # Append the details to the output variable
            output="${output}thermal_zone{zone=\"${zone_number}\", type=\"${zone_type}\"} ${zone_temp_scaled}\n"
        fi
    done

    # printf '%b' expands the \n escapes; quoting keeps the text intact
    printf '%b' "${output}" | curl -s --data-binary @- http://pushgateway:9091/metrics/job/thermal_zones/
    sleep 3
done

Kylea650 commented 6 months ago

Seems like the error is coming from here: https://github.com/prometheus/procfs/blob/69fc8f61debb3bd7efca3a9a1c295d4012022830/sysfs/class_thermal.go#L73 / https://github.com/prometheus/procfs/blob/69fc8f61debb3bd7efca3a9a1c295d4012022830/sysfs/class_thermal.go#L52. Maybe there should be a check there for whether the error is os.ErrInvalid, and if so either return an empty ClassThermalZoneStats{} or ignore it. Another option could be to check the mode for 'disabled' first in parseClassThermalZone() and return early.

Not sure how to achieve this directly from node_exporter.

discordianfish commented 6 months ago

@Kylea650 Checking mode for disabled sounds like a good option. If anyone wants to submit a PR to sysfs, feel free to ping me there.

Kylea650 commented 6 months ago

@discordianfish Happy to raise a new issue mentioning this one and PR over in sysfs this week. Cheers!

parthlaw commented 1 month ago

Is this issue still open?