prometheus / procfs

procfs provides functions to retrieve system, kernel and process metrics from the pseudo-filesystem proc.
Apache License 2.0

Slice bound out of range on filterOfflineCPUs #530

Closed (antham closed this 1 year ago)

antham commented 1 year ago

I got this stack trace:

panic: runtime error: slice bounds out of range [4:3]
goroutine 70 [running]:
github.com/prometheus/procfs/sysfs.filterOfflineCPUs(0xc00031e600?, 0xc000243c08)
        /go/pkg/mod/github.com/prometheus/procfs@v0.10.0/sysfs/system_cpu.go:181 +0x23a
github.com/prometheus/procfs/sysfs.FS.SystemCpufreq({{0xbca467?, 0x4?}})
        /go/pkg/mod/github.com/prometheus/procfs@v0.10.0/sysfs/system_cpu.go:209 +0x28b
github.com/prometheus/node_exporter/collector.(*cpuFreqCollector).Update(0x0?, 0x0?)
        /app/collector/cpufreq_linux.go:51 +0x45
github.com/prometheus/node_exporter/collector.execute({0xbcd885, 0x7}, {0xce3840, 0xc000070860}, 0x0?, >
        /app/collector/collector.go:161 +0x9c
github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func1({0xbcd885?, 0x0?}, {0xce3840?>
        /app/collector/collector.go:152 +0x3d
created by github.com/prometheus/node_exporter/collector.NodeCollector.Collect
        /app/collector/collector.go:151 +0xd0

I think the function is being called with the following arguments:

filterOfflineCPUs(&[]uint16{2, 3}, &[]string{
    "/sys/devices/system/cpu/cpu0",
    "/sys/devices/system/cpu/cpu1",
    "/sys/devices/system/cpu/cpu2",
    "/sys/devices/system/cpu/cpu3",
})
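
To see how these arguments could produce the panic above, here is a minimal sketch. It assumes, without confirmation from the procfs source, that the function deletes matching entries from the slice while ranging over it. `removeOfflineInPlace` and `tryRemove` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
)

// removeOfflineInPlace is a hypothetical stand-in, NOT the procfs code.
// It deletes offline CPUs from *cpus while ranging over that same slice.
func removeOfflineInPlace(offline []uint16, cpus *[]string) error {
	// range evaluates *cpus once, so i still runs over the original length
	// even after the slice has been shortened below.
	for i, cpu := range *cpus {
		name := strings.TrimPrefix(filepath.Base(cpu), "cpu")
		n, err := strconv.ParseUint(name, 10, 16)
		if err != nil {
			return err
		}
		for _, off := range offline {
			if uint16(n) == off {
				// Once an entry has been removed, i+1 can exceed len(*cpus).
				*cpus = append((*cpus)[:i], (*cpus)[i+1:]...)
				break
			}
		}
	}
	return nil
}

// tryRemove runs removeOfflineInPlace and turns a panic into a message.
func tryRemove(offline []uint16, cpus []string) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	_ = removeOfflineInPlace(offline, &cpus)
	return "no panic"
}
```

With offline CPUs 2 and 3, removing cpu2 at index 2 shortens the slice to three entries, so the next iteration at index 3 evaluates `(*cpus)[4:]` on a slice of length 3.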

I guess the problem occurs because the slice shrinks on each loop iteration while the loop index keeps advancing. What about introducing another slice to store the result of the filtering?
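
A rough sketch of that idea, for illustration only: the name `filterOnline`, its signature, and the use of a map for lookups are assumptions and not the procfs API or the eventual fix. The input slice is only read, and the online CPUs are collected into a separate slice, so no index can run past a shrinking length.

```go
package main

import (
	"path/filepath"
	"strconv"
	"strings"
)

// filterOnline is a hypothetical helper, not part of procfs.
// It returns the entries of cpus whose numeric suffix is not in offline.
func filterOnline(offline []uint16, cpus []string) ([]string, error) {
	// Build a set of offline CPU numbers for constant-time lookups.
	isOffline := make(map[uint16]struct{}, len(offline))
	for _, id := range offline {
		isOffline[id] = struct{}{}
	}

	// Collect results in a new slice; cpus itself is never modified.
	online := make([]string, 0, len(cpus))
	for _, cpu := range cpus {
		name := strings.TrimPrefix(filepath.Base(cpu), "cpu")
		n, err := strconv.ParseUint(name, 10, 16)
		if err != nil {
			return nil, err
		}
		if _, off := isOffline[uint16(n)]; !off {
			online = append(online, cpu)
		}
	}
	return online, nil
}
```

Called with the arguments from this report (offline CPUs 2 and 3, paths cpu0 through cpu3), this sketch returns the cpu0 and cpu1 paths and leaves the input slice unchanged.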

AdarshdeepCheema commented 1 year ago

We are also hitting this on one of our 3 systems.

AdarshdeepCheema commented 1 year ago

This seems to be caused by https://github.com/prometheus/procfs/pull/497. The system that fails has 8 CPUs, only 2 of which are online. The systems that had no issue have 4 CPUs, all of them online.

taherkk commented 1 year ago

@antham This does seem to be the issue, and your guess is correct. I have fixed the code and will raise a PR shortly.

antham commented 1 year ago

Ok thank you :+1: