tud-zih-energy / lo2s

Linux OTF2 Sampling - A Lightweight Node-Level Performance Monitoring Tool
https://tu-dresden.de/zih/forschung/projekte/lo2s?set_language=en
GNU General Public License v3.0

NaN collection in counter buffer #267

Closed by tilsche 1 year ago

tilsche commented 1 year ago

I encountered a trace in which a collected metric uncore_clock/clockticks/ became NaN at some point and never reverted to a valid state.

After a brief look at the code, it appears that CounterBuffer's state can become NaN when diff_enabled > diff_running and diff_running == 0. We need to avoid those cases. Note that this happened on kernel 5.19.1. I'm not sure what is special about time_enabled / time_running for PMU counters, or whether that "swap" bug is still in effect. Maybe we can also simplify the code now and only consider non-broken kernels? In any case, cases that would result in NaN should be logged and handled.
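To illustrate the failure mode, here is a minimal sketch, not the actual lo2s code. It assumes the usual perf multiplexing correction value * time_enabled / time_running is applied to per-interval deltas and then accumulated. The names CounterDelta, scale_naive, scale_checked and Accumulator are hypothetical. With diff_running == 0 and a value delta of 0, the factor becomes infinity and infinity * 0 is NaN on IEEE 754 platforms; once NaN is added to an accumulated state, it never recovers, which would match the observed trace.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the differences between two consecutive
// perf readings of one counter (not the lo2s data structure).
struct CounterDelta
{
    std::uint64_t value;   // raw counter increment
    std::uint64_t enabled; // increment of time_enabled
    std::uint64_t running; // increment of time_running
};

// Unguarded scaling: for running == 0 the factor is +inf (IEEE 754),
// and inf * 0 yields NaN.
inline double scale_naive(const CounterDelta& d)
{
    return (static_cast<double>(d.enabled) / static_cast<double>(d.running)) *
           static_cast<double>(d.value);
}

// Guarded scaling: returns no value when the counter was not running
// at all during the interval, so the caller can log and skip it.
inline std::optional<double> scale_checked(const CounterDelta& d)
{
    if (d.running == 0)
    {
        return std::nullopt;
    }
    if (d.running == d.enabled)
    {
        return static_cast<double>(d.value);
    }
    return (static_cast<double>(d.enabled) / static_cast<double>(d.running)) *
           static_cast<double>(d.value);
}

// Accumulates scaled deltas and counts the intervals it had to skip.
class Accumulator
{
public:
    void add(const CounterDelta& d)
    {
        if (auto scaled = scale_checked(d))
        {
            total_ += *scaled;
        }
        else
        {
            ++skipped_; // a real implementation would also log here
        }
        assert(!std::isnan(total_)); // invariant: state never becomes NaN
    }

    double total() const { return total_; }
    std::uint64_t skipped() const { return skipped_; }

private:
    double total_ = 0.0;
    std::uint64_t skipped_ = 0;
};
```

Skipping the interval is only one possible policy; keeping the unscaled value or emitting a gap in the trace would be alternatives. The point of the sketch is that the zero-running case is detected before it reaches the accumulated state.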

cvonelm commented 1 year ago

As far as I'm concerned, we can delete that code path. Neither Google nor a search of the Linux kernel code turns up anything about such an issue ever existing, and git blame on the comment about the swap bug points to the initial import from svn.

tilsche commented 1 year ago

Given that all information points to it being fixed for at least 7 years (the git-preserved history of lo2s), I tend to agree.

But at the same time I have to admit that I am utterly curious what it is/was.