Inconsistent behaviors of computing CPU% when time "delta" is zero

Explorer09 commented 1 year ago

I originally intended to address this issue in pull request #1271 before I found out the issue is too big to handle in a single commit.

The CPU percentages of the machine and processes are usually computed through a formula like this: (float) (newUsedTime - oldUsedTime) / (newElapsedTime - oldElapsedTime) * 100.0 The denominator represents a difference between the current time and the timestamp of last measurement. However, when this time difference is zero, platform codes of htop handle them differently, and some of them are a bit buggy.

Here is a summary of the behaviors of different htop platform codes:

DragonFlyBSD & FreeBSD
- When delta time is zero, the machine's CPU percentage is computed as if delta time is one.
- A process's CPU time is reported in percentage, but it requires a division by a fixed scale (kernelFScale) which is not checked for zero.
NetBSD & OpenBSD
- When delta time is zero, the machine's CPU percentage is computed as if delta time is one.
- A process's CPU time is reported in percentage. If fscale is zero, the percentage is set to zero.
Darwin
- The machine's CPU percentage may be infinity or NaN (division by zero).
- When delta time is zero, the process's CPU percentage is set to zero.
Linux
- When delta time is zero, the machine's CPU percentage is computed as if delta time is one.
- When delta time is zero, the process's CPU percentage is set to zero.
PCP
- When delta time is zero, the machine's CPU percentage is computed as if delta time is one.
- When delta time is zero, the process's CPU percentage may be either 100 or 0 due to clamping. Division by zero could happen, but the NaN value would be caught by the isnan check.
Solaris
- The machine's CPU percentage may be infinity or NaN (division by zero).
- A process's CPU time is reported in percentage. The scale to divide is hard coded so no division by zero could happen.

Source code references (I use commit 2d4e5cba225439217464938bf7eaed277427e12d from main as a base; for convenience this page marks all lines that are referenced):

dragonflybsd/DragonFlyBSDMachine.c line 204
dragonflybsd/DragonFlyBSDProcessList.c line 217
dragonflybsd/DragonFlyBSDMachine.c line 108
freebsd/FreeBSDMachine.c line 231
freebsd/FreeBSDProcessList.c line 238
freebsd/FreeBSDMachine.c line 139
netbsd/Platform.c line 242
netbsd/NetBSDProcessList.c line 147
openbsd/Platform.c line 195
openbsd/OpenBSDProcessList.c line 127
darwin/Platform.c line 269
darwin/DarwinProcess.c line 378
linux/Platform.c line 310
linux/LinuxProcessList.c line 1483
pcp/Platform.c line 490
pcp/PCPProcessList.c line 379
solaris/SolarisMachine.c line 130
solaris/SolarisProcessList.c lines 192 and 221

Yes. Platform codes have different behaviors when the time "delta" in the denominator is zero. The result I wish is all platform codes behave consistently in this case, with explicit checks on whether the time "delta" is zero. The time "delta" being zero is not necessary a bug (it can theoretically happen when the measures are made too fast; faster than the OS can update the process status).

Update: Sorry for not explaining this clear enough (I was sleepy when I wrote the initial version of this report). There are several approaches for the case where the "delta time" is zero:

Make it an assert() error, but skip the check in non-debug builds of htop.
Allow the delta time to be zero, but not compute or update the percentage values in that case. Keep the same percentage values as the last measure time.
Set the percentages to a fixed value when the delta time is zero. The percentage value can be zero or NaN, whichever makes sense.

The goal was to eliminate the division by zero possibly, so that unnecessary isnan checks can be removed (I did #1271 to replace isnan with better alternatives). htop uses NaN values mainly for "data not available", "not supported by machine or OS" or "error retrieving data", and not for indicating arithmetic errors (such as division by zero). So the possibility of NaN be slipped from here looks more like a bug to me. By the way, I prefer the approach no. 2 if I can choose.

natoscott commented 1 year ago

FWIW, I suspect this is a case of "perfect is the enemy of good".

I cannot see any situation where the delta can be zero ... AFAICS it will not happen in htop, have you observed this?

Even if we make two calls to clock_gettime immediately one after the other (which we certainly never do, there's always heaps of time between them in practice), we'll still see a small time difference and small delta value.

IMO this issue should just be closed as its just likely to cause code churn - best not to chase our tails on things that can "theoretically happen" when in practice they cannot.

Explorer09 commented 1 year ago

@natoscott There is not even an assertion for that. And the time source isn't just clock_gettime. There are other time sources where "less than one unit of time difference" is possible between two measures.

The first part of the code where I found "less than one unit of time difference" is of concern is the handleNetlinkMsg() function in linux code - where stats.ac_etime is in microseconds but stats.cpu_delay_total and like are in nanoseconds, and because of the lack of zero time difference check, isnan is used to catch that after the fact which make the code more ugly.

natoscott commented 1 year ago

@Explorer09 I'm not sure I understand what you mean here - you're saying "there is not even an assertion for that" but in the original post you stated ~10 times "when delta is zero". So, AIUI, there is definitely an assertion that delta will be zero ... am I misunderstanding?

In the netlink case where a unit conversion between usec and nsec is needed, it sounds like it would be simpler to just multiply the usec value by 1000 so we're on common units, and still with no possibility of a zero delta.

You understand my concern that this is just code churn, right? No actual problem is being solved here AFAICT, so we probably should not be doing this unless it intrinsically makes the code better (more efficient?) in some other way. But I'm just not really seeing a convincing argument for that so far (I'm happy to be convinced though), the rationale seems to be based on "when delta is zero" which as I say, I don't think can actually happen.

Explorer09 commented 1 year ago

Update: Sorry for not explaining this clear enough (I was sleepy when I wrote the initial version of this report). I mean there are several approaches for the case where the "delta time" is zero:

Make it an assert() error, but skip the check in non-debug builds of htop.
Allow the delta time to be zero, but not compute or update the percentage values in that case. Keep the same percentage values as the last measure time.
Set the percentages to a fixed value when the delta time is zero. The percentage value can be zero or NaN, whichever makes sense.

The goal was to eliminate the division by zero possibly, so that unnecessary isnan checks can be removed (I did #1271 to replace isnan with better alternatives). htop uses NaN values mainly for "data not available", "not supported by machine or OS" or "error retrieving data", and not for indicating arithmetic errors (such as division by zero). So the possibility of NaN be slipped from here looks more like a bug to me. By the way, I prefer the approach no. 2 if I can choose. The check of zero divisor is trivial and it's just a compare and a branch instructions to be added in code, and I am happy to work with that.

natoscott commented 1 year ago

@Explorer09 of those three, I'd vote for option 1 I guess. There's also option 0, which I also very much support, which is "no changes" since there seems to be no actual issue being addressed here AFAICS.

BenBE commented 1 year ago

I think, – and that's why I asked to split this from the original issue –, that having various platforms behave differently for what should be the same operation (get CPU usage, get Process usage) should be consistent. As listed in the initial post, this currently is not the case.

Regarding the question, what the "convergent" behaviour should be in the end, I prefer an approach, where the visible result is sensible and comprehensible to the user.

Based on the various implementation mentioned at the top, I think the options for handling, when the delta time is less than some minimal interval, are:

Machine CPU usage:
1. assume that minimal interval passed
2. short-circuit the result to NaN or zero
3. propagate the arithmetic error (not an option)
Process CPU usage:
1. assume minimal interval passed, use sane/common default for the scaling factor
2. short-circuit the result to NaN or zero

While IMO assuming a minimal interval (of 1ms, which is plenty for htop) is possible, the problem is that values measured on those short timescales are numerically unstable. Thus you are likely to risk showing wildly inaccurate data. Thus a consistent way to go forward here is actually to drop such measurements, which would mean to explicitly mark them as NaN, as no reliable measurement is available. While using zero here could be considered, you would be having the risk of showing conflicting data all over again. Nothing is worse than confidently showing 0% CPU usage while directly below a bunch of processes are shown to be churning away at your CPU at 100% …

TL;DR: If the delta time is below a minimal interval of 1ms reject such values and report NaN instead.

Explorer09 commented 1 year ago

I disagree with the assumption that "there is no issue", because there are other parts of htop code that check if the time "period" is zero, and for the "option 0" as @natoscott mentioned, the ideal would be to remove those checks as well.

Commit d1db9da936630848ebe164e4bdf52155d65b15f8 (in Linux code) is one of them that addressed the case that the delta time could be zero, and it's better to address all of such cases consistently.

By the way, it's me that split this issue from #1271 because I think this issue deserves more discussion and is not a trivial task to make it with that pull request.

As for @BenBE 's comments… Yes, propagating the arithmetic error sounds like the worst solution, because that would end up multiple items (CPU times) being in 100% (clamped from infinity), confusing the user more. Having a NaN result sounds a bit better as it's just a short moment that represents "data not available; will try measuring again some milliseconds later". And one more advantage to that approach is the error would be visible to user (and a non-fatal one). Thanks for the suggestion.

And you guys may have noticed I proposed #1275 to reject the zero fixed scale (for BSD platforms).

natoscott commented 1 year ago

Commit https://github.com/htop-dev/htop/commit/d1db9da936630848ebe164e4bdf52155d65b15f8 (in Linux code) is one of them that addressed the case that the delta time could be zero, and it's better to address all of such cases consistently.

But this commit is not dealing with the time delta we've been discussing. Its a weak comparison (ticks vs time) - one is a Linux-specific corner case where we know full well the ticks-based denominator can be zero - whereas here we've been discussing time delta between consecutive clock_gettime(2) samples with a substantial delay, where we always know the time-difference denominator is non-zero.

I'm concerned about sprinkling dead code throughout htop, which is what these additional branches would amount to. Usually we remove dead code whenever we find it.

Explorer09 commented 1 year ago

But this commit is not dealing with the time delta we've been discussing. Its a weak comparison (ticks vs time) - one is a Linux-specific corner case where we know full well the ticks-based denominator can be zero - whereas here we've been discussing time delta between consecutive clock_gettime(2) samples with a substantial delay, where we always know the time-difference denominator is non-zero.

For me it's the same issue, that htop doesn't handle external data properly. Data gathered from external sources should be assumed untrustworthy, so we are supposed to filter unlikely input, cutting corner cases no matter how unlikely it can happen.

I'm concerned about sprinkling dead code throughout htop, which is what these additional branches would amount to. Usually we remove dead code whenever we find it.

Yes, such branches might look "dead" or very unlikely to happen, and I am trying not to introduce anything substantial to them. Usually it's about setting variables to sane values, or jump to an error branch that already exists (the approach I picked in the PR #1275 )

htop-dev / htop

Inconsistent behaviors of computing CPU% when time "delta" is zero #1274