giampaolo / psutil

Cross-platform lib for process and system monitoring in Python
BSD 3-Clause "New" or "Revised" License
10.27k stars 1.38k forks source link

boot_time() is wrong #658

Open MauroMombelli opened 9 years ago

MauroMombelli commented 9 years ago

hi,

on a system without HW clock (raspberry) the boot_time() give 45 years (since epoch), while "uptime" command from shell give correct results.

giampaolo commented 9 years ago

What OS is that? Linux?

MauroMombelli commented 9 years ago

sorry, yes, arch linux arm, updated to latest version

giampaolo commented 9 years ago

Please paste the output of cat /proc/stat.

MauroMombelli commented 9 years ago

cat /proc/stat cpu 5896703 6291 435376 40166914 949451 381 55053 0 0 0 cpu0 5896703 6291 435376 40166914 949451 381 55053 0 0 0 intr 169712969 0 0 0 33510989 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 394073 158525 0 0 0 0 0 0 133113018 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 3 2536346 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ctxt 92586406 btime 1437869567 processes 7898 procs_running 1 procs_blocked 0 softirq 29134775 5943048 9746842 95726 2945587 142 0 7961860 0 22627 2418943

giampaolo commented 9 years ago

Uhm... that btime 1437869567 line looks correct:

>>> import datetime
>>> datetime.datetime.fromtimestamp(1437869567.0).strftime("%Y-%m-%d %H:%M:%S")
'2015-07-26 02:12:47'

Maybe your system clock was messed up?

MauroMombelli commented 9 years ago

the system clock is messed up because the raspi does not have the HW clock, so it start as timestamp=0 and then NTP set it to correct value. Now i can't replicate the problem, but i played a bit with NTP and systemd settings; and as i don't have physical access to the machine i won't play with it again :/

giampaolo commented 9 years ago

Since this is no longer reproducible I'm gonna close it for now. Feel free to reopen in case you bump into this issue again.

Gui13 commented 4 years ago

I have a pretty easy way to reproduce it here with telegraf, which uses the gopsutil library, which is pretty much a word for word rewrite of your library:

https://github.com/influxdata/telegraf/issues/7018

I have added a bug at gopsutil but your bug is pretty much the same: launch the python script before NTP does its magic, then wait until NTP corrects the date, and then enjoy your 40+ years uptime :D

giampaolo commented 4 years ago

Mmm... I guess that means that /proc/stat doesn't get updated for some reason. You should check whether also /proc/uptime is updated or not. If it's updated we can use that one.

Gui13 commented 4 years ago

This is what I get for a problematic Rpi:

pi@eanode5:~ $ cat /proc/uptime
11779.00 45685.18
pi@eanode5:~ $ cat /proc/stat
cpu  9382 37 8924 4689169 923 0 143 0 0 0
cpu0 2066 0 2431 1171940 258 0 83 0 0 0
cpu1 2058 2 2132 1173105 141 0 20 0 0 0
cpu2 2263 3 2249 1172836 276 0 18 0 0 0
cpu3 2995 32 2112 1171286 247 0 22 0 0 0
intr 952432 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6690 7917 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 729 0 3763 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 269 7179 0 0 0 0 90339 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 526853 0 0 0
ctxt 1400316
btime 1581528798
processes 15829
procs_running 1
procs_blocked 0
softirq 811644 0 263970 2039 19548 0 0 86407 266106 0 173574
pi@eanode5:~ $ date +%s
1581540580

If you do the math, it actually makes sense, all the numbers are correct. btime has been adjusted by the NTP.

The issue is that you retrieve btime when your module is first instanciated and keep it for the rest of the duration of the process, so if the btime changes, you will do computations on an older value.

giampaolo commented 4 years ago

The issue is that you retrieve btime when your module is first instanciated and keep it for the rest of the duration of the process, so if the btime changes, you will do computations on an older value.

That is true, Process.create_time() on Linux relies on a cached version of whatever psutil.boot_time() returned the last time it was called. Maybe this should change. The original issue was about psutil.boot_time() though.

giampaolo commented 4 years ago

There is another flip of this coin which is probably worth mentioning. Process.create_time() is different than other methods because it's value is set on __init__ in order to identify the process instance uniquely over time. This way we avoid killing the wrong process if its PID has been reused. If the clock changes and create_time() returns something different, that Process instance will probably raise NoSuchProcess("this PID has been reused by another process") on kill(), terminate() and other "write" methods (set ionice() or whatever). Probably it's an edge case, and it's either that or live with process ctimes 40 years into the future.

The cost of giving up the cache and return the right process creation time after a system clock update is a 2x slowdown when creating a Process instance (after that the result is cached):

~/svn/psutil {master}$ python3 -m timeit -s "import psutil" "psutil.Process()"
10000 loops, best of 3: 25.2 usec per loop
~/svn/psutil {master}$ python3 -m timeit -s "import psutil" "psutil.Process()"
10000 loops, best of 3: 56.2 usec per loop
Gui13 commented 4 years ago

I understand the performance issue, but the returned value is still wrong for any system that encountered NTP adjustment during the life of the python process that uses psutil.

As I said in the original bug in telegraf (which uses the python gopsutil but has the same bug), I suspect that many monitoring systems might be minutes or hours off their real uptime, depending on how skewed their RTC is, and how often the NTP process has to compensate for it. Right now the only way to solve the issue is to restart the python process doing the monitoring.

Maybe there is a middle ground where the actual btime is read again at regular intervals (say, at least every 15 minutes) and cached in between?

giampaolo commented 4 years ago

I agree this should be fixed. We can resolve the slowdown by getting the boot time by using sysinfo(2) syscall instead of parsing /proc/stat: https://stackoverflow.com/a/1544090/376587 It's already exposed in psutil, we just have to return the extra struct info. I made a quick test calling sysinfo(2) on create_time() and the slowdown is minimal, so I'd say we can proceed with this:

before:
~/svn/psutil {master}$ python3 -m timeit -s "import psutil" "psutil.Process()"
10000 loops, best of 3: 23.5 usec per loop

after:
~/svn/psutil {master}$ python3 -m timeit -s "import psutil" "psutil.Process()"
10000 loops, best of 3: 25.8 usec per loop

Note to self: this issue is related to #1724.

giampaolo commented 4 years ago

Also, another note to self: originally we were already using sysinfo(2) to get the boot_time(), but decided to use /proc/stat instead so one can realiably determine the boot time of a Docker instance (when running psutil inside docker). There's a ticket somewhere but I'm lazy. Maybe it makes sense to rely on /proc/stat if we're running inside Docker, else use sysinfo(2).

giampaolo commented 4 years ago

sysinfo(2) gives the number of seconds since boot, which changes on every call, and that is not good as we have to do some math in order to calculate it, and that may lead to a variation of result between calls. We have a similar problem on Windows, which I solved with this hack: https://github.com/giampaolo/psutil/blob/7ce6c3c1626bd7101f69031e36b980639ca591d6/psutil/_pswindows.py#L435-L440 It appears that uptime -s returns a stable time, but still not sure how. This should be the source code: https://github.com/coreutils/coreutils/blob/master/src/uptime.c

Gui13 commented 4 years ago

@giampaolo if sysinfo returns the corrected value after an adjustment, then it makes sense yes.

If what is said in https://github.com/giampaolo/psutil/issues/1724 is right, you can bypass the problem for the process isRunning() issue by reading the /proc/PID/stat instead of basing on /proc/stat

Then the last thing to check is that in VM/docker environment, the sysinfo call correctly adjusts (but my guess is that yes, it adjusts, since in VM it is like in real machine and in docker it actually responds what the host kernel knows)

Gui13 commented 4 years ago

Ah my bad, sysinfo doesn't return the btime. Well then /proc/stat would be the solution.. maybe with some caching so that you don't re-read it all the time?

giampaolo commented 4 years ago

Well then /proc/stat would be the solution.. maybe with some caching so that you don't re-read it all the time?

The caching is why we have this problem currenly: since boot_time() value is cached (when using it from Process.create_time()) we don't get NTP updates.

Gui13 commented 4 years ago

Yeah but it might be sufficient to read it at regular intervals (every N calls, or store the last read timestamp and read every 15 minutes). The problem lies in reading it only once. If the uptime is eventually consistent, then it might be OK.

I'm just suggesting this because you mentioned the performance hit, the real, irrefutable fix is to read /proc/stat every time.