shirou / gopsutil

psutil for golang
Other
10.36k stars 1.57k forks source link

ProcTotal Inaccurate in `MiscStats` #1606

Closed eric1234 closed 4 months ago

eric1234 commented 5 months ago

Describe the bug

PR #682 added the ability to get the total number of processes via the MiscStats struct. It gets this information by reading the /proc/loadavg file which looks like:

0.35 0.22 0.18 2/1095 5671

The proc man page describes that file as:

The first three fields in this file are load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes. They are the same as the load average numbers given by uptime(1) and other programs. The fourth field consists of two numbers separated by a slash (/). The first of these is the number of currently runnable kernel scheduling entities (processes, threads). The value after the slash is the number of kernel scheduling entities that currently exist on the system. The fifth field is the PID of the process that was most recently created on the system.

The bolded text is the key part. In my example above, it's showing 1095 as the total number of kernel scheduled entities. But note it uses that term instead of "processes". It even indicates in parentheses that "kernel scheduled entities" is both processes and threads. But by the name ProcTotal and the fact that all other numbers in MiscStat are processes only (not threads) this is implying I have 1095 processes when I do not. Thread count varies greatly. If I run it just a few minutes later it's showing 1127:

0.22 0.27 0.22 1/1127 5877

Also if we actually count the processes in my system we find we only have a bit over 300:

%> $ ps -A | wc -l
307

Expected behavior

I would expect ProcsTotal to return the total number of processes not processes and threads. I don't know if there is something else that can be read in /proc that gives just the total number of processes (I couldn't find anything) or if the attribute needs to be deprecated and removed for being in-accurate.

I came across this due to OpenTelemtry using this struct to help measure the number of processes in the system and getting wildly high numbers. It actually calculates this number itself by reading all directories under /proc that are number. But then it only uses that as a starting place and offsets it with what this struct is providing saying:

Processes are actively changing as we run this code, so this reason the above loop will tend to underestimate process counts. getMiscStats is a single read/syscall so it should be more accurate.

Based on that I don't think the OpenTelemetry strategy will be sufficient to get an accurate process count.

Environment (please complete the following information):

$ cat /etc/os-release
NAME="Pop!_OS"
VERSION="22.04 LTS"
ID=pop
ID_LIKE="ubuntu debian"
PRETTY_NAME="Pop!_OS 22.04 LTS"
VERSION_ID="22.04"
HOME_URL="https://pop.system76.com"
SUPPORT_URL="https://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=jammy
UBUNTU_CODENAME=jammy
LOGO=distributor-logo-pop-os
$ uname -a
Linux personal 6.6.10-76060610-generic #202401051437~1704728131~22.04~24d69e2 SMP PREEMPT_DYNAMIC Mon J x86_64 x86_64 x86_64 GNU/Linux
shirou commented 4 months ago

As I understand it, your argument is as follows.

The ProcsTotal of MiscStat returned by load.Misc() is supposed to return the number of Processes that do not include Threads by name, but actually returns a value that does include Threads. This is because /proc/loadavg is read in load.Misc(), and the value of this loadavg contains threads.

After reading man proc, I agree with you. However, it seems to me that the only way to get a pure process count that does not include threads is to count /proc as you write, or call the ps command. Is there any other way?

I will try to find a better way, but if you know of any, I would appreciate it if you could let me know.

Thank you.

eric1234 commented 4 months ago

However, it seems to me that the only way to get a pure process count that does not include threads is to count /proc as you write, or call the ps command. Is there any other way?

I don't think so. I also searched through the man page as well as did some general web searches and the general consensus is to count /proc as gopsutil is already doing with readPidsFromDir.

OpenTelemetry expressed concern about this strategy being accurate due to changing processes while the count is going on. This initially made me think this wasn't a viable strategy.

But it's also trying to collect status information on the pids as it's going. If we are just trying to get a process count then it's really just the one call to Readdirnames which if we dig down into go seems to be just a single syscall also so I think in this context it would be accurate.

I would be open to putting together a PR that refactors load.Misc to instead using the existing readPidsFromDir instead of the loadavg file. This should mean gopsutil provides the same interface externally but will report an accurate total process count. Let me know if that sounds good and I'll work on it.

shirou commented 4 months ago

Great to hear that! There does not seem to be a way to get the exact value in the proc file system, using readPidsFromDir seems like a good way to get number of processes.

I think it would be better to move readPidsFromDir() to internal/common/common_linux.go and call it from the process and load packages. If you could create a PR, it would be greatly appreciated. Thank you!

eric1234 commented 4 months ago

Set down to implement this and it looks like there is already a different method in common_linux.go that counts the number of processes by reading the directories entries that look like a number. NumberProcWithContext is very similar to the readPidsFromDir. I think the only main difference is the former is based around a context while the later is based around a path.

I'm going to work on making getProcsTotal read from the existing NumberProcWithContext already in common. That will fix my issue. Then as a separate bit of work if we wanted to DRY up NumberProcWithContext and readPidsFromDir that could be done.