[Windows] `process_iter()` is 10x slower when running from non-admin account

giampaolo / psutil

Cross-platform lib for process and system monitoring in Python

BSD 3-Clause "New" or "Revised" License

[Windows] `process_iter()` is 10x slower when running from non-admin account #2366

Closed smihaila closed 3 weeks ago

smihaila commented 9 months ago

Summary

OS: Windows 11
Architecture: 64bit
Psutil version: 5.9.8
Python version: 3.12.1
Type: performance

Description

Running the following Python code from a non-admin Windows user account takes about 400 ms. Running the same code from the same Windows user account, but through an admin/elevated cmd.exe command prompt, takes about 38-40 ms, which is 10x faster.

The total exec time does not seem to be influenced by whether the process searched for, is currently running or not. Also, when the process searched for is running, it's always only 1 instance of it, so no multiple processes of the same name.

import psutil

p: psutil.Process | None = next(
    (p for p in psutil.process_iter(attrs=["name"]) if p.name() == "dbengprx.exe"),
    None)

The code above is invoked indirectly, via Uvicorn ASGI web application server (configured with only 1 worker) and FastAPI web api framework.

Same machine, same virtual memory usage, and same list of processes in both cases: about 220 processes showing in TaskMgr / tasklist.

There is something that makes psutil's Process item generator go 10x slower in non-admin mode, than in admin mode.

Thank you.

tduarte-dspc commented 8 months ago

On my Windows 11 machine, it takes 5 to 8 seconds to iterate over 427 processes. I'm running Python 3.11.3. It's a company laptop, but I'm the administrator. The CPU is a 12th Gen Intel i7-12800H.

import time
from typing import Tuple

import psutil

def proc_by_name(process_name: str) -> Tuple[bool, int]:
    start = time.time()
    for proc in psutil.process_iter(attrs=["name"]):
        if process_name in proc.name():
            print("On success took:", time.time() - start, "seconds")
            return True, proc.pid

    print("On failure took:", time.time() - start, "seconds")
    return False, 0

if __name__ == "__main__":
    running, pid = proc_by_name("unknown.exe")
    print(f"unknown.exe: {running=}, {pid=}")

    running, pid = proc_by_name("chrome.exe")
    print(f"chrome.exe: {running=}, {pid=}")

    print(len(list(psutil.process_iter(attrs=["name"]))))

An iteration on the terminal:

On failure took: 5.50777530670166 seconds
unknown.exe: running=False, pid=0
On success took: 0.7929043769836426 seconds
chrome.exe: running=True, pid=3024
427

giampaolo commented 8 months ago

This is known. Certain APIs have 2 implementations, a fast one and a slow one.

The fast one is attempted first, but it requires more privileges, and hence often fails with AccessDenied for processes not owned by the current user or with low PIDs. The slow implementation is used as a fallback: it's slower, but it manages to return info instead of raising AccessDenied. This is why running a benchmark as a super user vs. a normal user produces different results. This is the best psutil can do in this regard, and there's nothing we can do about it (well, by installing psutil as a service / driver perhaps that'd be possible, but that's another story).
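The fast-then-fallback pattern described above can be sketched in a few lines. This is only an illustration, not psutil's actual code: `fast_query`, `slow_query`, and the local `AccessDenied` class are hypothetical stand-ins, and the set of "owned" PIDs merely simulates the permission check.

```python
class AccessDenied(Exception):
    """Stand-in for a permission error (not psutil's own exception class)."""


def fast_query(pid, owned_pids):
    # Hypothetical fast path: succeeds only for processes we may open.
    if pid not in owned_pids:
        raise AccessDenied(pid)
    return {"pid": pid, "path": "fast"}


def slow_query(pid):
    # Hypothetical slow path: always answers, but is assumed to cost more.
    return {"pid": pid, "path": "slow"}


def query(pid, owned_pids):
    """Try the fast path first and fall back to the slow one on AccessDenied."""
    try:
        return fast_query(pid, owned_pids)
    except AccessDenied:
        return slow_query(pid)


if __name__ == "__main__":
    owned = {100, 101}
    for pid in (100, 4):
        print(query(pid, owned))
```

The fewer processes the caller is allowed to open, the more often the fallback runs, which is consistent with the admin vs. non-admin difference reported in this issue.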

Some examples of "dual" APIs are:

smihaila commented 8 months ago

Thank you, @giampaolo. I wasn't aware of a dual implementation driving parts of the psutil package. Now that you have explained it, it makes perfect sense.

Now, assuming that the process I wish to test the existence for (and getting additional info from, such as virtual mem usage metrics, or say IPv4 TCP sockets opened), is always owned by the same user account invoking psutil (with 2 sub-cases: such user account being LOCAL SERVICE or a normal non-system user account), I have a rather stupid question:

Is there a way to check solely for such a process, and get info about it, faster than with the psutil.process_iter() generator plus filtering logic? We know the faster Win32 API is always leveraged in such a case, but can it be made even faster by querying only for a specific process name? Or would the perf gain of a "more focused" query be minimal? It's like the difference, at the Win32 API / C++ level, between finding a process by name and enumerating all processes, which is non-negligible.

As @tduarte-dspc just demonstrated very concisely (and even when running under an admin account, which presumably engages the faster API even if such an account is not LOCAL SERVICE or NT AUTHORITY\SYSTEM), enumerating all running processes to arrive at a negative / not found result is always noticeably slower than the positive / process found case. So, can the response time be made constant across both the "not found" and "found" cases?

Thank you.

giampaolo commented 8 months ago

It depends on what criteria you use to identify the process you're looking for. Is it based on cmdline() (which has dual implementation)?

E.g. cmdline() has dual implementation, but username() doesn't. Assuming username() never fails with AccessDenied (which I don't know), and assuming you pre-emptively know that the process you're looking for is owned by your user, perhaps you can do something like (not tested):

import psutil, os

myuser = os.getlogin()
mycmdline = ["C:\\python310\\python.exe", "foo.py"]

for p in psutil.process_iter():
    try:
        if p.username() == myuser and p.cmdline() == mycmdline:
            print(f"found {p}")
    except psutil.Error:
        pass

With that said (mostly note to self): it would make sense to debug-log APIs which use the dual implementation, so one can identify performance bottlenecks by running psutil in PSUTIL_DEBUG mode: https://psutil.readthedocs.io/en/latest/#debug-mode
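For reference, the debug mode documented at the link above is enabled through the PSUTIL_DEBUG environment variable, which has to be set before psutil is imported. A minimal sketch, assuming a hypothetical script path supplied by the caller, is to launch the script in a child interpreter with the variable set. What exactly gets logged depends on the psutil version in use.

```python
import os
import subprocess
import sys


def run_with_psutil_debug(script_path):
    """Run a Python script in a child process with PSUTIL_DEBUG=1 set."""
    # Copy the current environment and add the debug switch on top of it.
    env = dict(os.environ, PSUTIL_DEBUG="1")
    return subprocess.run([sys.executable, script_path], env=env, check=False)
```

Any debug messages psutil emits then appear on the child's stderr, next to the script's normal output.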

smihaila commented 8 months ago

Well, my question was mostly about finding a way to avoid iterating through the list of all processes, i.e. how to avoid for p in psutil.process_iter(): [...], via some hypothetical psutil.get_process_info(processName).

Probably it's not supported in the current psutil implementation. That's fine @giampaolo , and thank you for what you are doing, and for everyone's contribution to this project.

With everyone's agreement, I'll close this issue, since it has been shown to work as designed and is not a defect.

Thanks again for everybody's time, and all the best.

iglendd commented 1 month ago

Hi @giampaolo, I am not 100% sure if it is related, but a "simple" process enumeration, list(psutil.process_iter(attrs=['pid'])), under certain circumstances (non-admin and within a service) appears to trigger execution of the "slow" function psutil_get_proc_info for every process. I confirmed it in the debugger, and it makes that enumeration very slow.

However, it is not clear how the flow gets there. I can see from the code, and you confirm it above, that collecting additional attributes may trigger the slow path when the calling process is not high-privileged. A simple enumeration does not appear to do that, yet I can see it happening in the debugger. Because I do not have debug symbols, I cannot say what calls that function.

The problem is that on a machine with 400 processes, process iteration under a non-admin service is 200x slower. If I switch from psutil 5.9.0 to 6.0.0 (I have seen a release note stating a 20x improvement), I actually see that improvement, but ONLY that 20x improvement; a 10x overhead is still there.

So it is not clear what in psutil.process_iter(attrs=['pid']) would call a slow function, and why. Maybe you have some idea? Perhaps there are debug symbols which would help me.

Thank you.

giampaolo commented 1 month ago

> Hi @giampaolo, I am not 100% sure if it is related, but a "simple" process enumeration, list(psutil.process_iter(attrs=['pid'])), under certain circumstances (non-admin and within a service) appears to trigger execution of the "slow" function psutil_get_proc_info for every process. I confirmed it in the debugger, and it makes that enumeration very slow.

If never used before, psutil.process_iter() creates a Process() instance for every PID, which internally retrieves the process creation time (see source).

Process creation time on Windows uses a dual-implementation (see my previous comment). If the first (fast) method fails due to insufficient permissions, a second (much slower) method is attempted (source).

If you iterate over psutil.process_iter() a second time, the creation time won't be fetched again because it is cached. This means that things are slow only for scripts that iterate over all PIDs once (which is very common) instead of in a loop / htop style, as in:

for p in psutil.process_iter():
    if p.name() == "myapp.exe":  
        print("found it!")
        break
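To observe the caching effect described above, one could time several consecutive passes within the same interpreter. The helper below is a sketch: `time_passes` is a hypothetical name, and it accepts any factory returning an iterable so that it can run without psutil installed.

```python
import time


def time_passes(iter_factory, passes=2):
    """Return the wall-clock duration, in seconds, of each full pass."""
    durations = []
    for _ in range(passes):
        start = time.perf_counter()
        for _item in iter_factory():
            pass  # consume the iterator without doing any extra work
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in iterable so the sketch runs without psutil installed.
    print(time_passes(lambda: range(100_000)))
```

With psutil installed, `time_passes(lambda: psutil.process_iter(["name"]))` would time two consecutive enumerations; per the explanation above, the second pass is expected to be cheaper because the creation times are already cached.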

With that said, I recently bumped into a comment on X, which made me realize that this is more serious than I thought: https://x.com/adrianthonig/status/1830946204298952966?s=46&t=kzFa9FOgZhZunDU2HZ4TTA The pain seems real. Perhaps we could work around this by NOT invoking the "slow method" in Process.__init__. This may (or may not) have some repercussions on the is_running() implementation though, because it relies on the process creation time, so the change is not so obvious.

> The problem is that on a machine with 400 processes, process iteration under a non-admin service is 200x slower. If I switch from psutil 5.9.0 to 6.0.0 (I have seen a release note stating a 20x improvement), I actually see that improvement, but ONLY that 20x improvement; a 10x overhead is still there.

The speedup you're seeing is due to https://github.com/giampaolo/psutil/issues/2396 (merged in 6.0.0). Basically process_iter() used to call create_time() twice per PID, now only once.

ThoenigAdrian commented 1 month ago

I ran some tests to confirm your theory. I modified the create time to have some timing outputs and a way to know which path is taken.

    @wrap_exceptions
    def create_time(self):
        import time
        # Note: proc_times() not put under oneshot() 'cause create_time()
        # is already cached by the main Process class.
        try:
            start = time.time()
            user, system, created = cext.proc_times(self.pid)
            print("NOT USING FALLBACK FASST")
            print(time.time() - start)
            return created
        except OSError as err:
            if is_permission_err(err):
                start = time.time()
                print("USING FALLBACK SLOWW")
                x =  self._proc_info()[pinfo_map['create_time']]
                print(time.time() - start)
                return x
            raise

USING FALLBACK SLOWW
0.15603852272033691
USING FALLBACK SLOWW
0.06301569938659668
USING FALLBACK SLOWW
0.06301546096801758
NOT USING FALLBACK FASST
0.0
USING FALLBACK SLOWW
0.062015533447265625
USING FALLBACK SLOWW
0.12203025817871094
USING FALLBACK SLOWW
0.0760183334350586
NOT USING FALLBACK FASST
0.0
USING FALLBACK SLOWW
0.06147432327270508
NOT USING FALLBACK FASST
0.0
NOT USING FALLBACK FASST
0.0
USING FALLBACK SLOWW
0.05701589584350586
NOT USING FALLBACK FASST
0.0
USING FALLBACK SLOWW
0.060014963150024414
NOT USING FALLBACK FASST
0.006000995635986328

So it seems that it uses the fallback method for some processes and not for others. When it uses the fallback method, it does indeed take much longer. Presumably this happens when the processes are higher-privileged ones (not sure)?

@giampaolo Do you have an idea why the fallback method is so slow? I understand your concern about the is_running method depending on the creation time of the process and therefore needing this information. So one idea would be to make the fallback method faster or to replace it with a different (faster) implementation.

Alternatively, you could create a process_iter_fast method that prints a warning that the is_running method might not work, so that people whose use case doesn't need it have a faster alternative.

If you're curious here is my ctypes implementation I'm using right now.

https://gist.github.com/ThoenigAdrian/b12bb7e6c438fd4f7a7e56c67a294484

import ctypes
import ctypes.wintypes

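# Load the required libraries (use_last_error=True so that
# ctypes.get_last_error() reports the real Win32 error code)
psapi = ctypes.WinDLL('Psapi.dll', use_last_error=True)
kernel32 = ctypes.WinDLL('kernel32.dll', use_last_error=True)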

# Define constants
PROCESS_QUERY_INFORMATION = 0x0400
PROCESS_VM_READ = 0x0010
MAX_PATH = 260

def get_pids_by_name_fast(process_name):
    process_name = process_name.encode('utf-8')
    pids = []

    # Allocate an array for the process IDs
    array_size = 1024
    pid_array = (ctypes.wintypes.DWORD * array_size)()
    bytes_returned = ctypes.wintypes.DWORD()

    # Call EnumProcesses to get the list of process IDs
    if not psapi.EnumProcesses(ctypes.byref(pid_array), ctypes.sizeof(pid_array), ctypes.byref(bytes_returned)):
        raise ctypes.WinError(ctypes.get_last_error())

    # Calculate the number of processes
    num_pids = bytes_returned.value // ctypes.sizeof(ctypes.wintypes.DWORD)

    # Iterate over all the process IDs
    for pid in pid_array[:num_pids]:
        # Open the process with necessary privileges
        h_process = kernel32.OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, False, pid)
        if h_process:
            exe_name = (ctypes.c_char * MAX_PATH)()
            h_module = ctypes.wintypes.HMODULE()
            needed = ctypes.wintypes.DWORD()

            # Get the first module, which is the executable
            if psapi.EnumProcessModules(h_process, ctypes.byref(h_module), ctypes.sizeof(h_module), ctypes.byref(needed)):
                psapi.GetModuleBaseNameA(h_process, h_module, ctypes.byref(exe_name), ctypes.sizeof(exe_name))
                if exe_name.value.lower() == process_name.lower():
                    pids.append(pid)

            kernel32.CloseHandle(h_process)

    return pids

# Example usage:
process_name = "python.exe"
matching_pids = get_pids_by_name_fast(process_name)

print(f"PIDs for processes named '{process_name}': {matching_pids}")

iglendd commented 1 month ago

@ThoenigAdrian the fallback method is very slow because it effectively asks the kernel to retrieve ALL processes running in the system, with many details. Some of it, as @giampaolo said, is cached. In 6.0.0 I do see the very first iteration take 30 seconds and subsequent iterations take "only" 1.5 seconds.

For my code, the logic of getting process times for every process is unfortunate, since:

- We have our own map of process ID to process name, which we fill up with a (seemingly innocuous) psutil.process_iter(attrs=['pid', 'name']) call. Ideally, and in theory, that kind of information does not need privileges or a fallback to a slow query.
- Since we track only some of the processes by name, we can check our map and only then try to collect the additional information, which could be OK to be a bit slower since we need it only for a handful of processes, not all.
- Moreover, the processes which are usually tracked are not high-privilege processes.

The bottom line is that, for our kind of logic, the current implementation of process iteration under a lower-privilege process is a huge waste. I guess changing that iteration is now impossible because it would break old code and a lot of existing logic. Perhaps there could be an alternative method to get a pid->name map? We could of course call Windows APIs directly, but that would kind of defeat the purpose and convenience of psutil.

iglendd commented 1 month ago

And thank you very much for the reply 🙇

iglendd commented 1 month ago

@ThoenigAdrian I replied before I had read your whole comment. Regarding your get_pids_by_name_fast: during this investigation I discovered an interesting function (a syscall, actually) used by psutil to retrieve the process name by PID completely WITHOUT the traditional opening of a process handle. It is used in the Process's exe() method (which calls the psutil_proc_exe() function). I am not sure if it is faster than opening a process and getting its name via a classical API call. However, I feel that the method is impractical as a solution, since in order to call it one needs to instantiate the Process by ID, which is the root cause of the performance bottleneck (or could be, in the context of this discussion).

Also, @ThoenigAdrian, your get_pids_by_name_fast function effectively implements the suggestion from my comment 👍

> Perhaps there could be an alternative method to get a pid->name map? We could of course call Windows APIs directly, but that would kind of defeat the purpose and convenience of psutil.

It would be interesting to hear @giampaolo's thoughts on whether that style of process data collection can be naturally and organically added to the existing API styles 🙇

iglendd commented 1 month ago

@giampaolo I want to mention one more thing which I do not know if it is relevant to the way you think about these issues. Per your comment above

> This means that things are slow only for scripts that iterate over all PIDs once (which is very common) instead of in a loop / htop style, as in:
>
>     for p in psutil.process_iter():
>         if p.name() == "myapp.exe":
>             print("found it!")
>             break

This approach could indeed help in many cases, but unfortunately it would not work when you want to collect information for all chrome.exe processes, for example.

clarkb7 commented 1 month ago

> So it seems that it uses the fallback method for some processes and not for others. When it uses the fallback method, it does indeed take much longer. Presumably this happens when the processes are higher-privileged ones (not sure)?

create_time calls OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION), but non-admin users don't have this right for processes that are running as different users, whereas Administrators are granted it for other processes.

So when run as non-admin in your Desktop session, you'll get ACCESS_DENIED for all the service processes (and any other processes running in a different session or as a different user). And when run as a non-admin user as a service, you'll get ACCESS_DENIED for all the service processes AND all the Desktop processes. Each access-denied error triggers the "slow fallback". The problem is even worse on terminal services servers, where many users are logged into different sessions and there are many more processes that will return ACCESS_DENIED.

It seems like the execution time could be improved if the NtQuerySystemInformation(SystemProcessInformation) fallback was able to be cached at a higher level than per Process object. Maybe process_iter could call it once and then feed the result to the Process objects it creates?
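The idea of calling the expensive system-wide query once and sharing the result can be sketched as follows. This is not psutil code: `SnapshotCache` and `fake_system_snapshot` are hypothetical names, and the fake snapshot stands in for a single NtQuerySystemInformation(SystemProcessInformation) call.

```python
class SnapshotCache:
    """Fetch a system-wide snapshot once and serve per-PID lookups from it."""

    def __init__(self, take_snapshot):
        # take_snapshot: hypothetical callable returning {pid: info_dict}
        self._take_snapshot = take_snapshot
        self._snapshot = None
        self.snapshot_calls = 0

    def info(self, pid):
        if self._snapshot is None:
            # The expensive call happens at most once per cache instance.
            self._snapshot = self._take_snapshot()
            self.snapshot_calls += 1
        return self._snapshot.get(pid)


def fake_system_snapshot():
    # Stand-in for one expensive query covering every process.
    return {4: {"create_time": 1.0}, 812: {"create_time": 2.5}}


if __name__ == "__main__":
    cache = SnapshotCache(fake_system_snapshot)
    for pid in (4, 812, 4):
        print(pid, cache.info(pid))
    print("expensive calls:", cache.snapshot_calls)
```

The trade-off is staleness: such a cache would have to be scoped to a single enumeration and discarded afterwards, otherwise later lookups would see outdated process data.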

giampaolo commented 1 month ago

> Perhaps there could be an alternative method to get a pid->name map?

> It would be interesting to hear @giampaolo's thoughts on whether that style of process data collection can be naturally and organically added to the existing API styles 🙇

-1

Adding a function returning a pid->name mapping would probably cover the most common use case, but it wouldn't be a generic enough solution. E.g. one may want to filter for name + username. Adding exe and cmdline to the mix is also common, see examples in doc: https://psutil.readthedocs.io/en/latest/#recipes

To clarify: we fetch create_time in Process.__init__, and for all PIDs, for 2 reasons:

1) to raise NoSuchProcess if the PID does not exist
2) to store the creation time in order to recognize PID reuse later (if ever)

If psutil.process_iter() is used only once, or if PID reuse is never needed (aka, you'll never call kill() or similar methods), fetching the creation time in Process.__init__ is effectively unnecessary.

Perhaps a possible solution would be adding a new check_pid_reuse=True parameter to psutil.Process() and psutil.process_iter(). If False, create_time won't be fetched, and PID existence could be checked via psutil.pid_exists(), or not checked at all via other means (e.g. a special int-like PID arg passed by process_iter() and recognized by Process.__init__).

Perhaps the new parameter may even default to False (don't check for PID reuse). I got this feeling after I discovered that htop itself does not care about checking for PID reuse on kill(), see the discussion I started at https://github.com/htop-dev/htop/issues/1441. Maybe psutil should not care either, at least by default, since it's much more common for psutil users to just read process info (name(), cpu_times(), etc.) than to use "write" methods like kill(), setting cpu_affinity(), etc.

The doc may clarify this by stating:

<<If you plan on using kill(), terminate(), ... methods of the Process() class, use psutil.Process(pid, check_pid_reuse=True) or psutil.process_iter(check_pid_reuse=True). This will guarantee that you won't accidentally kill or interact with the wrong PID>>

I see 3 downsides of this solution, not really real blockers, just mentioning them for completeness and personal brainstorming:

1) a new parameter makes the API more complicated, but I can live with that
2) a check_pid_reuse parameter sounds like PID reuse is checked for all Process methods, while it's not
3) perhaps it would require a major version bump (7.0.0)

Any comment is welcome =)

iglendd commented 1 month ago

@giampaolo thank you for your openness and invitation for a dialog. I probably will need to sleep on it to have a more sensible reply but here are my few cents (please forgive me I am thinking aloud).

I read the htop discussion thread, and a comment that in theory the OS is not inclined to reuse a PID if the process is gone and started in quick succession struck me as wishful thinking. I am not sure about Linux, but on Windows (and anecdotally) I have seen cases where this "hope" is patently wrong. With some tests which create and destroy processes at moderate speed, I saw in the past quite frequent and rapid PID reuse.
I am also "challenging" the utility of collecting process times for the sake of killing the "right" process. And not only on the basis of the htop discussion thread comment which mentions impossibility to enforce 100% certainty of killing the same process, which is true, but also because of direct implications of impossibility to get process time via fast API. In short my point is that if we cannot get process times because we cannot open the handle with very limited rights then most likely we cannot kill it anyway and the fear (and the argument) of killing the wrong process, at least in this specific case is unfounded.
Regarding the suggested check_pid_reuse it is a reasonable approach. But please again consider an alternative point. Let's say we add a new primitive called ProcessName (perhaps there is a better name) which specializes in rapid process name enumeration implicitly. It is fast and easy to use. It is very limited on how to enumerate, sort or filter yet. However, it can be plugged into Process list/universe via a single API call e.g. which would utilize existing code 100% without breaking or any changes. Accordingly you get the best of both worlds, IMHO.
Last point which is not directly related to the subject but worth mentioning just in case. In Windows 10 Redstone 2 (April 2017 I think) Microsoft has added in the kernel ProcessStartKey field. I think they represent a truly monotonic and unique number/time from the start OS boot time. From user mode you can access it via process handle either with ProcessTelemetryIdInformation or via ProcessSequenceNumber.

Thank you and best regards 🙏.

giampaolo commented 1 month ago

> I read the htop discussion thread, and a comment that in theory the OS is not inclined to reuse a PID if the process is gone and started in quick succession struck me as wishful thinking. I am not sure about Linux, but on Windows (and anecdotally) I have seen cases where this "hope" is patently wrong. With some tests which create and destroy processes at moderate speed, I saw in the past quite frequent and rapid PID reuse.

Interesting. This message seems to confirm what you say https://superuser.com/a/937134, but it's unclear how he tested this. Also it's unclear after how much time (if any) Windows can re-assign the same PID, which is the key point here.

To clarify: on Linux the creation time has a 2-digits precision (.66 in this example):

>>> import psutil
>>> psutil.Process().create_time()
1727641709.66
>>>

That means that if PID disappears at 0.66 seconds, and a new process with the same PID appears at 0.67 seconds (0.01 secs later), then psutil will be able to detect it's a different process, and hence it won't allow killing it. To put it another way, the (admittedly unverified) assumption here is that the Linux kernel won't recycle the same PID in such a short time (0.01 secs), but it will pick up a different PID instead.
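The check described above amounts to treating the pair (PID, creation time) as the process identity. A minimal sketch, using hypothetical names (`ProcIdent`, `is_same_process`) rather than psutil's internal implementation:

```python
from typing import NamedTuple


class ProcIdent(NamedTuple):
    """Identity of a process as remembered at first sight."""
    pid: int
    create_time: float


def is_same_process(remembered: ProcIdent, current: ProcIdent) -> bool:
    """True only if both the PID and the creation time match."""
    return remembered == current


if __name__ == "__main__":
    remembered = ProcIdent(pid=1234, create_time=1727641709.66)
    # Same PID, but created 0.01 seconds later: treated as a different process.
    reused = ProcIdent(pid=1234, create_time=1727641709.67)
    print(is_same_process(remembered, remembered))
    print(is_same_process(remembered, reused))
```

The scheme only fails if a PID is recycled within the same time tick, which is the race window discussed in this comment.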

psutil makes the same assumption on Windows, but right now I can't check what the time precision is there, nor do we know exactly how PIDs are assigned (I couldn't find any useful info).

It must be noted that the maximum number of PIDs also matters here: when the OS runs out of PIDs it will restart from 0. Therefore the smaller the max PID, the more likely it is to hit the 0.01 secs window mentioned above, thus breaking psutil's algorithm. FWIW, on Ubuntu 22.04 the max PID is about 4.2 million, which appears quite high (again, no info about Windows):

$ command cat /proc/sys/kernel/pid_max
4194304

So yes, psutil algo is technically racy, but practically speaking it should be "good enough" in most cases, and "better than nothing" in the worst case.

> In short my point is that if we cannot get process times because we cannot open the handle with very limited rights then most likely we cannot kill it anyway and the fear (and the argument) of killing the wrong process, at least in this specific case is unfounded.

Very good point, and I agree with you. For the time being I think the quicker way to solve this issue on Windows is to only use the "fast" create time method in Process.__init__. All ADMIN processes will have the creation time unset due to Access Denied, but that is fine because you won't be able to kill() them anyway. So in the end it won't make any difference.

> Last point which is not directly related to the subject but worth mentioning just in case. In Windows 10 Redstone 2 (April 2017 I think) Microsoft has added in the kernel ProcessStartKey field.

Excellent, thanks for letting me know. Since on Windows it's less clear how PIDs are assigned, this API looks particularly useful. We may determine API existence at runtime and do (in pseudo code):

def unique_process_ident(pid):
    if WIN_VER >= 10:
        return (pid, ProcessStartKey(pid))
    else:
        return (pid, fast_creation_time(pid))

> Let's say we add a new primitive called ProcessName (perhaps there is a better name) which specializes in rapid process name enumeration implicitly.

Are you proposing something like this?

>>> psutil.pids_names_map()
{1: "foo.exe", 2: "bar.exe", ...}

Please note that if we use the fast create time method in Process.__init__ as I described above the slowdown issue should already be solved. That would make psutil.pids_names_map() basically as fast as list(psutil.process_iter(["name"])).

giampaolo commented 1 month ago

> For the time being I think the quicker way to solve this issue on Windows is to only use the "fast" create time method in Process.__init__. All ADMIN processes will have the creation time unset due to Access Denied, but that is fine because you won't be able to kill() them anyway. So in the end it won't make any difference.

I have created PR https://github.com/giampaolo/psutil/pull/2444 which implements exactly this. This should solve the severe performance issue described in here.

@ThoenigAdrian if you have a chance to test this PR, please report back here. You'll need Visual Studio installed in order to compile psutil. I believe GitHub CI also stores the binary wheels somewhere, but I can't remember where.

iglendd commented 1 month ago

> > I read the htop discussion thread, and a comment that in theory the OS is not inclined to reuse a PID if the process is gone and started in quick succession struck me as wishful thinking. I am not sure about Linux, but on Windows (and anecdotally) I have seen cases where this "hope" is patently wrong. With some tests which create and destroy processes at moderate speed, I saw in the past quite frequent and rapid PID reuse.

> Interesting. This message seems to confirm what you say https://superuser.com/a/937134, but it's unclear how he tested this. Also it's unclear after how much time (if any) Windows can re-assign the same PID, which is the key point here.

Indeed. In a previous cybersecurity company, PID reuse triggered an internal kernel process cache bug, and my coworker investigated it by creating a tight loop of process creation, where each process exited immediately (this was a while ago). Within minutes, he encountered multiple collisions, if my memory serves me right.

Without consulting Microsoft or reverse-engineering Windows, it is hard to determine the likelihood or chances of PID reuse per unit of time under different load levels. One thing is certain, though: we cannot infer the minimum granularity of time resolution, which, if I understand your reply correctly, could be done. Windows time-related APIs in user mode (outside of a few C-runtime wrappers) and 100% in the kernel use the FILETIME structure, which provides time in 100-nanosecond intervals (and this is documented here: https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-filetime). I don't believe this is guaranteed under all circumstances, but it's the theoretical foundation.
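For comparison with the two-digit timestamps shown earlier for Linux, a FILETIME value counts 100-nanosecond intervals since January 1, 1601 (UTC), so converting it to a Unix timestamp is a subtraction and a division. The helper name below is an assumption made for illustration only.

```python
# 1601-01-01 to 1970-01-01 is 11,644,473,600 seconds, expressed here in 100 ns ticks.
EPOCH_DIFF_TICKS = 116_444_736_000_000_000
TICKS_PER_SECOND = 10_000_000


def filetime_to_unix(filetime_ticks: int) -> float:
    """Convert a 64-bit FILETIME tick count to seconds since the Unix epoch."""
    return (filetime_ticks - EPOCH_DIFF_TICKS) / TICKS_PER_SECOND


if __name__ == "__main__":
    # One second after the Unix epoch, expressed as a FILETIME tick count.
    print(filetime_to_unix(EPOCH_DIFF_TICKS + TICKS_PER_SECOND))
```

Note that the resolution of the format (100 ns) says nothing about how precisely the kernel actually records process start times.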

Additionally, in systems where numerically based resources are allocated and released in a stack-like manner, rapid reuse of numbers when things are allocated and released quickly could result in giving back the same resource/number, unless explicit measures are taken to avoid it. I wouldn't necessarily say this is uncommon. I think it's a rather pesky problem for any software trying to cache process information, possibly even including Microsoft's own cybersecurity products. I suspect this is why they extended one of the crucial process kernel structures to include the truly unique ID I mentioned earlier.

> It must be noted that the maximum number of PIDs also matters here: when the OS runs out of PIDs it will restart from 0. Therefore the smaller the max PID, the more likely it is to hit the 0.01 secs window mentioned above, thus breaking psutil's algorithm. FWIW, on Ubuntu 22.04 the max PID is about 4.2 million, which appears quite high (again, no info about Windows):

On Windows, the kernel uses 64-bit process IDs (PIDs), while the user mode API uses 32-bit PIDs. Technically, this allows for around 4 billion PIDs. However, in my experience, I have never seen a PID with more than six digits. Without PID reuse, I doubt even systems with terabytes of RAM could handle even a tiny fraction of that number. For example, 2,000 small processes on my machine consume around 20 GB of memory. I believe many things would break before Windows could handle 100,000 processes. Over time, processes are created and destroyed continuously, and if a computer runs for a long time, the total number of processes can add up. However, since PIDs are reused, I don’t think we’ll ever see PID numbers roll over on Windows.

> Excellent, thanks for letting me know. Since on Windows it's less clear how PIDs are assigned, this API looks particularly useful. We may determine API existence at runtime and do (in pseudo code): ...

👍

> Are you proposing something like this?
>
>     >>> psutil.pids_names_map()
>     {1: "foo.exe", 2: "bar.exe", ...}
>
> Please note that if we use the fast create time method in Process.__init__ as I described above the slowdown issue should already be solved. That would make psutil.pids_names_map() basically as fast as list(psutil.process_iter(["name"])).

Yes and yes. Indeed, my suggestion is moot if process enumeration does not automatically and implicitly call the "slow" function whenever GetProcessTimes fails.

I have a few more, somewhat related thoughts and idea. I could be wrong in my assumptions to begin with., Perhaps they are more questions than suggestions.

First, let's consider a scenario during a process iteration where a particular attribute triggers the invocation of a 'slow' function that collects detailed process information. It appears that the 'slow' function, psutil_get_proc_info, retrieves information for all processes but ignores any 'expensive' data except for the specified process. I suspect that in the next iteration, if the 'slow' function is the only way to obtain the data, it would repeat the slow call, even though the 'expensive' data was just collected a microsecond earlier. If this is the case, it could be resolved by retaining the 'expensive' process data until the entire process enumeration is complete. It is almost like a oneshot call, except that an explicit oneshot is not needed here, since the process iteration itself defines a perfect and tight scope.

Second, the same applies if you want to collect process information for a batch of PIDs in rapid succession. If I want to collect all process details for all chrome.exe processes, I do not want the "slow" function, which acquires "expensive" data applicable to all processes, to be called more than once. I do not know how to set up the scope though. Oneshot would not work here. Perhaps the third point below would address it, but only if it is part of the process enumeration and not a standalone construct.

Third, I believe that a filtered process iteration could be very useful, especially in cases where I need a large set of attributes for a subset of processes (at least based on name). It would allow me to retrieve and set up 'expensive/slow' attributes in a single call when I only care about a subset of processes, rather than relying on 'fast' attributes to collect the process ID and name, and then manually filtering and opening the processes separately (which is what we're doing now, though process enumeration is not very fast yet).
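A minimal sketch of the "filter first, query later" pattern described in the third point. The helper name filter_by_name is hypothetical and not part of psutil; it only assumes that each item exposes an info dict, as the objects yielded by psutil.process_iter(attrs=[...]) do. Stand-in objects are used so the sketch runs without psutil installed.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Iterator


def filter_by_name(procs: Iterable[Any], wanted: Iterable[str]) -> Iterator[Any]:
    """Yield only the process objects whose cached name is in `wanted`.

    Hypothetical helper: each item is expected to expose an `info` dict
    that contains a "name" key.
    """
    # Compare case-insensitively, since Windows executable names are not case-sensitive.
    wanted_lower = {name.lower() for name in wanted}
    for proc in procs:
        name = proc.info.get("name")  # may be None when the name is unavailable
        if name and name.lower() in wanted_lower:
            yield proc


# Stand-in objects so the sketch runs without psutil.
fake_procs = [
    SimpleNamespace(info={"pid": 4, "name": "System"}),
    SimpleNamespace(info={"pid": 812, "name": "chrome.exe"}),
    SimpleNamespace(info={"pid": 990, "name": None}),
    SimpleNamespace(info={"pid": 1204, "name": "Chrome.exe"}),
]
matched_pids = [p.info["pid"] for p in filter_by_name(fake_procs, ["chrome.exe"])]
print(matched_pids)
```

With psutil installed, the first argument would be something like psutil.process_iter(["pid", "name"]), and the expensive attributes would then be requested only for the processes that pass the filter.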

Thank you again for this useful discussion. 🙏

giampaolo commented 1 month ago

It appears that the 'slow' function, psutil_get_proc_info, retrieves information for all processes but ignores any 'expensive' data except for the specified process. I suspect that in the next iteration, if the 'slow' function is the only way to obtain the data, it would repeat the slow call, even though the 'expensive' data was just collected a microsecond earlier.

This is correct. Internally psutil_get_proc_info on Windows retrieves multiple info for all PIDs (see source). So despite oneshot() caches multiple info on a per-process basis (1 PID), process_iter() could theoretically cache the entire psutil_get_proc_info data set for all PIDs, upfront.

This is easier said than done though. Windows is the only platform offering an API like this. As such psutil code and API evolved with the assumption that such a thing (retrieve info about all PIDs) couldn't be done. Also, process_iter() returns a generator, which means I can store psutil_get_proc_info result now, but the generator may be consumed later, and thus return outdated info.

I guess a separate brand new function could be provided, something like:

psutil.multi_proc_info()
{
    1: {"user_time": ..., "system_time": ..., ...},
    2: {"user_time": ..., "system_time": ..., ...},
    3: {"user_time": ..., "system_time": ..., ...},
    ...
}

...but it would be Windows only and sort of different than the rest of the API. Maybe it could live under a new psutil.windows namespace though. Not sure.

iglendd commented 1 month ago

I agree totally with all your points.

This is correct. Internally psutil_get_proc_info on Windows retrieves multiple info for all PIDs (see source). So despite oneshot() caches multiple info on a per-process basis (1 PID), process_iter() could theoretically cache the entire psutil_get_proc_info data set for all PIDs, upfront.

This is easier said than done though. Windows is the only platform offering an API like this. As such psutil code and API evolved with the assumption that such a thing (retrieve info about all PIDs) couldn't be done. Also, process_iter() returns a generator, which means I can store psutil_get_proc_info result now, but the generator may be consumed later, and thus return outdated info.

Very good points. A naive implementation would probably satisfy quick enumeration, but it would not work well with a generator. I am not sure about Python generator semantics. Is there a way to see that the complete enumeration has been done, so that we can drop the cached information? Perhaps we can also rely on time: if the cached information is 1/4 of a second old, get a new one?

This is correct. Internally psutil_get_proc_info on Windows retrieves multiple info for all PIDs (see source). So despite oneshot() caches multiple info on a per-process basis (1 PID), process_iter() could theoretically cache the entire psutil_get_proc_info data set for all PIDs, upfront.

This is easier said than done though. Windows is the only platform offering an API like this. As such psutil code and API evolved with the assumption that such a thing (retrieve info about all PIDs) couldn't be done. Also, process_iter() returns an iterator, which means I can store psutil_get_proc_info result now, but the iterator may be consumed later, and thus return outdated info.

I guess a separate brand new function could be provided, something like:

psutil.multi_proc_info()
{
    1: {"user_time": ..., "system_time": ..., ...},
    2: {"user_time": ..., "system_time": ..., ...},
    3: {"user_time": ..., "system_time": ..., ...},
    ...
}

Indeed, conceptually it is a cross-platform oddity. By the way, can you provide a bit more detail on what a multi_proc_info call would look like in terms of the API?

I was thinking more of oneshot-style semantics, where the scope is defined outside but internally influences the regular, existing Process method calls. What if, in addition to Process.oneshot(), we add procutil.Oneshot(process iterator or list), which would keep cross-process context AND process-private context (without their explicit definition), with overall semantics similar to oneshot (I am thinking aloud)? Over time it could hold global, per-scope knobs and caches, which could be useful on other OSes beyond automating per-process oneshot.

iglendd commented 1 month ago

Modeling, and especially retrofitting, an API is not easy. However, I want to share some of the reasons why I am eager to discuss various approaches. This is not to justify a particular implementation technique, but rather to provide additional context and perspective.

In some cases, we've observed that for customers with many running processes, process enumeration and data collection at regular intervals (every 15 seconds or every few minutes) can consume more than 50% of CPU usage, even on high-performance servers—far more than the rest of the large, busy application. When this feature is disabled, CPU usage drops to negligible levels. The root cause, which is now more apparent, is repeatedly calling for the same expensive data and discarding most of it over and over for each of the many processes.

giampaolo commented 1 month ago

Perhaps we can also rely on the time, if the cached information is 1/4 of a second old, get a new one?

Yeah, indeed. It probably means psutil.process_iter() should have an argument to tune the interval. batch_interval=0.25 or something. It'd be kind of a weird API though, and probably also not easy for the doc to explain how it works.
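To make the time-based idea concrete, here is a rough sketch of a snapshot cache that is refreshed only once it is older than a given interval. SnapshotCache, fetch and max_age are hypothetical names, not psutil API; the clock is injectable only so the behaviour can be shown without sleeping.

```python
import time
from typing import Any, Callable, Optional


class SnapshotCache:
    """Reuse one all-PIDs snapshot until it is older than `max_age` seconds."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        max_age: float = 0.25,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # hypothetical callable returning data for all PIDs
        self._max_age = max_age
        self._clock = clock
        self._snapshot: Any = None
        self._taken_at: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        # Refresh on first use, or when the cached snapshot has expired.
        if self._taken_at is None or now - self._taken_at > self._max_age:
            self._snapshot = self._fetch()
            self._taken_at = now
        return self._snapshot


# Demonstration with a fake fetch function and a fake clock.
fetch_calls = []


def fake_fetch() -> dict:
    fetch_calls.append(1)
    return {1: {"user_time": 0.0}, 2: {"user_time": 0.0}}


fake_now = [0.0]
cache = SnapshotCache(fake_fetch, max_age=0.25, clock=lambda: fake_now[0])

cache.get()          # first call fetches
fake_now[0] = 0.1
cache.get()          # 0.1 s old: reused
fake_now[0] = 0.5
cache.get()          # 0.5 s old: fetched again
print(len(fetch_calls))
```

This sidesteps the generator problem only partially: a consumer that pauses mid-iteration for longer than max_age simply triggers a new fetch on the next lookup.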

By the way can you provide a bit more details on how multi_proc_info call would look like in terms of API call.

You use NtQuerySystemInformation(SystemProcessInformation), and instead of filtering for one PID and discard the rest, you return a Python dict {pid: {...}, pid: {...}} for all PIDs.
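The difference between the two shapes can be sketched in pure Python. system_snapshot below is a stand-in for a single system-wide query and returns made-up numbers; proc_info_single and multi_proc_info are hypothetical names used only to contrast "filter for one PID and discard the rest" with "return everything once".

```python
from typing import Dict, List

snapshot_log: List[int] = []  # one entry per simulated system-wide query


def system_snapshot() -> List[dict]:
    """Stand-in for one expensive query returning a record per process."""
    snapshot_log.append(1)
    return [
        {"pid": 1, "user_time": 0.5, "system_time": 0.1},
        {"pid": 2, "user_time": 1.5, "system_time": 0.3},
        {"pid": 3, "user_time": 2.5, "system_time": 0.7},
    ]


def _without_pid(record: dict) -> dict:
    return {key: value for key, value in record.items() if key != "pid"}


def proc_info_single(pid: int) -> dict:
    """Per-PID style: take a full snapshot, keep one record, discard the rest."""
    for record in system_snapshot():
        if record["pid"] == pid:
            return _without_pid(record)
    raise LookupError(f"no such pid: {pid}")


def multi_proc_info() -> Dict[int, dict]:
    """Batch style: take one snapshot and return {pid: {...}} for all PIDs."""
    return {record["pid"]: _without_pid(record) for record in system_snapshot()}


per_pid = {pid: proc_info_single(pid) for pid in (1, 2, 3)}
calls_per_pid = len(snapshot_log)   # one snapshot per PID

snapshot_log.clear()
batched = multi_proc_info()
calls_batched = len(snapshot_log)   # a single snapshot for all PIDs

print(calls_per_pid, calls_batched, per_pid == batched)
```

Both approaches return the same data here; the batch form just pays for the system-wide query once instead of once per PID.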

What if in addition to Process.oneshot() we can add procutil.Oneshot(process iterator or list) which would keep cross-process context AND process private context (without their explicit definition) but overall semantics would be similar (I am thinking aloud)?

Hmm. Something like this?

with psutil.oneshot():
    for proc in psutil.process_iter():
        ...

Maybe. This would have the extra advantage to work with Process classes, not only with process_iter():

with psutil.oneshot():
    psutil.Process(pid)

Not bad. It's something I've been pondering for a while actually. The oddity though, is that it requires relying on a global var (psutil._ONESHOT = True), which would be checked both by psutil.process_iter() and psutil.Process(). But global vars are not thread-safe. :(

Another possible idea could be psutil.process_iter(oneshot=True). No thread-safety issues there. It would reuse an API name ("oneshot") which already exists and is already known, and which does a similar thing, which is good. The drawback is that, differently from psutil.oneshot(), it would work with psutil.process_iter() but not with psutil.Process() if used directly without passing through process_iter.

Quite a brainstorming... :)

Modeling, and especially retrofitting, an API is not easy

Definitively. Long ago I blogged about it: https://gmpy.dev/blog/2013/making-constants-part-of-your-api-is-evil. Back then it was easier to fix mistakes and break compatibility. Today we can't. If something gets in, it's not like "it's forever" but... almost. I guess sometimes I probably look excessively cautious up here, mostly because of this.

In some cases, we've observed that for customers with many running processes, process enumeration and data collection at regular intervals (every 15 seconds or every few minutes) can consume more than 50% of CPU usage, even on high-performance servers—far more than the rest of the large, busy application.

Interesting. https://github.com/giampaolo/psutil/pull/2444 should alleviate some pain, but it depends on what you're doing in your code really. If your code calls one of these methods for multiple ADMIN processes then you'll experience the slowdown, else you won't. Question is: do you really need to do that? If not, you may filter out those processes by using psutil.Process.username().

iglendd commented 1 month ago

Perhaps we can also rely on the time, if the cached information is 1/4 of a second old, get a new one?

Yeah, indeed. It probably means psutil.process_iter() should have an argument to tune the interval. batch_interval=0.25 or something. It'd be kind of a weird API though, and probably also not easy for the doc to explain how it works.

True

By the way can you provide a bit more details on how multi_proc_info call would look like in terms of API call.

You use NtQuerySystemInformation(SystemProcessInformation), and instead of filtering for one PID and discard the rest, you return a Python dict {pid: {...}, pid: {...}} for all PIDs.

I understand now, that is good

What if in addition to Process.oneshot() we can add procutil.Oneshot(process iterator or list) which would keep cross-process context AND process private context (without their explicit definition) but overall semantics would be similar (I am thinking aloud)?

Hmm. Something like this?

with psutil.oneshot():
    for proc in psutil.process_iter():
        ...

Maybe. This would have the extra advantage to work with Process classes, not only with process_iter():

with psutil.oneshot():
    psutil.Process(pid)

Not bad. It's something I've been pondering for a while actually. The oddity though, is that it requires relying on a global var (psutil._ONESHOT = True), which would be checked both by psutil.process_iter() and psutil.Process(). But global vars are not thread-safe. :(

Right, introducing implicit globals is not good. Not knowing the internal details, I was hoping that the with statement would create an implicit context that other parts of the code could automatically access. I read a little more about with, and it appears that was a pipe dream.

In my opinion, we are effectively discussing the introduction of an implicit context, to avoid passing it explicitly to the Process objects and ending up with an odd interface and explanation, especially if it would be primarily useful only on Windows. Maybe there is no elegant way of avoiding signature changes or strange constructs. I did find context managers (contextlib), which work with the with statement but would require a yield style of invocation. But I also found ContextVars (Demystifying ContextVar in Python), which in theory can facilitate context creation and propagation in the oneshot() style. What do you think? At worst, perhaps one could have an explicit context object that Process would be able to access, either because it could instantiate them or through a more explicit connection method 🤷‍♂️. Maybe Python decorators or closures could help (A Deep Dive into Python’s Decorators and Context Managers)?

Another possible idea could be psutil.process_iter(oneshot=True). No thread-safety issues there. It would reuse an API name ("oneshot") which already exists and is already known, and which does a similar thing, which is good. The drawback is that, differently from psutil.oneshot(), it would work with psutil.process_iter() but not with psutil.Process() if used directly without passing through process_iter.

Quite a brainstorming... :)

Indeed :)

In some cases, we've observed that for customers with many running processes, process enumeration and data collection at regular intervals (every 15 seconds or every few minutes) can consume more than 50% of CPU usage, even on high-performance servers—far more than the rest of the large, busy application.

Interesting. https://github.com/giampaolo/psutil/pull/2444 should alleviate some pain, but it depends on what you're doing in your code really. If your code calls one of these methods for multiple ADMIN process then you experience the slowdown, else you won't. Question is: do you really need to do that? If not, you may filter out those processes by using psutil.Process.username().

Yes, this is an interesting topic in general and I can say even more about it since there are a few interesting aspects. Perhaps my points and how I am thinking about them even though specific for our case still could be useful for overall discussion as an extra context. Please bear with me.

Indeed https://github.com/giampaolo/psutil/pull/2444 would help us, but only with one aspect: our internal pid/name caching. We use it to avoid collecting properties for processes we are not interested in. Per customer configuration, only a small subset of processes is actually collected. In some real and extreme cases we tracked "only" 80 processes out of 1500 running processes, so collecting information for the other processes would be a waste of CPU. Accordingly, we need to collect a handful of attributes for a subset of processes, and do it at different stages.
That pid/name cache refreshes less frequently (let's say every 2 min), and between these refreshes we collect Process information (RSS, cpu, handles etc) only for the processes we care about, which we track more frequently (let's say every 15 seconds). The above-mentioned https://github.com/giampaolo/psutil/pull/2444 improvement will not help here, since we go across multiple Process objects and could make many potentially slow calls. What makes things worse, and I did not realize it either, is that we make the calls from a Windows service, which runs in Session 0 with additional security isolation. If the service user is not an Admin, processes in all other sessions cannot normally be opened: they are like Admin processes for non-admin Session 0 users (I am not 100% sure about that, but the number of errors opening processes significantly exceeds the number of Admin processes). And here we go again: collecting NtQuerySystemInformation(SystemProcessInformation) for only one pid, again and again.

iglendd commented 1 month ago

@giampaolo I would not normally say this in a GitHub issue, but if you reply next week I probably will not be able to reply quickly, since I will be on PTO. But I certainly will after 🙏

giampaolo commented 3 weeks ago

I've just merged #2444, which should fix the problem described by OP. @iglendd thanks for the useful discussion. Let's continue it in https://github.com/giampaolo/psutil/issues/2454.

iglendd commented 3 weeks ago

@giampaolo, issue #2454 is related to batching/speeding up Linux API calls only, right? What about implicit batching of process information collection on Windows when the collector's privilege level is not high, as I mentioned in my latest comment? Are you still considering changes to address that in some way, or are you transferring this discussion to issue #2454? Your last comment suggests that, but the issue's title and scope seem different.

giampaolo commented 3 weeks ago

Yes, sorry, https://github.com/giampaolo/psutil/issues/2454 is for Linux (my mistake). There should be another ticket specific for Windows, basically a continuation of what me and you discussed in this ticket. I created https://github.com/giampaolo/psutil/issues/2463 for this.

iglendd commented 3 weeks ago

Thank you, I appreciate that 🙏