giampaolo / psutil

Cross-platform lib for process and system monitoring in Python
BSD 3-Clause "New" or "Revised" License
10.32k stars 1.39k forks source link

Use logind instead of utmp because of Y2038 #2300

Open aplanas opened 1 year ago

aplanas commented 1 year ago

Summary

Description

Bi-arch systems line x86-64 present the Y2038 problem, where an overflow can be produced because some glibc compatibility decissions (see https://github.com/thkukuk/utmpx/blob/main/Y2038.md for more information)

This patch uses logind from systemd instead of utmp on Linux systems, if the systemd version is support the new API (>= 254).

aplanas commented 1 year ago

This approach is using macros to share the code, but maybe is cleaner if the function is split in two: one for utmp and other for logind. Is there any preference?

giampaolo commented 1 year ago

Mmmm... what a tricky problem. In order to reproduce the problem I've tried setting the system time to year 2050, then calling psutil.users(), but it didn't work. Do you have a way to reproduce it?

~/svn/psutil {master}$ sudo date -s "2050-11-22"
[sudo] password for giampaolo: 
mar 22 nov 2050, 00:00:00, CST
~/svn/psutil {master}$ date
mar 22 nov 2050, 00:00:01, CST
~/svn/psutil {master}$ python3 -c "import psutil, datetime; print(datetime.datetime.fromtimestamp(psutil.users()[0].started))"
2023-08-22 22:12:56

How about the other platforms? Are they affected as well?

aplanas commented 1 year ago

Mmmm... what a tricky problem. In order to reproduce the problem I've tried setting the system time to year 2050, then calling psutil.users(), but it didn't work. Do you have a way to reproduce it?

Uhm ... try to set the date and reboot. IIUC the issue is when the ut_tv struct gets created. There is a bit more info here: https://www.thkukuk.de/blog/Y2038_glibc_utmp_64bit/

As you can see the ut_tv field tv_sec is then declared as int32_t, so it is signed. This can store 2**31 seconds, so around 68 years (hence 1970 + 68 = 2038). The issue should be present when it is created, not when it is compared then.

How about the other platforms? Are they affected as well?

It is a glibc issue, so that depends if they are using the same struct with int32_t or using the struct timeval ut_tv; one. I do not know : (

* On Linux we use `setutent(), getutent(), endutent()` (so different APIs): https://github.com/giampaolo/psutil/blob/77e5b7445748d30d22c0e3b2e651414da96a88b4/psutil/_psutil_linux.c#L377

That is the one that I fixed here! : )

For the rest of the platforms, if they are affected (not sure about that), this solution is not applicable as it depends on systemd.

thkukuk commented 1 year ago

Some more documentation why this is necessary: https://www.thkukuk.de/blog/Y2038_glibc_utmp_64bit/

So for linux using systemd, coreutils. procps-ng, shadow, util-linux and more already implemented support to use logind instead of utmp or accepted PRs for this. glibc developers don't plan to fix the problem but instead they want to remove support for utmp/wtmp. The situation for other OS and libraries may be different, musl libc even did never support utmp, others are 64bit clean and thus Y2038 safe.

giampaolo commented 1 year ago

If binaries / wheels are produced with systemd support, but the Linux system installing the wheel doesn't have systemd, are we sure that this patch won't cause a "sd_get_sessions symbol not found" error on psutil import?

I remember we had a similar problem with prlimit() syscall on Linux. We used #ifdefs in the C code to check whether prlimit() was available at compile time, but on some old systems (CentOS) it wasn't, so the resulting .whl could be installed but crashed with "prlimit symbol not found" on psutil import (see code). For this reason I rewrote it by using ctypes (the only part of psutil using ctypes), so that the check for prlimit() availability is done at runtime instead of compile time: https://github.com/giampaolo/psutil/pull/1879/.

I'm not sure if the same problem may arise here as well. Are there Linux distros without systemd support and / or where sd_get_sessions() is not available? Is this even a problem?

aplanas commented 1 year ago

If binaries / wheels are produced with systemd support, but the Linux system installing the wheel doesn't have systemd, are we sure that this patch won't cause a "sd_get_sessions symbol not found" error on psutil import?

It will produce this error indeed. One solution can be what you describes:

Would this a preferred approach?

giampaolo commented 1 year ago

Yes, ctypes sounds like the safest bet at the time of writing, but it has a cost: slower speed + extra and convoluted code to maintain (I am also thinking about the dual unit tests).

I am skeptical how much of a problem this is in reality. Who sets the system date to >= 2035? If done, it looks like a user error. After setting the system date to 2050 all my browser tabs automatically logged out and I could not establish new HTTPS connections, so basically a system with a date that far in the future is not usable in general.

aplanas commented 1 year ago

I am skeptical how much of a problem this is in reality. Who sets the system date to >= 2035?

It will happen automatically in <12 15 years. Systems with long time support created now will suffer of this issue eventually.

I mean, the issue to fix is not that an user will set the date to 2050 by accident or because a specific use case, it is that a system that will be build in the next 2 5 years and needs to be supported for around 10 year more will stop working because of this bug.

Edit: math is hard

aplanas commented 1 year ago

On a second thought I think that dlopen() can be a better solution here. I will try this path.

aplanas commented 1 year ago

@giampaolo I updated the PR using dlopen(). Now there is no link to libsystemd, so a Linux system without the library will no produce the error of the missing symbol. Those are now dynamically loaded.

aplanas commented 1 year ago

@giampaolo ping?

aplanas commented 1 year ago

@giampaolo did you have a chance to evaluate the new approach here?

Meanwhile we added in openSUSE the patch in the package

giampaolo commented 1 year ago

Hi there.

@giampaolo I updated the PR using dlopen(). Now there is no link to libsystemd, so a Linux system without the library will no produce the error of the missing symbol. Those are now dynamically loaded.

I'm afraid this is the case only if you install psutil from sources. In the C code you do:

#ifdef SYSTEMD_LINUX
    #include <dlfcn.h>
#endif

...but SYSTEMD_LINUX is set only when setup.py is invoked (aka installation from sources). The problem we have is with the installation of wheels / binaries. If the wheel was compiled with systemd support, but the system which installs the wheel does not have systemd support, then the user will get "symbol not found". This is the annoying problem.

You should somehow avoid the #ifdef SYSTEMD_LINUX check.

aplanas commented 1 year ago

You should somehow avoid the #ifdef SYSTEMD_LINUX check.

Yes. Dropped.

aplanas commented 1 year ago

...but SYSTEMD_LINUX is set only when setup.py is invoked (aka installation from sources). The problem we have is with the installation of wheels / binaries. If the wheel was compiled with systemd support, but the system which installs the wheel does not have systemd support, then the user will get "symbol not found". This is the annoying problem.

But now that I recall ... this part is not true.

If the library is not present in the system, the load_library function will return NULL and the code will fallback to the current code that uses utmp.

The new macro dlsym_check is checking if the symbol is present in the library, and if one of the required symbols is not available, again it will return NULL and use the same fallback.

The SYSTEMD_LINUX condition is used only to remove this part in case that we want a source code that never checks for systemd.

So, the confusion matrix is like this (C = Compile time; R = At runtime; S = Systemd is present; N = No systemd present)

You should somehow avoid the #ifdef SYSTEMD_LINUX check.

Yes. Dropped.

I re-added it for now to have the same code-base for discussion. We can drop all the systemd checks and make the systemd code path always present (with the fallback to utmp) but the reason will not be for avoiding the "symbol not found" error, as it is already not present.

giampaolo commented 1 year ago

Me: You should somehow avoid the #ifdef SYSTEMD_LINUX check. You: Yes. Dropped. You: I re-added it for now to have the same code-base for discussion.

I'm confused now. :) Can you commit the "final" code?

giampaolo commented 1 year ago

I'm confused now. :) Can you commit the "final" code?

Also can you please avoid to squash commits and force push? It's more clear to see the individual commits and how they evolve.

aplanas commented 1 year ago

Me: You should somehow avoid the #ifdef SYSTEMD_LINUX check. You: Yes. Dropped. You: I re-added it for now to have the same code-base for discussion.

I'm confused now. :) Can you commit the "final" code?

: ) lets use the current one that contain the SYSTEMD_LINUX macro condition. We can keep it, so the systemd code path can disappear, or remove it, and both code path will remain.

aplanas commented 1 year ago

Also can you please avoid to squash commits and force push? It's more clear to see the individual commits and how they evolve.

Sure! I will keep the commits.

aplanas commented 1 year ago

Also, one side question from my side: how the wheels with the binary content are created? Are build in some shared Pypi build system, or comes directly from the developer infrastructure?

giampaolo commented 1 year ago

lets use the current one that contain the SYSTEMD_LINUX macro condition. We can keep it, so the systemd code path can disappear, or remove it, and both code path will remain.

The goal here should be avoid the "symbol not found error". Can we do that without using the #ifdef SYSTEMD_LINUX check? If yes, please remove it from the PR.

Also, one side question from my side: how the wheels with the binary content are created? Are build in some shared Pypi build system, or comes directly from the developer infrastructure?

They are created on each commit by Github Actions: https://github.com/giampaolo/psutil/blob/8d0a0433493e14535a86bd86a54e177afa4ecacc/.github/workflows/build.yml#L43 ...and saved as artifacts. You can then download them with make download-wheels-github.

aplanas commented 1 year ago

The goal here should be avoid the "symbol not found error". Can we do that without using the #ifdef SYSTEMD_LINUX check? If yes, please remove it from the PR.

I am a bit confused now : ). In the comment https://github.com/giampaolo/psutil/pull/2300#issuecomment-1772159232 I explained how this error is not there. Am I missing something?

aplanas commented 1 year ago

If yes, please remove it from the PR.

The last commit is removing the SYSTEMD_LINUX macro, making the systemd and utmp path available in any binary.

giampaolo commented 1 year ago

In preparation for your PR I made some refactoring, and created a specific psutil/arch/linux/users.c to host the new changes you are introducing here: https://github.com/giampaolo/psutil/pull/2320. Unfortunately you'll have to rebase / solve conflicts. I did it now instead of laster mostly to preserve (your) history once we merge.

aplanas commented 1 year ago

Unfortunately you'll have to rebase / solve conflicts.

Thanks for the refactoring, this makes the code much more clear.

I did the rebase, but I merged my two commits in one, to simplify the rebase.

aplanas commented 1 year ago

I added some initial comments but overall it looks like a pretty good start. In addition:

1. I prefer to have 2 separate functions, both of which can be called from Python (you'll have to update users.h and remove `static` from the function definitions). This way we can test that they return the same thing (in test_linux.py) and that they have no memory leaks (in test_memleaks.py).

Should be three then? The one used for Python users (psutil_users), the one for systemd (psutil_users_systemd) and the one for utmp (psutil_users_utmp)? The issues that I see are:

2. I think you should close the handle (`dlclose(handle)`) on exit.

Done.

3. perhaps it makes more sense to check for `sd_booted()` in the `load_systemd()` function (and close the handle in there if systemd didn't boot).

OK. Done.

4. please use 4 spaces indentation

Yes. Done.

giampaolo commented 1 year ago

Should be three then? The one used for Python users (psutil_users), the one for systemd (psutil_users_systemd) and the one for utmp (psutil_users_utmp)?

There should be 2: psutil_users_systemd returning None if unavailable, and psutil_users_utmp. In Python you will call psutil_users_systemd first, and fallback to psutil_users_utmp if it returns None.

the utmp and the systemd version are not returning the same value: there are different decisions done, for example in the tty field

What are the differences between the 2 versions? I assumed they were exactly the same.

aplanas commented 1 year ago

There should be 2: psutil_users_systemd returning None if unavailable, and psutil_users_utmp. In Python you will call psutil_users_systemd first, and fallback to psutil_users_utmp if it returns None.

Done.

I do not like tho. IMHO this breaks the API (there is no users() Python function anymore for Linux) and the alternative (using user() to point psutil_user_utmp) will silently keep the Y2038 problem to the users.

What are the differences between the 2 versions? I assumed they were exactly the same.

systemd unifies all this small differences in a more coherent way (application independent)

aplanas commented 1 year ago

What I meant is adding this logic to _pslinux.py:

Outch. Of course. Done.

I now realize though that there is some logic in the python layer that is specific to the utmp implementation:

We can fix that later, in another PR

Mmm this may be a problem. Are these differences documented somewhere?

None that I am aware. In utmp some differences are per-application

I tried running your code but it fails because 'sd_session_get_leader' is missing.

Added the requested logs. This function is new, from systemd v254: https://github.com/systemd/systemd/commit/403082602d4230c224529c46e2d8a392f3a50e49

giampaolo commented 1 year ago

Code did not compile so I fixed it via the web UI in https://github.com/giampaolo/psutil/pull/2300/commits/75570aba580786b43275374fddc8c3f43a2b0579 (you should update your branch).

In addition, a test is failing but I don't have write permissions to your branch. This is the way to fix it (in addition I also added memleak tests). Could you please apply this patch and push it?

+++ b/psutil/tests/test_linux.py
@@ -1513,25 +1513,27 @@ class TestMisc(PsutilTestCase):
                 psutil._pslinux.boot_time)
             assert m.called

-    def test_users_mocked(self):
+    def test_users_utmp_mocked(self):
         # Make sure ':0' and ':0.0' (returned by C ext) are converted
         # to 'localhost'.
-        with mock.patch('psutil._pslinux.cext.users',
-                        return_value=[('giampaolo', 'pts/2', ':0',
-                                       1436573184.0, True, 2)]) as m:
-            self.assertEqual(psutil.users()[0].host, 'localhost')
-            assert m.called
-        with mock.patch('psutil._pslinux.cext.users',
-                        return_value=[('giampaolo', 'pts/2', ':0.0',
-                                       1436573184.0, True, 2)]) as m:
-            self.assertEqual(psutil.users()[0].host, 'localhost')
-            assert m.called
-        # ...otherwise it should be returned as-is
-        with mock.patch('psutil._pslinux.cext.users',
-                        return_value=[('giampaolo', 'pts/2', 'foo',
-                                       1436573184.0, True, 2)]) as m:
-            self.assertEqual(psutil.users()[0].host, 'foo')
-            assert m.called
+        with mock.patch('psutil._pslinux.cext.users_systemd',
+                        return_value=None):
+            with mock.patch('psutil._pslinux.cext.users_utmp',
+                            return_value=[('giampaolo', 'pts/2', ':0',
+                                           1436573184.0, True, 2)]) as m:
+                self.assertEqual(psutil.users()[0].host, 'localhost')
+                assert m.called
+            with mock.patch('psutil._pslinux.cext.users_utmp',
+                            return_value=[('giampaolo', 'pts/2', ':0.0',
+                                           1436573184.0, True, 2)]) as m:
+                self.assertEqual(psutil.users()[0].host, 'localhost')
+                assert m.called
+            # ...otherwise it should be returned as-is
+            with mock.patch('psutil._pslinux.cext.users_utmp',
+                            return_value=[('giampaolo', 'pts/2', 'foo',
+                                           1436573184.0, True, 2)]) as m:
+                self.assertEqual(psutil.users()[0].host, 'foo')
+                assert m.called

     def test_procfs_path(self):
         tdir = self.get_testfn()
+++ b/psutil/tests/test_memleaks.py
@@ -485,6 +485,14 @@ class TestModuleFunctionsLeaks(TestMemoryLeak):
             name = next(psutil.win_service_iter()).name()
             self.execute(lambda: cext.winservice_query_descr(name))

+    if LINUX:
+
+        def test_users_systemd(self):
+            self.execute(cext.users_systemd)
+
+        def test_users_utmp(self):
+            self.execute(cext.users_utmp)
+

 if __name__ == '__main__':
     from psutil.tests.runner import run_from_name
aplanas commented 1 year ago

(you should update your branch)

Done, I also rebased on top of the aplanas-utmp branch, that contains your code

giampaolo commented 1 year ago

From Github CI https://github.com/giampaolo/psutil/actions/runs/6614641177/job/17968190988?pr=2300:

2023-10-23T15:42:36.3752124Z psutil-debug [psutil/arch/linux/users.c:44]> missing 'sd_session_get_leader' fun
2023-10-23T15:42:36.5224487Z psutil-debug [psutil/arch/linux/users.c:44]> missing 'sd_session_get_leader' fun

This means that the new code is currently not tested. I also cannot test it locally on Ubuntu 22.04 for the same reason (missing 'sd_session_get_leader'). What distro are you on? Can you run make test and make test-memleaks?

aplanas commented 1 year ago

What distro are you on?

openSUSE Tumbleweed, but I guess Arch will also have the systemd v254 there.

Can you run make test

That fails a lot. Some of the fails are assumptions about partitions, mount points (I am using btrfs with subvolumes, etc). But from the one that is related with this PR I see one:

======================================================================
FAIL: psutil.tests.test_posix.TestSystemAPIs.test_users (psutil=[suser(name='root', terminal=None, host='192.168.122.1', started=1698081882.810051, pid=1209)], who='root     pts/0        Oct 23 19:24   .          1227 (192.168.122.1)')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/psutil/psutil/tests/test_posix.py", line 345, in test_users
    self.assertEqual(u.terminal, terminals[idx])
AssertionError: None != 'pts/0'

and make test-memleaks?

....
Ran 101 tests in 1.062s

OK (skipped=9)
SUCCESS
giampaolo commented 1 year ago

I tried playing with it a bit more and other than sd_session_get_leader I'm also missing sd_session_get_start_time and sd_session_get_username. I'm afraid it's too premature to add this right now. Most of these systemd APIs seem they were added very recently.

aplanas commented 1 year ago

Most of these systemd APIs seem they were added very recently.

All those are already present in v254

giampaolo commented 1 year ago

Let's not rush this. Considering these new APIs are literally 5 months old, it seems unlikely to me that we may cause issues to long-term support distros if we don't do this right now.

We don't have to wait years, but the minimum requirement here should be that the default Github Linux image has systemd >= v254, so that both utmp and systemd code paths are covered by unit tests. We can leave this PR open 'till then.

In the meantime, another couple of suggestions I can give for this PR:

giampaolo commented 1 year ago

I now realize though that there is some logic in the python layer that is specific to the utmp implementation: Perhaps it makes sense to translate this logic in C, and remove it from the Python layer, so that the 2 C implementations return the same things.

FYI I did this in 0c3a1c5a, so you need to update your branch again.

aplanas commented 1 year ago
* cache the `libsystemd.so.0` handle in a global var so that it's loaded only on first call

This will leak the handle, as the dlclose() cannot be called. Would this be OK?

giampaolo commented 1 year ago

Mmm no I don't think so. It will slightly increase memory usage the first time it's called, then it will just stay loaded in memory until the python process is terminated. The important thing is not to open a new handle on each call.

aplanas commented 1 year ago

The important thing is not to open a new handle on each call.

Done.

aplanas commented 1 year ago

@giampaolo I found what I think is a bug: if there is an error in the _utmp or _systemd version it will return NULL without setting the exception. The last commit is addressing this

aplanas commented 1 year ago

The test is also fixed: the issue was not code related.

For now I will partially update the patch living in the openSUSE package, and we will revisit this PR in some time in the future then.

Thanks a lot for the detailed review and all your patience working with me.

giampaolo commented 1 year ago

There's one last thing. From the man page https://manpages.debian.org/testing/libsystemd-dev/sd_session_get_username.3.en.html:

On success, sd_session_get_state(), sd_session_get_uid(), sd_session_get_seat(), 
sd_session_get_service(), sd_session_get_type(), sd_session_get_class(), 
sd_session_get_display(),sd_session_get_remote_user(), sd_session_get_remote_host() 
and sd_session_get_tty() return 0 or a positive integer. On failure, these calls return a 
negative errno-style error code.

Errors
Returned errors may indicate the following problems:

-ENXIO: The specified session does not exist.
-ENODATA: The given field is not specified for the described session.
-EINVAL: An input parameter was invalid (out of range, or NULL, where that is not accepted).
-ENOMEM: Memory allocation failed.

Right now on failure we get RuntimeError("cannot get user information via systemd"); without any indication of what syscall failed and why. Something like this should do it (not tested):

#include <stdlib.h>  // abs()

void set_systemd_errno(const char *syscall, int neg_errno) {
    PyObject *exc;
    int pos_errno;

    pos_errno = abs(neg_errno);
    sprintf(
        fullmsg, "%s (originated from %s)", strerror(pos_errno), syscall
    );
    exc = PyObject_CallFunction(PyExc_OSError, "(is)", pos_errno, fullmsg);
    PyErr_SetObject(PyExc_OSError, exc);
    Py_XDECREF(exc);
}

PyObject *
psutil_users_systemd(PyObject *self, PyObject *args) {
    int ret;

    ...

    ret = sd_session_get_username(session_id, &username);
    if (ret < 0) {
        set_systemd_errno("sd_session_get_username", ret);
        goto error;
    }

error:
   ...
   return NULL;  // NOT RuntimeError
aplanas commented 1 year ago

There's one last thing.

Adapted the code and tested it a bit. Done.