Filter users in the server if possible

mriedem commented 3 years ago

Proposed change

JupyterHub 1.3.0 supports server-side filtering of users by the state of their servers:

https://jupyterhub.readthedocs.io/en/stable/_static/rest-api/index.html#operation--users-get

Per this change: https://github.com/jupyterhub/jupyterhub/pull/3177

The idle-culler is currently just calling GET /users and filtering out users with pending servers on the client side:

https://github.com/jupyterhub/jupyterhub-idle-culler/blob/v1.1/jupyterhub_idle_culler/__init__.py#L136

It should be faster to do that filtering on the server side, in the database, if the hub is new enough. The idle-culler could determine this by first checking the hub API version and, if it's >=1.3.0, adding the state query parameter to the GET /users request.

Given the check here:

https://github.com/jupyterhub/jupyterhub-idle-culler/blob/v1.1/jupyterhub_idle_culler/__init__.py#L149

I guess we'd want GET /users?state=ready to filter out pending servers.

One issue might be the logic for culling users that don't have any active servers when cull_users=True. Maybe two calls would need to be made in that case: one for GET /users?state=ready and one for GET /users?state=inactive?
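
A minimal sketch of the request plan described above, not the culler's actual code. The function names, the version parsing, and the idea of returning a list of URLs are all assumptions made for illustration; the hub version string is passed in rather than fetched.

```python
from urllib.parse import urlencode

# First release with the `state` filter on GET /users, per the issue text.
STATE_FILTER_MIN_VERSION = (1, 3, 0)


def parse_version(version):
    """Turn '1.3.0' into (1, 3, 0), ignoring suffixes such as 'rc1' or '.dev'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def plan_user_requests(api_url, hub_version, cull_users):
    """Return the GET URLs a culler run would need (hypothetical helper)."""
    base = api_url.rstrip("/") + "/users"
    if parse_version(hub_version) < STATE_FILTER_MIN_VERSION:
        # Older hub: one unfiltered call, keep filtering client-side.
        return [base]
    states = ["ready"]
    if cull_users:
        # Users without running servers only show up under state=inactive.
        states.append("inactive")
    return [base + "?" + urlencode({"state": state}) for state in states]


if __name__ == "__main__":
    for url in plan_user_requests("http://localhost:8000/hub/api", "1.3.0", True):
        print(url)
```

With cull_users=True on a new enough hub this yields two URLs, which matches the "two calls" idea; whether two calls beat one unfiltered call in practice would need measuring.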

Alternative options

None. Continue filtering client-side, but that might be slower if there are a lot of users in a large hub.

Who would use this feature?

Anyone with a large number of users in their hub (we sometimes have 1K+ during large events).

(Optional): Suggest a solution

see above

yuvipanda commented 3 years ago

This sounds great! Can you make a PR?

mriedem commented 3 years ago

> This sounds great! Can you make a PR?

I can take a pass at this, yeah. One thing that's difficult is not having a jupyterhub framework with tests in this repo to make sure everything is covered properly, but I'm also not sure how easy that is to add. I haven't set up a local jupyterhub for dev/test in a while but have done it before with the dummy auth and sqlite db (also assuming the jupyterhub docker image should work OK). We also have a testing cluster I can probably poke against.

Are there any plans on adding tests to this repo? I wonder if simply having a workflow using the jupyterhub docker image with very basic config would at least be a starting point for some kind of integration testing of the culler.

mriedem commented 3 years ago

Note to self: we probably don't need a version check, because calling GET /users?state=inactive against JupyterHub 1.2.2 just ignores the query parameter:

INFO 2021-03-15T17:56:05.799Z [JupyterHub log:181] 200 GET /hub/api/users?state=[secret] (5e18a4193f4a3f001127f809@10.241.6.29) 46.87ms

The only difference would be if we need to do the logic differently based on pre-filtering in the server.
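
One way to avoid branching on the hub version is to always send the state parameter and keep the client-side check as a safety net, since an older hub ignores the parameter and returns every user. The sketch below assumes the user model from the REST API (a `servers` mapping whose entries carry `ready` and `pending`); the helper name and the sample data are made up for illustration.

```python
def ready_servers(users):
    """Yield (user_name, server_name) for servers that are ready and not pending.

    Safe to run on both a pre-filtered response (state=ready) and an
    unfiltered one from a hub that ignored the query parameter.
    """
    for user in users:
        for server_name, server in user.get("servers", {}).items():
            if server.get("pending"):
                # Spawning or stopping: leave it alone this round.
                continue
            if not server.get("ready"):
                continue
            yield user["name"], server_name


if __name__ == "__main__":
    # Made-up response resembling an unfiltered GET /users result.
    sample = [
        {"name": "admin", "servers": {"": {"ready": False, "pending": "spawn"}}},
        {"name": "mriedem", "servers": {"": {"ready": True, "pending": None}}},
        {"name": "osboxes", "servers": {}},
    ]
    print(list(ready_servers(sample)))
```

On a hub that honours state=ready the loop simply does less work; on an older hub it behaves like the existing client-side filter.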

yuvipanda commented 3 years ago

So TLJH has two integration tests that test this, and maybe they can be duplicated here?

mriedem commented 3 years ago

I think I've got something that I can push up for initial review. It might not be the prettiest but I think it works. I tested locally with the jupyterhub/jupyterhub:latest (1.3.0) docker image with the testing config. I created three users, where admin is in the admin_users set. Running the culler script without telling it to cull users found 0 ready servers (I also ran it while a server spawn was pending, and that was filtered out). Running it with --cull-users found the 3 users via the state=inactive filter but only removed two of them:

$ jupyterhub-idle-culler --logging=debug --url=http://localhost:8000/hub/api --cull-every=30
[W 210315 16:04:28 __init__:439] Could not load pycurl: No module named 'pycurl'
    pycurl is recommended if you have a large number of users.
[D 210315 16:04:28 selector_events:59] Using selector: EpollSelector
[D 210315 16:04:28 __init__:132] Got 0 ready users
$ jupyterhub-idle-culler --logging=debug --url=http://localhost:8000/hub/api --cull-every=30 --cull-users
[W 210315 16:04:44 __init__:439] Could not load pycurl: No module named 'pycurl'
    pycurl is recommended if you have a large number of users.
[D 210315 16:04:44 selector_events:59] Using selector: EpollSelector
[D 210315 16:04:44 __init__:132] Got 0 ready users
[D 210315 16:04:44 __init__:361] Got 3 inactive users
[I 210315 16:04:44 __init__:321] Culling user osboxes (inactive for 1:33:21.953406)
[I 210315 16:04:44 __init__:321] Culling user mriedem (inactive for 1:04:57.594197)
[D 210315 16:04:44 __init__:337] Not culling user admin (created: 01:05:24, last active: 00:00:15)
[D 210315 16:04:44 __init__:372] Finished culling osboxes
[D 210315 16:04:44 __init__:372] Finished culling mriedem

Here I run it again while the server for user admin is spawning (so pending is truthy):

$ jupyterhub-idle-culler --logging=debug --url=http://localhost:8000/hub/api --cull-every=30 --cull-users
[W 210315 16:10:05 __init__:439] Could not load pycurl: No module named 'pycurl'
    pycurl is recommended if you have a large number of users.
[D 210315 16:10:05 selector_events:59] Using selector: EpollSelector
[D 210315 16:10:05 __init__:132] Got 0 ready users
[D 210315 16:10:05 __init__:361] Got 0 inactive users
