Closed mriedem closed 3 years ago
This sounds great! Can you make a PR?
I can take a pass at this, yeah. One thing that's difficult is not having a JupyterHub test framework in this repo to make sure everything is covered properly, and I'm not sure how easy that is to add. I haven't set up a local JupyterHub for dev/test in a while, but I have done it before with the dummy auth and a sqlite db (also assuming the jupyterhub docker image should work OK). We also have a testing cluster I can probably poke against.
Are there any plans on adding tests to this repo? I wonder if simply having a workflow using the jupyterhub docker image with very basic config would at least be a starting point for some kind of integration testing of the culler.
Note to self: we probably don't need a version check, because calling GET /users?state=inactive against JupyterHub 1.2.2 just ignores the query parameter:
INFO 2021-03-15T17:56:05.799Z [JupyterHub log:181] 200 GET /hub/api/users?state=[secret] (5e18a4193f4a3f001127f809@10.241.6.29) 46.87ms
The only difference would be if the culler's logic needs to change depending on whether the server has already pre-filtered the results.
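One way to sidestep that question is to keep a client-side check in place even when the state parameter is sent, so the result is the same whether or not the hub pre-filters. A minimal sketch, assuming each user in the GET /users response carries a servers mapping whose entries have ready and pending fields; the helper name users_with_ready_servers is hypothetical and is not part of the culler:

```python
def users_with_ready_servers(users):
    """Return the users that have at least one ready, non-pending server.

    Safe to apply both to a response the hub already filtered and to an
    unfiltered response from an older hub that ignored the state parameter.
    """
    selected = []
    for user in users:
        # "servers" may be missing or empty for users with nothing running
        servers = user.get("servers") or {}
        if any(s.get("ready") and not s.get("pending") for s in servers.values()):
            selected.append(user)
    return selected


# Hypothetical response from a hub that ignored ?state=ready
users = [
    {"name": "osboxes", "servers": {}},
    {"name": "mriedem", "servers": {"": {"ready": True, "pending": None}}},
    {"name": "admin", "servers": {"": {"ready": False, "pending": "spawn"}}},
]
print([u["name"] for u in users_with_ready_servers(users)])
```

With this in place the server-side filter only reduces how much data comes back; it does not change which users are selected.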
TLJH has two integration tests that cover this; maybe they can be duplicated here?
I think I've got something that I can push up for initial review. It might not be the prettiest but I think it works. I tested locally with the jupyterhub/jupyterhub:latest (1.3.0) docker image with the testing config. I created three users, where admin is in the admin_users set. Running the culler script without telling it to cull users found 0 ready servers (I also ran it while a server spawn was pending and that was filtered out). Running it and telling it to cull idle users, it found the 3 users using the state=inactive filter but only removed two of them:
$ jupyterhub-idle-culler --logging=debug --url=http://localhost:8000/hub/api --cull-every=30
[W 210315 16:04:28 __init__:439] Could not load pycurl: No module named 'pycurl'
pycurl is recommended if you have a large number of users.
[D 210315 16:04:28 selector_events:59] Using selector: EpollSelector
[D 210315 16:04:28 __init__:132] Got 0 ready users
$ jupyterhub-idle-culler --logging=debug --url=http://localhost:8000/hub/api --cull-every=30 --cull-users
[W 210315 16:04:44 __init__:439] Could not load pycurl: No module named 'pycurl'
pycurl is recommended if you have a large number of users.
[D 210315 16:04:44 selector_events:59] Using selector: EpollSelector
[D 210315 16:04:44 __init__:132] Got 0 ready users
[D 210315 16:04:44 __init__:361] Got 3 inactive users
[I 210315 16:04:44 __init__:321] Culling user osboxes (inactive for 1:33:21.953406)
[I 210315 16:04:44 __init__:321] Culling user mriedem (inactive for 1:04:57.594197)
[D 210315 16:04:44 __init__:337] Not culling user admin (created: 01:05:24, last active: 00:00:15)
[D 210315 16:04:44 __init__:372] Finished culling osboxes
[D 210315 16:04:44 __init__:372] Finished culling mriedem
Here I run it again while the server for user admin is spawning (so pending is truthy):
$ jupyterhub-idle-culler --logging=debug --url=http://localhost:8000/hub/api --cull-every=30 --cull-users
[W 210315 16:10:05 __init__:439] Could not load pycurl: No module named 'pycurl'
pycurl is recommended if you have a large number of users.
[D 210315 16:10:05 selector_events:59] Using selector: EpollSelector
[D 210315 16:10:05 __init__:132] Got 0 ready users
[D 210315 16:10:05 __init__:361] Got 0 inactive users
Proposed change
JupyterHub 1.3.0 supports server side filtering of the users with servers in a given state:
https://jupyterhub.readthedocs.io/en/stable/_static/rest-api/index.html#operation--users-get
Per this change: https://github.com/jupyterhub/jupyterhub/pull/3177
The idle-culler is currently just calling GET /users and filtering out users with pending servers on the client side:
https://github.com/jupyterhub/jupyterhub-idle-culler/blob/v1.1/jupyterhub_idle_culler/__init__.py#L136
It should be faster to just do that filtering on the server side in the database if the hub is new enough (idle-culler could determine this by first checking the hub API version and, if it's >=1.3.0, adding the state query parameter to the GET /users request).
Given the check here:
https://github.com/jupyterhub/jupyterhub-idle-culler/blob/v1.1/jupyterhub_idle_culler/__init__.py#L149
I guess we'd want GET /users?state=ready to filter out pending servers.
One issue might be any logic around culling users that don't have any active servers but cull_users=True; maybe two calls would need to be made in that case, one for GET /users?state=ready and one for GET /users?state=inactive?
Alternative options
None; continue filtering client-side, but that might be slower if there are a lot of users in a large hub.
Who would use this feature?
Anyone with a large number of users in their hub (we sometimes have 1K+ during large events).
(Optional): Suggest a solution
see above
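The proposal above (check the hub version, then issue one or two filtered requests) could be sketched roughly as follows. This only builds the request URLs and makes no HTTP calls. The function names supports_state_filter and user_list_urls are hypothetical, and the version string is assumed to have been fetched from the hub beforehand; none of this is taken from the culler's actual code.

```python
import re
from urllib.parse import urlencode


def supports_state_filter(hub_version):
    """Return True if the hub version string is at least 1.3.0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", hub_version)
    if not match:
        # Unknown format: assume no support and filter client-side instead
        return False
    major, minor, patch = (int(part or 0) for part in match.groups())
    return (major, minor, patch) >= (1, 3, 0)


def user_list_urls(api_url, hub_version, cull_users=False):
    """Build the GET /users URLs to request for one culling pass."""
    base = api_url.rstrip("/") + "/users"
    if not supports_state_filter(hub_version):
        # Older hub: one unfiltered request, filtered on the client side
        return [base]
    states = ["ready"]
    if cull_users:
        # Users without any active server only show up under state=inactive
        states.append("inactive")
    return [base + "?" + urlencode({"state": state}) for state in states]


print(user_list_urls("http://localhost:8000/hub/api", "1.3.0", cull_users=True))
print(user_list_urls("http://localhost:8000/hub/api", "1.2.2", cull_users=True))
```

For a 1.3.0 hub with cull_users enabled this yields two URLs, one per state; for 1.2.2 it falls back to the single unfiltered GET /users request.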