jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems

deadlocked accounts - jhub removes route but does not delete server? #314

Open lodott72 opened 5 hours ago

lodott72 commented 5 hours ago

Hi,

So we have a JupyterHub instance (v5.0.0) which spawns server jobs on an HTCondor batch cluster.

Occasionally we run into deadlocked user accounts which still have a server route even though the server job is gone, and the hub does not notice - the only solution currently is to delete the user account completely to get rid of the route.

Today we had another case, and going through the logs I understand that it is caused (in this case) by the hub somehow detecting a server stop and removing the route, but re-adding it shortly thereafter because it is "missing" - meaning the server entry is apparently still in the db even after the route is removed.
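
For reference, my mental model of the periodic route check is roughly the following - a simplified sketch of the logic only, not the actual JupyterHub code, and all the names are placeholders:

# Simplified sketch of the route check logic as I understand it - not the
# actual JupyterHub code; db_servers / proxy_routes / add_route / delete_route
# are placeholder names.

def check_routes(db_servers, proxy_routes, add_route, delete_route):
    # db_servers: user -> server record still present in the hub db
    # proxy_routes: routespec -> route currently known to the proxy
    expected = {"/user/%s/" % user: server for user, server in db_servers.items()}

    # re-add routes that are "missing" from the proxy; this is the step that
    # resurrects the route for a stopped server, because the server entry
    # was never deleted from the db
    for routespec, server in expected.items():
        if routespec not in proxy_routes:
            add_route(routespec, server)

    # drop routes that no longer correspond to any server entry in the db
    for routespec in list(proxy_routes):
        if routespec not in expected:
            delete_route(routespec)

So as long as the server entry survives in the db, the "missing" route keeps coming back at every check interval.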

This feels like a bug to me, maybe a timing issue? If routes and server entries were removed together, it seems like the deadlock would be avoided. Hoping for more insights...

Best Kruno

Here are the logs (jupyterhub + htcondor), boiled down to the relevant parts, with some comments added by me:

# last poll for the job
[D 2024-11-15 15:26:50.332 JupyterHub batchspawner:314] Spawner querying job: sudo -E -u lodott /var/lib/condor/util/condor_q.sh 1612
# server stop detected by jhub - don't know how/why
# there should have been two more polls here (every 30secs)
[W 2024-11-15 15:28:17.258 JupyterHub base:1290] User lodott server stopped, with exit code: 1
# jhub requests route removal from the proxy
[I 2024-11-15 15:28:17.258 JupyterHub proxy:356] Removing user lodott from proxy (/user/lodott/)
# user requests are ongoing but eventually run into a Timeout exception
# I think the exception might be responsible for killing the poll here
[E 2024-11-15 15:28:55.991 JupyterHub spawner:1459] Unhandled error in poll callback for <CondorOptionsSpawner object at 0x7fbe07e0bf50>
...
TimeoutError: Repeated api_request to proxy path "/user/lodott" failed.
### lots of client requests with 302 followed by 500, because there is no route
# periodic route check puts the route back because it's "missing" - the server entry in the db was apparently not deleted
[D 2024-11-15 15:31:16.190 JupyterHub proxy:392] Checking routes
...
[W 2024-11-15 15:31:16.208 JupyterHub proxy:416] Adding missing route for /user/lodott/ (Server(url=http://batchj004:40002/user/lodott/, bind_url=http://batchj004:40002/user/lodott/))
### client requests with 302 which are now routed again but server state is unclear
# htcondor actually removes the notebook job due to the runtime limit:
11/15/24 17:39:11 (1612.0) (1799193): Job 1612.0 is being removed: Job removed by SYSTEM_PERIODIC_REMOVE due to Job runtime longer than reserved.
# at this stage the htcondor job is gone but since there is no polling, the jhub never notices - deadlock
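
The deadlock only bites because the poll stops firing - as long as the poll keeps running, the spawner should notice the missing job and report the stop. Roughly what I expect the poll to do (heavily simplified sketch with placeholder names, not the actual batchspawner code):

# Heavily simplified sketch of the periodic poll - placeholder names,
# not the actual batchspawner implementation.
import asyncio

async def poll_job(run_query_cmd, job_id):
    # return None while the batch job is alive, an exit code otherwise
    output = await run_query_cmd(job_id)  # e.g. a wrapper around condor_q
    if output.strip():                    # job still known to the scheduler
        return None
    return 1                              # job gone -> treat the server as stopped

async def poll_loop(run_query_cmd, job_id, on_stopped, interval=30):
    # if this loop dies (e.g. an exception escapes), nobody ever calls
    # on_stopped() and the hub keeps believing the server is up
    while True:
        status = await poll_job(run_query_cmd, job_id)
        if status is not None:
            await on_stopped(status)
            return
        await asyncio.sleep(interval)
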
rcthomas commented 3 hours ago

We've observed something similar with a different batchspawner derivative (Slurm), and there the fix is also to delete the user and restart the hub. We've been unable to trigger it reliably enough to reproduce it, so for now we just monitor for it and clear it when we run into the issue. Sometimes it occurs after issues on the system.

Our setup includes a layer of wrapspawner on top, and what we've seen is that when users are in this situation they have an empty child spawner state - no job ID, for instance - so there is no way to verify whether the job is running. Tracing through the code I came to much the same conclusion you did: something weird happens during stop().
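
For context, the job ID in the child spawner state is the only handle the hub keeps on the batch job, so an empty state really does leave nothing to check. Roughly the shape I have in mind (a simplified sketch, not the exact batchspawner code):

# Simplified sketch of the state a batch-style spawner persists - not the
# exact batchspawner code, just the idea: the job id is the only link back
# to the batch job, so an empty state leaves nothing to poll or verify.

class BatchLikeState:
    def __init__(self):
        self.job_id = ""
        self.job_status = ""

    def get_state(self):
        state = {}
        if self.job_id:
            state["job_id"] = self.job_id
        if self.job_status:
            state["job_status"] = self.job_status
        return state

    def load_state(self, state):
        self.job_id = state.get("job_id", "")
        self.job_status = state.get("job_status", "")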

When you have a user in this state, have you looked at the server table entry in the database for that user to see what it looks like?
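
In case it's useful, something along these lines against the default sqlite database should show the relevant rows. The table and column names may differ between JupyterHub versions, so treat it as a starting point rather than the exact schema:

# Dump the spawner/server rows for one user from the hub db.
# Assumes the default jupyterhub.sqlite database; table/column names may
# differ between JupyterHub versions, so adjust as needed.
import sqlite3

def dump_user_server(db_path, username):
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row
    query = """
        SELECT users.name, spawners.name AS server_name, spawners.state,
               servers.ip, servers.port, servers.base_url
        FROM users
        JOIN spawners ON spawners.user_id = users.id
        LEFT JOIN servers ON spawners.server_id = servers.id
        WHERE users.name = ?
    """
    for row in con.execute(query, (username,)):
        print(dict(row))
    con.close()

dump_user_server("jupyterhub.sqlite", "lodott")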

lodott72 commented 2 hours ago

Looking through the current users I found another user with a route entry in the proxy but no running job in htcondor.

The user's spawner has a job id and a last poll result indicating that the job is running on the node set in the route - which means the poll cannot still be running (since the job was removed about two hours ago).

The spawner also has a server associated with it, pointing at the same node and additionally the port. Which is not surprising, since otherwise the periodic route check (every 5 minutes) would have deleted the route.

This is very similar to the other case, except that the ConfigProxy never logged that it actually deleted the route (which I left out of my log summary). The route seems to have stayed in the proxy the whole time, but the poll hit the same timeout exception and did not continue. When the htcondor job was removed, the hub did not notice.