lodott72 opened 5 hours ago
We've observed something similar with a different batchspawner derivative (Slurm), and the fix is to delete the user and restart the hub. We haven't been able to reproduce it reliably, so for now we just monitor for it and clear it when we run into the issue. Sometimes it occurs after other problems on the system.
Our setup includes a layer of wrapspawner on top, and what we've seen is that users in this situation have an empty child spawner state (no job ID, for instance), so there is no way to verify whether the job is running. Tracing through the code I came to much the same conclusion you did: something goes wrong during stop().
When you have a user in this state, have you looked at the server table entry in the database for that user to see what it looks like?
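In case it helps, here is a minimal sketch of how one could inspect those rows directly. It assumes the default SQLite database (`jupyterhub.sqlite`) and the table/column names used by recent JupyterHub versions (a `spawners` table holding the serialized state and a `server_id` foreign key into `servers`); the DB path, username and schema details are assumptions you may need to adjust for your deployment.

```python
# Minimal sketch: inspect a user's spawner/server rows in the hub database.
# Assumes the default SQLite DB and a recent JupyterHub schema -- adjust as needed.
import json
import sqlite3

DB_PATH = "jupyterhub.sqlite"   # assumption: default sqlite database
USERNAME = "someuser"           # the affected user

con = sqlite3.connect(DB_PATH)
con.row_factory = sqlite3.Row

rows = con.execute(
    """
    SELECT u.name, s.name AS spawner_name, s.state, s.server_id,
           srv.ip, srv.port, srv.base_url
    FROM users u
    JOIN spawners s ON s.user_id = u.id
    LEFT JOIN servers srv ON srv.id = s.server_id
    WHERE u.name = ?
    """,
    (USERNAME,),
).fetchall()

for row in rows:
    print(dict(row))
    # `state` holds the JSON-serialized spawner state (e.g. the batch job id)
    if row["state"]:
        print("spawner state:", json.loads(row["state"]))

con.close()
```

A non-NULL `server_id` (and a surviving row in `servers`) for a user whose job is long gone would fit the "server entry still in the db" theory.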
Looking through the current users, I found another user with a route entry in the proxy but no running job in htcondor.
The user's spawner has a job id, and its last poll indicated that the job was running on the node set in the route, which means the poll cannot still be running (since the job was removed about two hours ago).
The spawner also has a server associated with it, with the same node and additionally the port. This is not surprising, since otherwise the periodic route check (every 5 minutes) should have deleted the route.
This is very similar to the other case, except that the ConfigProxy never logged that it actually deleted the route (which I left out of my log summary). The route seems to have stayed in the proxy the whole time, but the poll also hit the timeout exception and did not continue, so when the htcondor job was removed, the jhub did not notice.
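For reference, this is roughly the cross-check I did, here sketched as a script. It assumes configurable-http-proxy's REST API on its default port (8001) with the token from `CONFIGPROXY_AUTH_TOKEN`, and that the route data added by the hub carries the user name; those details, like matching against `condor_q` owners, are assumptions to adapt to your own setup.

```python
# Rough sketch: flag proxy routes whose user has no job left in the condor queue.
import json
import os
import subprocess
import urllib.request

PROXY_API = "http://127.0.0.1:8001/api/routes"   # assumption: default CHP API URL
TOKEN = os.environ["CONFIGPROXY_AUTH_TOKEN"]

req = urllib.request.Request(PROXY_API, headers={"Authorization": f"token {TOKEN}"})
routes = json.load(urllib.request.urlopen(req))

# Owners that currently have at least one job in the condor queue
out = subprocess.run(
    ["condor_q", "-allusers", "-autoformat", "Owner"],
    capture_output=True, text=True, check=True,
)
owners_with_jobs = set(out.stdout.split())

for path, route in routes.items():
    user = route.get("user")   # assumption: hub-added routes carry the user name
    if user and user not in owners_with_jobs:
        print(f"stale? route {path} -> {route.get('target')} (user {user}, no condor job)")
```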
Hi,
so we have a jupyterhub instance (v 5.0.0) which uses an htcondor batch cluster to spawn server jobs on.
Occasionally we run into deadlocked user accounts which still have a server route even though the server job is gone, and the jhub does not notice; currently the only solution is to delete the user account completely to get rid of the route.
Today we had another case, and going through the logs I understand that it was caused by the jhub somehow detecting a server stop and removing the route, but re-adding it shortly thereafter because it is "missing", meaning the server entry is probably still in the db even after the route is removed.
This feels like a bug to me, maybe a timing issue? If routes and servers were removed together, it seems like the deadlock would be avoided. Hoping for more insights...
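To make the suspected timing issue concrete, here is a toy sketch of the sequence I have in mind. It is purely illustrative and not JupyterHub's actual code: stop() removes the route, but if the server entry survives in the db, a periodic route check that treats every db server as "running" will put the route back.

```python
# Toy illustration of the suspected race -- not JupyterHub code.
routes = {"/user/kruno": "http://node:8888"}    # proxy state
db_servers = {"kruno": "http://node:8888"}      # hub db: server entries

def stop(user):
    routes.pop(f"/user/{user}", None)           # route removed first...
    # ...but assume stop() dies here, so the server entry is never cleared:
    # db_servers.pop(user, None)

def check_routes():
    # periodic task: every server known to the db must have a route
    for user, target in db_servers.items():
        routes.setdefault(f"/user/{user}", target)   # re-adds the "missing" route

stop("kruno")
check_routes()
print(routes)   # the route is back, pointing at a job that no longer exists
```

If the server entry were deleted in the same step as the route, the check would find nothing to re-add.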
Best,
Kruno
Here are the logs (jupyterhub + htcondor), boiled down to the relevant parts, with some comments added by me: