jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License

PBSSpawner creates qsub and qstat zombie processes #207

Open nikl11 opened 3 years ago

nikl11 commented 3 years ago

Bug description

I run JupyterHub as a web service for our cluster users; each Jupyter notebook is spawned onto a PBS node using batchspawner and its PBSSpawner, which works great. Unfortunately, every time a user starts a new Jupyter notebook, the qsub process (which submits the request to the PBS system to allocate a node) and all qstat processes (which periodically check whether a node has been allocated) become zombie processes, and the only way to get rid of them is to kill the parent jupyterhub process (sending SIGCHLD doesn't solve it, I have to use kill -9 on jupyterhub). Each request for a new notebook creates 1 qsub zombie process and 10-50 qstat zombie processes, and once the total number of zombie processes reaches about 31,700, the Linux system cannot create any more processes and becomes basically unusable. That point is reached in a few days, so every night I have to manually restart jupyterhub to kill the zombies, but that also kills users' notebooks if they decide to work through the night.

Expected behaviour

Obviously I don't want jupyterhub to keep generating zombies. I think the bug is somewhere in batchspawner, as jupyterhub itself has nothing to do with qsub and qstat; these commands are used strictly by batchspawner.

Actual behaviour

Each user's request for a new Jupyter notebook creates one qsub zombie process and about 10-50 qstat zombies (depending on how quickly a node is allocated: the longer it takes, the more qstat zombies are created).

How to reproduce

If you have a system with PBS installed and run JupyterHub with PBSSpawner, I expect you might see similar behaviour. I know of no other simple way to reproduce it.

Your personal set up

JupyterHub 1.3.0 runs on an Ubuntu 18.04 cloud server with the PBS Pro client installed to submit PBS commands to the clusters with compute nodes, using PBSSpawner and PAM Kerberos authentication (so each user is the "owner" of their zombie processes).

Thank you for any help. I firmly believe there is a bug in batchspawner, but I have not been able to identify it or fix it.

welcome[bot] commented 3 years ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

consideRatio commented 3 years ago

@nikl11 excellent observation! I'm moving this issue to this repo, since the source code that starts qstat and qsub lives here and is what I presume will need a fix.

https://github.com/jupyterhub/batchspawner/blob/129951ad11e3049567b94adb8c9725dc22225a1a/batchspawner/batchspawner.py#L532-L535
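For reference, the zombie-free pattern with asyncio subprocesses is to await the child until it exits so its status is collected. The sketch below shows only that general pattern; it is not the actual batchspawner code, and the function name is just illustrative.

```python
import asyncio

async def run_command(cmd, inp=None):
    """Minimal sketch: run a shell command and reap the child by awaiting communicate()."""
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    # communicate() waits for the process to exit and collects its exit status,
    # which lets the kernel drop the child from the process table (no zombie).
    out, err = await proc.communicate(inp.encode() if inp else None)
    if proc.returncode != 0:
        raise RuntimeError(f"{cmd} exited with {proc.returncode}: {err.decode()}")
    return out.decode()
```

If the real run_command already awaits communicate()/wait(), the zombies presumably come from somewhere else in the command pipeline (e.g. a wrapper process left behind by the user-switching prefix).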

consideRatio commented 3 years ago

@nikl11 I merged https://github.com/jupyterhub/batchspawner/pull/195, which may have fixed your issue. I have not done a thorough code inspection and I'm not a user of batchspawner, so I really don't know, though.

Could you see if it made a difference?

nikl11 commented 3 years ago

Unfortunately the problem still persists. It is very annoying: while the PBS job (in which the notebook server spawns once a PBS compute node is allocated) is pending in the queue, jupyterhub and your batchspawner repeatedly call qstat to check the state of the job, roughly every second it seems. Every qstat call is a separate process that is never properly waited for: the parent (the jupyterhub process in our case) does not collect the return code of these qstat processes, so each one becomes a zombie, including the initial qsub call that allocates the PBS compute node.

As a result, a single notebook spawn from a single user generates thousands of zombie processes per hour (3600 if qstat is called every second; I haven't actually measured it exactly). And because most Unix systems start failing when the total number of processes reaches roughly 2^15 - 1, which is about 32,000, even 10 notebook jobs waiting in a queue for an hour or two can completely take down the operating system by overwhelming it with zombie processes.

I am actually surprised nobody has noticed this, because unless it is only my problem, which seems very unlikely, everybody who uses JupyterHub with batchspawner on PBS (maybe also other scheduling systems like Torque etc., I haven't checked) and has even a small community of a few dozen users must be completely clogged with zombie processes.

Sending SIGCHLD to the jupyterhub process does not help; the only way to clean up the zombies is to kill the parent jupyterhub process and restart it, which also means all current users' notebook servers stop working because they lose their connection to the hub. Currently I don't have that much traffic on my hub, maybe 2-5 users per day, and the waiting time in the queues is usually short, so it is enough to restart jupyterhub every few nights; but if I had even 10-20 very active users per day, I would not be able to run JupyterHub with your PBS spawner.

Thank you for suggesting a solution.

mbmilligan commented 3 years ago

Hello @nikl11 -

I am actually surprised nobody has noticed this, because unless it is only my problem, which seems very unlikely, everybody who uses JupyterHub with batchspawner on PBS (maybe also other scheduling systems like Torque etc., I haven't checked) and has even a small community of a few dozen users must be completely clogged with zombie processes.

As you surmise, this would be quickly noticed in a production system of any size, of which there are several, so this is not a general problem with Batchspawner.

Looking at your setup, the two elements that I notice being uncommon among Batchspawner deployments are PBS Pro and Kerberos authentication. Most of our deployments use either Slurm or Torque, and use either sudo or job scheduler parameters to accomplish the user context switch. There could be an interaction there that we haven't previously encountered. Could you set your log level to debug and post logs that show the exact command that is generated for qstat and any associated messages?
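For completeness, debug logging can be enabled in jupyterhub_config.py; a minimal sketch using standard JupyterHub options (nothing batchspawner-specific):

```python
# jupyterhub_config.py -- make JupyterHub log the exact qsub/qstat commands it runs
c.JupyterHub.log_level = 'DEBUG'
c.Spawner.debug = True  # also start the single-user servers with --debug
```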

jbaksta commented 3 years ago

I'll just make a note here that I've run JupyterHub with PBS Professional on a few systems, as well as with Slurm, and I've not had this particular issue. We don't use Kerberos or Ubuntu. I'd probably start by looking at where the UID changes (e.g., where batchspawner calls sudo) and making sure nothing is hanging there. The reason I say this is that I've seen similar (though not identical) behaviour when sudo had failed or was waiting on input, or (in the case of a different spawner) when a user put a bunch of time-consuming commands in their startup scripts that took longer than the poll interval.
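To make that concrete: batchspawner wraps the scheduler commands in a configurable prefix that performs the user switch. If I recall the trait name correctly it is exec_prefix, with a sudo-based default roughly like the sketch below (please verify against your batchspawner version); a sudo prompt or hang at this step would leave child processes stuck.

```python
# jupyterhub_config.py -- sketch only; trait name and default value quoted from memory
c.BatchSpawnerBase.exec_prefix = 'sudo -E -u {username}'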

nikl11 commented 3 years ago

Thanks @jbaksta, but fixing this annoying bug is probably far above my abilities. I know this behaviour with lingering zombie processes generally happens when the parent does not call wait() (or wait_for()) to properly collect the return codes and let the child processes die in peace. If you don't call wait(), the process stays in the zombie state until you send SIGCHLD to the parent (which doesn't work for me) or SIGKILL the parent, which works for me but completely kills and restarts the jupyterhub process, so all users' running Jupyter notebook servers lose their connection to the hub and stop working completely (which is bad not only because users may be confused and have to start a new notebook to resume work, but also because the PBS node allocation that ran the original notebook keeps running and eating resources, so a more thorough cleanup is required or resources and energy are wasted).
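As an illustration of that mechanism (plain Python, nothing to do with batchspawner): a child that has exited but has not been wait()ed for shows up in ps with state 'Z' until the parent collects its status.

```python
import os
import subprocess
import time

# Start a short-lived child and deliberately do not wait() for it.
p = subprocess.Popen(['true'])
time.sleep(1)
# The child has exited but has not been reaped, so ps reports it in state 'Z'.
os.system(f'ps -o pid,stat,comm -p {p.pid}')
# Collecting the exit status reaps the child; the zombie disappears.
p.wait()
os.system(f'ps -o pid,stat,comm -p {p.pid}')  # the pid is gone now
```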

Right now I don't have many users running notebooks, so I can just get rid of the zombies by restarting my jupyterhub at night when nobody is running a notebook. But unfortunately it has already happened that our GPU cluster was completely overloaded and users were waiting hours in the queue, which meant far too many qstat calls by batchspawner, and my whole local server got crippled by having almost 32,000 zombies. That seems to be a hard limit no matter what you set in ulimit, and once it is reached you cannot create new processes; even issuing that stupid SIGKILL is often a problem.

If the problem were just a missing wait(), I could somehow fix it, but PR #195 seems to have already done that; if I understand it correctly, it added synchronization, wait calls and cleanup. Beyond that I have no idea where to even begin. If your PBS Pro deployments don't show this and mine does, where do I even start? I doubt I am missing just a simple qsub or qstat parameter or a config setting (like qstat --kill-yourself-after-you-print-the-output ...); it is going to be a complex issue.

BTW, is there a way in batchspawner to set the interval at which qstat is called? Right now it is called every second, which seems like a lot (one new zombie every second); I would rather have it at 10-20 s, even if it means the notebook may load a little more slowly for users...
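(For reference, if I've found the right traits, something like the following in jupyterhub_config.py might control this; the startup polling cadence would be a batchspawner trait and the post-start liveness poll a JupyterHub trait. Names unverified, please confirm against the source.)

```python
# jupyterhub_config.py -- sketch only; verify trait names against your batchspawner/JupyterHub versions
c.PBSSpawner.startup_poll_interval = 10  # seconds between qstat calls while the job waits in the queue
c.Spawner.poll_interval = 60             # seconds between liveness checks once the server is running
```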