jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License

PBSspawner creates qsub and qstat zombie processes #206

Closed by nikl11 3 years ago

nikl11 commented 3 years ago

Bug description

I run JupyterHub as a web service for our cluster users. Each Jupyter notebook is spawned onto a PBS node using batchspawner and its PBSSpawner, which works great. Unfortunately, every time a user starts a new Jupyter notebook, the qsub process (which submits the request to PBS to allocate a node) and all qstat processes (which periodically check whether a node has been allocated) become zombie processes, and the only way to get rid of them is to kill the parent process, jupyterhub. Each request for a new notebook creates 1 qsub zombie and 10-50 qstat zombies. Once the total number of zombie processes reaches about 31700, the Linux system cannot create any more processes and becomes basically unusable. This number is reached within a few days, so every night I have to restart jupyterhub manually to kill the zombies, which also kills users' notebooks if they are working through the night.
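For readers unfamiliar with the term: a zombie is a child process that has exited but whose parent has not yet collected its exit status. The following is a minimal, Linux-only sketch of that lifecycle, not batchspawner code. The helper names `proc_state` and `zombie_demo` are made up for illustration, and the child is a trivial Python process standing in for qsub or qstat.

```python
import subprocess
import sys
import time


def proc_state(pid):
    """Return the state letter from /proc/<pid>/stat, or None if the pid is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except OSError:
        return None
    # The state is the first field after the parenthesised command name.
    return stat.rsplit(")", 1)[1].split()[0]


def zombie_demo(timeout=10.0):
    """Start a short-lived child and report its state before and after reaping."""
    # Stand-in for qsub/qstat: a child that exits almost immediately.
    child = subprocess.Popen([sys.executable, "-c", "pass"])

    # The child has exited, but nobody has called wait() on it yet,
    # so the kernel keeps its process table entry in state "Z".
    deadline = time.monotonic() + timeout
    while proc_state(child.pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.05)
    before = proc_state(child.pid)

    # wait() collects the exit status; only then is the entry released.
    child.wait()
    after = proc_state(child.pid)
    return before, after


if __name__ == "__main__":
    print(zombie_demo())
```

On a Linux system this should report state `Z` before the `wait()` call and no process entry afterwards. It also shows why restarting jupyterhub clears the zombies: once the parent is gone, the orphaned entries are adopted and reaped by init.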

Expected behaviour

Obviously I don't want jupyterhub to keep generating zombies. I think the bug is somewhere in batchspawner, since jupyterhub itself has nothing to do with qsub and qstat; these commands are used only by batchspawner.
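To make the expected behaviour concrete, here is a rough sketch of how an asyncio-based parent can run an external command without leaving a zombie behind. This is an assumption about the general pattern only, not how batchspawner actually runs qsub and qstat; `run_command` is a hypothetical helper, and the command in the usage line is a placeholder.

```python
import asyncio
import sys


async def run_command(cmd):
    """Run cmd and return (exit_code, stdout). The child is reaped before returning."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        # communicate() reads the pipes and waits for the child to exit,
        # which is what collects its exit status.
        out, _err = await proc.communicate()
    except asyncio.CancelledError:
        # If the caller gives up (for example on a timeout), kill the
        # child and still wait for it, otherwise it may linger unreaped.
        try:
            proc.kill()
        except ProcessLookupError:
            pass  # the child had already exited
        await proc.wait()
        raise
    return proc.returncode, out.decode()


if __name__ == "__main__":
    # Placeholder command; a real spawner would run qsub or qstat here.
    print(asyncio.run(run_command([sys.executable, "-c", "print('ok')"])))
```

The point of the sketch is that every code path, including cancellation, ends with the parent waiting on the child.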

Actual behaviour

Each user's request for a new Jupyter notebook creates one qsub zombie process and about 10-50 qstat zombies, depending on how quickly a node is allocated: the longer it takes, the more qstat zombies are created.
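Counts like these can be gathered with a small script that scans `/proc` for processes in state `Z` and groups them by command name. The sketch below is Linux-only and illustrative; `parse_stat` and `zombies_by_command` are hypothetical helper names, and the parent PID has to be supplied by whoever runs it.

```python
import os
from collections import Counter


def parse_stat(text):
    """Split a /proc/<pid>/stat record into (pid, comm, state, ppid)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    head, rest = text.split("(", 1)
    comm, tail = rest.rsplit(")", 1)
    fields = tail.split()
    return int(head), comm, fields[0], int(fields[1])


def zombies_by_command(parent_pid=None):
    """Count zombie processes per command name, optionally for one parent only."""
    counts = Counter()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/stat") as f:
                _pid, comm, state, ppid = parse_stat(f.read())
        except OSError:
            continue  # process disappeared or is not readable
        if state == "Z" and (parent_pid is None or ppid == parent_pid):
            counts[comm] += 1
    return counts


if __name__ == "__main__":
    # Pass the PID of the jupyterhub process to restrict the count to
    # its children; with no argument, all zombies on the host are counted.
    print(zombies_by_command())
```

Running it before and after a single notebook request, with the jupyterhub PID as `parent_pid`, would show how many qsub and qstat entries that one request leaves behind.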

How to reproduce

If you have a system with PBS installed, and run jupyterhub with PBSSpawner, I expect you might see similar behaviour. But there is no other simple way to reproduce.

Your personal set up

jupyterhub 1.3.0 runs on an Ubuntu 18.04 cloud server with the PBS Pro client installed to submit PBS commands to the clusters with compute nodes, using PBSSpawner and PAM Kerberos for authentication (so each user is the "owner" of their zombie processes).

Thank you for any help. I firmly believe there is a bug in batchspawner, but I have not been able to identify or fix it.