Open JonShelley opened 5 years ago
Hey @JonShelley, my guess is that a memory leak in the Python interpreter could be causing your trouble. Please see this page or the hook guide 18.2.3 (chapter 4.3) and try adjusting the server attributes (python_restart_max_hooks, python_restart_max_objects, python_restart_min_interval) appropriately.
Goal: To be able to check each job's select statement before allowing the job to run (Note: some jobs do not have select statements until after they have reached the server)
I am using PBS 18.1.3 on CentOS 7.6. I am trying to use a queuejob hook to check job requests and modify the select statement as needed. For some jobs that use legacy (Torque) syntax, I need to let them get queued before I can check the select statement. In this case, I use the queuejob hook to put a hold on the job. I then want to use the same hook in a periodic event to find all jobs in a specific hold state (I use "so"), check their select statements, modify them, and release the hold.
The hook works as expected initially. However, after some time (1-2 hours) I see that the hook stops running periodically (as determined from the server logs). If I restart the server, the hook starts running again.
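To make the "check and modify the select statement" step concrete, here is a minimal sketch of that kind of check in plain Python. It is not the actual hook: the function names and the ncpus cap are hypothetical, it does not import the pbs hook module, and it assumes the select string has already been read from the job (presumably from the job's Resource_List inside the hook).

```python
def parse_select(select):
    """Split a select string such as '2:ncpus=60:mem=4gb+1:ncpus=8' into
    a list of (chunk_count, resources) pairs."""
    chunks = []
    for chunk in select.split("+"):
        parts = chunk.split(":")
        # The leading chunk count is optional and defaults to 1.
        if "=" in parts[0]:
            count = 1
        else:
            count = int(parts[0])
            parts = parts[1:]
        resources = dict(p.split("=", 1) for p in parts)
        chunks.append((count, resources))
    return chunks


def format_select(chunks):
    """Rebuild a select string from (chunk_count, resources) pairs."""
    return "+".join(
        ":".join([str(count)] + ["%s=%s" % (k, v) for k, v in res.items()])
        for count, res in chunks
    )


def cap_ncpus(select, max_ncpus):
    """Hypothetical policy: lower any per-chunk ncpus above max_ncpus."""
    chunks = parse_select(select)
    for _, res in chunks:
        if int(res.get("ncpus", 1)) > max_ncpus:
            res["ncpus"] = str(max_ncpus)
    return format_select(chunks)
```

Keeping the parsing and policy in plain functions like these means the same logic can be called from both the queuejob path and the periodic path, and tested outside the server. Resource order in the rebuilt string relies on dict ordering, which is only guaranteed on Python 3.7 and later.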
Also, I have seen jobs get rejected.
[testusera@ip-0A021004 default]$ qsub -lselect=2:ncpus=60 test.pbs
qsub: queuejob event: rejected request
If I remove the periodic event from the hook and then resubmit the job, it submits without issue.
Any thoughts?