openpbs / openpbs

An HPC workload manager and job scheduler for desktops, clusters, and clouds.
https://www.openpbs.org

Using queuejob and periodic event for the same hook encounters issues #1097

Open JonShelley opened 5 years ago

JonShelley commented 5 years ago

Goal: Check each job's select statement before allowing the job to run. (Note: some jobs do not have a select statement until after they have reached the server.)

I am using PBS 18.1.3 on CentOS 7.6. I am trying to use a queuejob hook to check job requests and modify the select statement as needed. For some jobs that use legacy (Torque) syntax, I need to let them get queued before I can check the select statement. In this case, I use the queuejob hook to put a hold on the job. I then want to use the same hook in a periodic event to find all jobs in a specific hold state (I use "so"), check their select statements, modify them, and release the hold.
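For readers following along, the approach described above could be sketched roughly as below. This is not the poster's actual hook: the helper name held_for_check and the HOLD constant are made up for illustration, the select check itself is site-specific and left as a comment, and how the hold is released from the periodic event is not described in the post, so it is not shown. The pbs module only exists inside the server's hook interpreter, which is why the import is guarded.

```python
try:
    import pbs  # provided only by the PBS server's embedded hook interpreter
except ImportError:
    pbs = None  # lets the pure-Python helper below be exercised outside PBS

# Hold types used to mark jobs that still need their select statement checked
# ("so" is the value mentioned in the post).
HOLD = "so"


def held_for_check(jobs, hold=HOLD):
    """Return the jobs whose Hold_Types contain exactly the marker hold types.

    The comparison ignores letter order, so "so" and "os" both match.
    """
    return [j for j in jobs if set(str(j.Hold_Types)) == set(hold)]


def main():
    e = pbs.event()
    if e.type == pbs.QUEUEJOB:
        # Job is being submitted: park it so it can be checked once queued.
        e.job.Hold_Types = pbs.hold_types(HOLD)
    elif e.type == pbs.PERIODIC:
        # Periodic run: find the parked jobs and log them.
        for job in held_for_check(pbs.server().jobs()):
            pbs.logmsg(pbs.LOG_DEBUG, "job %s awaits select check" % job.id)
            # Site-specific (not shown): inspect and modify the select
            # statement, then release the hold.
    e.accept()


if pbs is not None:
    main()
```

Keeping the job selection in a plain function makes that part easy to check outside the server with stand-in job objects.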

This works as expected initially. However, after some time (1-2 hours), the hook stops running periodically (as determined from the server logs). If I restart the server, the hook starts running again.

Also, I have seen jobs get rejected.

[testusera@ip-0A021004 default]$ qsub -lselect=2:ncpus=60 test.pbs
qsub: queuejob event: rejected request

If I remove the periodic event from the hook and then resubmit the job, it submits without issue.

Any thoughts?

vchlum commented 5 years ago

Hey @JonShelley, my guess is that a memory leak in the Python interpreter could be causing your trouble. Please see this page or the hook guide 18.2.3 (chapter 4.3) and try adjusting the server attributes (python_restart_max_hooks, python_restart_max_objects, python_restart_min_interval) appropriately.
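As a rough illustration of adjusting those attributes, the snippet below only builds and prints the qmgr commands so they can be reviewed before being run by hand. The numeric values are arbitrary placeholders, not recommendations, and the helper name qmgr_commands is made up for this example; the hook guide is the place to check for defaults and units.

```python
# Placeholder values for illustration only; pick real ones from the hook guide.
RESTART_SETTINGS = {
    "python_restart_max_hooks": 100,
    "python_restart_max_objects": 1000,
    "python_restart_min_interval": 30,
}


def qmgr_commands(settings):
    """Build one 'qmgr -c "set server ..."' command line per attribute."""
    return [
        'qmgr -c "set server %s = %s"' % (name, value)
        for name, value in sorted(settings.items())
    ]


# Print the commands instead of executing them, so nothing changes on the
# server until an administrator runs them deliberately.
for cmd in qmgr_commands(RESTART_SETTINGS):
    print(cmd)
```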