Open tallakahath opened 8 years ago
OK, by using/abusing the 'hostname' field this is fixed. Branch and commit tallakahath/pbs@d24a465 fix this by adding a few lines and changing the SQL query:
# Parse our hostname so we can only select jobs from THIS host.
# Otherwise, if we're on a multiple-clusters-same-home setup,
# we may incorrectly update jobs from one cluster onto the other.
m = re.search(r"(.*?)(?=[^a-zA-Z0-9]*login.*)", self.hostname)  # pylint: disable=invalid-name
if m:
    hostname_regex = m.group(1) + ".*"
else:
    hostname_regex = self.hostname + ".*"
# select jobs that are not yet marked complete
self.curs.execute("SELECT jobid FROM jobs WHERE jobstatus!='C' AND hostname REGEXP ?",
                  (hostname_regex, ))
So now the current hostname is grabbed and stripped (e.g. mycluster-login1 becomes mycluster, since all login nodes of mycluster should share the same queue), then matched as a regexp against the existing hostname entries. As a result, if you're on cluster A, logged in to A-login1 or similar, a JobDB.update call will only mark jobs C[omplete] if they belong to A AND are not being returned by a qstat/squeue run on A.
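As a quick illustration of the stripping step, here is the same regex pulled out into a standalone helper (the function name is mine, just for demonstration):

```python
import re

def hostname_to_regex(hostname):
    # Strip a "-login1"-style suffix so every login node of a cluster
    # collapses to the same pattern (same regex as the snippet above).
    m = re.search(r"(.*?)(?=[^a-zA-Z0-9]*login.*)", hostname)
    if m:
        return m.group(1) + ".*"
    return hostname + ".*"

print(hostname_to_regex("mycluster-login1"))  # -> mycluster.*
print(hostname_to_regex("headnode"))          # -> headnode.* (no 'login', falls through)
```

If the hostname contains no "login" substring, the lookahead never matches and the full hostname is used, so single-login-node clusters behave the same as before.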
Also, this commit can be cherry-picked and applied to stock pbs without needing all of my config/SLURM stuff.
Specific (generic) example:
A computing system has three clusters, A, B, and C. Each cluster has its own queue (i.e., jobs submitted while logged into A do NOT show up when logged into B and running qstat/squeue), BUT, all clusters share the same /home (e.g., A:/home/liz and B:/home/liz point to the same place).
If I run any pbs command (e.g. pstat) on A, /home/liz/.pbs/jobs.db is created and populated with info from A's queue. If I then construct a pbs.Job and call pbs.Job.submit, a job is entered into /home/liz/.pbs/jobs.db; let's call this job 1001. 1001 is marked 'Q' and then 'R' as I run 'pstat' a few times and wait for the queue to clear.
Now, I exit A and ssh into B. I run 'pstat' again, and pbs.JobDB.update is called. squeue/qstat doesn't see the job I submitted on A (in A's queue), so, per the behavior in pbs/pbs/jobdb/update:
The job I submitted on A, 1001, then gets marked as "C" in job.db. Now, even if I ssh back into A and run pstat again, I have a problem:
Job 1001 is never checked during JobDB.update ever again, because pbs thinks it's complete!
I'm not sure of the best way to handle this, but maybe jobs.db needs to carry data about which cluster/queue each job was submitted from. Then only jobs native to that cluster/queue would be updated (and hence, since 1001 is native to A, a 'pstat' query triggering a JobDB.update call would NOT check 1001). The naive solution is to change the query to
thereby checking all jobs in the jobs.db. But this will quickly become time-consuming if a user is not regularly purging ~/.pbs/jobs.db (which they shouldn't have to do!).
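To sketch the cluster/queue-tagging idea (this is hypothetical, not what pbs does today): suppose jobs.db gained a cluster column written at submit time; update() could then filter on it with plain equality instead of a hostname regexp.

```python
import sqlite3

# Hypothetical schema: a 'cluster' column recorded when the job is submitted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (jobid INTEGER, jobstatus TEXT, cluster TEXT)")
conn.execute("INSERT INTO jobs VALUES (1001, 'R', 'A')")
conn.execute("INSERT INTO jobs VALUES (2002, 'R', 'B')")

current_cluster = "A"  # would be derived from the stripped hostname
rows = conn.execute(
    "SELECT jobid FROM jobs WHERE jobstatus != 'C' AND cluster = ?",
    (current_cluster,),
).fetchall()
print(rows)  # only job 1001 is a candidate for update on cluster A
```

A plain equality match also sidesteps SQLite's REGEXP operator, which isn't defined unless the application registers its own regexp function.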
Thoughts?