Closed alahiff closed 4 years ago
For the case of workflows, jobs will remain in the queue until they can be run.
For jobs, we can have the following in `policies`:

- `maximumTimeInQueue`: maximum time a job can be queued before it is automatically deleted.
- `maximumIdleTimePerResource`: maximum time a job can remain queued at an individual site before trying elsewhere (mainly aimed at HTC/HPC resources, which of course have their own queues).

Neither of these is actually used anywhere yet.
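As a sketch, the `policies` block in a job description might look like the following; the surrounding structure and the use of seconds as the unit are assumptions, not confirmed by this issue:

```json
{
  "policies": {
    "maximumTimeInQueue": 3600,
    "maximumIdleTimePerResource": 600
  }
}
```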
I think only the following change is required to allow `maximumTimeInQueue` to work: `SYSTEM_PERIODIC_REMOVE` in `/etc/condor/config.d/40-rules` contains `isUndefined(DAGManJobId)`, i.e. it will completely remove failed jobs if they are not from a workflow. Will need to change this to something like:

```
isUndefined(DAGManJobId) && (ProminenceMaxTimeInQueue == -1 || ProminenceMaxTimeInQueue > -1 && CurrentTime - QDate > ProminenceMaxTimeInQueue)
```
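As an illustration only, the proposed expression can be mirrored in Python to sanity-check its precedence (in ClassAds, as in Python, `&&`/`and` binds tighter than `||`/`or`, so the `-1` case short-circuits first). The interpretation of `-1` as "keep the current remove-immediately behaviour" is my assumption based on the expression, not confirmed semantics:

```python
# Hypothetical Python mirror of the proposed ClassAd removal condition,
# evaluated for a failed job where isUndefined(DAGManJobId) is already true.
def should_remove(max_time_in_queue: int, q_date: int, current_time: int) -> bool:
    """Return True if the job should be removed from the queue.

    Assumption (not confirmed in this issue): -1 means no queueing is
    allowed (remove at once), while a positive value is the maximum
    number of seconds the job may remain queued since submission (QDate).
    """
    return (max_time_in_queue == -1
            or (max_time_in_queue > -1
                and current_time - q_date > max_time_in_queue))
```

For example, `should_remove(600, q_date=0, current_time=601)` is `True` (the job has been queued longer than its 600-second limit), while `should_remove(600, q_date=0, current_time=600)` is `False`.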
Closing this as it has been implemented and tested.
Currently, if there are no resources available for a job it will fail immediately and it is up to the user to resubmit. This is Shaun's preferred model. However, many users expect traditional batch-system behaviour, i.e. jobs remain in queues, and can be confused by the different behaviour in PROMINENCE.

We should give users a choice, e.g. jobs fail immediately, or jobs can wait for up to X minutes for deployment.