prominence-eosc / prominence

PROMINENCE server

Optionally allow jobs to remain in queue until resources become available #91

Closed alahiff closed 4 years ago

alahiff commented 5 years ago

Currently, if there are no resources available for a job, it fails immediately and it's up to the user to resubmit. This is Shaun's preferred model. However, many users expect traditional batch-system behaviour, i.e. jobs remain in queues, and can be confused by the different behaviour in PROMINENCE.

We should give users a choice, e.g. jobs fail immediately or jobs can wait for up to X mins for deployment.

alahiff commented 4 years ago

For workflows, jobs will remain in the queue until they can be run.

For jobs, the following can be set in policies:

but they are not actually used anywhere yet.

alahiff commented 4 years ago

I think only the following change is required to allow maximumTimeInQueue to work:

SYSTEM_PERIODIC_REMOVE in /etc/condor/config.d/40-rules contains isUndefined(DAGManJobId), i.e. it will completely remove failed jobs if they are not from a workflow. This will need to be changed to something like:

isUndefined(DAGManJobId) && (ProminenceMaxTimeInQueue == -1 || ProminenceMaxTimeInQueue > -1 && CurrentTime - QDate > ProminenceMaxTimeInQueue)
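
To make the logic of the proposed clause easier to check, here is a minimal Python sketch that mirrors it term by term. It is only an illustration, not PROMINENCE or HTCondor code: the function name and arguments are hypothetical, it assumes ProminenceMaxTimeInQueue is expressed in the same units as CurrentTime - QDate (seconds), and it does not model ClassAd UNDEFINED semantics or any other clauses that SYSTEM_PERIODIC_REMOVE may contain.

```python
from typing import Optional


def matches_removal_clause(
    dagman_job_id: Optional[int],
    max_time_in_queue: int,
    current_time: int,
    qdate: int,
) -> bool:
    """Mirror of the proposed clause (hypothetical helper, for illustration).

    dagman_job_id     -- None models an undefined DAGManJobId (not a workflow job)
    max_time_in_queue -- models ProminenceMaxTimeInQueue (assumed to be in seconds)
    current_time      -- models CurrentTime (Unix timestamp)
    qdate             -- models QDate, the submission time (Unix timestamp)
    """
    # isUndefined(DAGManJobId): the job does not belong to a workflow.
    not_from_workflow = dagman_job_id is None

    # ProminenceMaxTimeInQueue == -1: true regardless of time spent queued.
    is_minus_one = max_time_in_queue == -1

    # ProminenceMaxTimeInQueue > -1 && CurrentTime - QDate > ProminenceMaxTimeInQueue
    # (&& binds tighter than ||, so this pair is evaluated as one unit).
    exceeded_limit = (
        max_time_in_queue > -1
        and (current_time - qdate) > max_time_in_queue
    )

    return not_from_workflow and (is_minus_one or exceeded_limit)


if __name__ == "__main__":
    # A non-workflow job with a 600 s limit that has been queued for 100 s.
    print(matches_removal_clause(None, 600, 1000, 900))
    # The same job with a 60 s limit.
    print(matches_removal_clause(None, 60, 1000, 900))
```

As written, the clause never matches workflow jobs, always matches non-workflow jobs whose value is -1, and otherwise matches only once the time since QDate exceeds the configured value.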

alahiff commented 4 years ago

Closing this as it has been implemented and tested.