scarlehoff / pyHepGrid

Tool for distributed computing management geared towards HEP applications.
GNU General Public License v3.0

Resubmit failed jobs #34

Open GandalfTheWhite2 opened 5 years ago

GandalfTheWhite2 commented 5 years ago

Would it be possible to implement an option for resubmitting jobs with status FAILED? Sometimes (not very often ;-) ) jobs fail for reasons unrelated to the job scripts, such as a failed file transfer. In that case it would be nice to be able to resubmit only the jobs that failed, e.g. the 7 "subjobs" (in ganga-speak) of job N that failed. It could be an option such as --resubmit_failed -j N.
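
To make the request concrete, here is a rough sketch of the behaviour I have in mind. It is not pyHepGrid code: the names `failed_subjobs` and `resubmit_failed`, the status mapping and the `submit` callback are all hypothetical, and only the FAILED status string comes from the request above.

```python
from typing import Callable, Dict, List

FAILED = "FAILED"  # status string as used in the request above


def failed_subjobs(statuses: Dict[int, str]) -> List[int]:
    """Return the indices of subjobs whose status is FAILED, in ascending order."""
    return sorted(idx for idx, status in statuses.items() if status == FAILED)


def resubmit_failed(statuses: Dict[int, str],
                    submit: Callable[[int], None]) -> List[int]:
    """Call the (hypothetical) submit callback once per failed subjob.

    Subjobs in any other state are left untouched. Returns the list of
    subjob indices that were resubmitted.
    """
    resubmitted = []
    for idx in failed_subjobs(statuses):
        submit(idx)
        resubmitted.append(idx)
    return resubmitted


if __name__ == "__main__":
    # Hypothetical statuses for the subjobs of one job.
    example = {1: "DONE", 2: "FAILED", 3: "RUNNING", 4: "FAILED"}
    print(resubmit_failed(example, lambda idx: print(f"resubmitting subjob {idx}")))
```

The point is only that the selection is driven by the per-subjob status, so completed and running subjobs are never submitted twice.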

DWalker487 commented 5 years ago

Hmmm, we already have a --resubmit flag (although it's for warmups only at the moment). I suspect the associated logic [programs.py] could be transplanted over to production mode relatively straightforwardly.

GandalfTheWhite2 commented 5 years ago

That would IMHO be a huge improvement, and help reduce the (sometimes) large frustration caused by random failures.