riga / law

Build large-scale task workflows: luigi + job submission + remote targets + environment sandboxing using Docker/Singularity
http://law.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Summarize remote jobs failure modes #135

Closed lmoureaux closed 1 year ago

lmoureaux commented 1 year ago

When remote jobs fail, Law prints an error message that looks like this:

95 failed job(s) in task TrainDEAllSignalRegions__1__False_8b0c23d80f:
    job: 24, branch(es): 23, id: 164606, status: retry, code: 0, error: BeginTime
    job: 29, branch(es): 28, id: 168827, status: retry, code: 0, error: TIMEOUT
    job: 31, branch(es): 30, id: 161186, status: retry, code: 0, error: QOSMaxJobsPerUserLimit
    job: 35, branch(es): 34, id: 168828, status: retry, code: 0, error: TIMEOUT
    job: 36, branch(es): 35, id: 164611, status: retry, code: 60, error: BeginTime, job script error: task execution failed
    ... and 90 more

The code is here: https://github.com/riga/law/blob/70a223f57512e9e3e377a27709ad74067dacf07f/law/workflow/remote.py#L945

I think it would be more informative to have a breakdown by the type of error, something like:

95 failed job(s) in task TrainDEAllSignalRegions__1__False_8b0c23d80f:
    42 branches in 42 jobs with status: retry, code: 0, error: BeginTime
    42 branches in 42 jobs with status: retry, code: 0, error: TIMEOUT
    10 branches in 10 jobs with status: retry, code: 0, error: QOSMaxJobsPerUserLimit
    3 branches in 3 jobs with status: retry, code: 0, error: BeginTime, job script error: task execution failed
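
A minimal sketch of the grouping I have in mind, not the actual code in `law/workflow/remote.py`. It assumes each failed job is available as a plain dict with the hypothetical keys `branches`, `status`, `code` and `error`; the function name `summarize_failed_jobs` is also made up for illustration:

```python
from collections import OrderedDict


def summarize_failed_jobs(task_id, failed_jobs):
    """Group failed jobs by (status, code, error) and return the summary lines.

    The dict layout of ``failed_jobs`` entries is an assumption made for this
    sketch and may not match the job data that law stores internally.
    """
    # key: (status, code, error) -> value: (number of jobs, number of branches)
    groups = OrderedDict()
    for job in failed_jobs:
        key = (job["status"], job["code"], job["error"])
        n_jobs, n_branches = groups.get(key, (0, 0))
        groups[key] = (n_jobs + 1, n_branches + len(job["branches"]))

    lines = ["{} failed job(s) in task {}:".format(len(failed_jobs), task_id)]

    # Most frequent failure mode first; sorted() is stable, so groups with the
    # same job count keep the order in which they were first seen.
    ordered = sorted(groups.items(), key=lambda item: -item[1][0])
    for (status, code, error), (n_jobs, n_branches) in ordered:
        lines.append(
            "    {} branches in {} jobs with status: {}, code: {}, error: {}".format(
                n_branches, n_jobs, status, code, error
            )
        )
    return lines


# Example input modelled on the first few jobs of the message above.
failed = [
    {"branches": [23], "status": "retry", "code": 0, "error": "BeginTime"},
    {"branches": [28], "status": "retry", "code": 0, "error": "TIMEOUT"},
    {"branches": [34], "status": "retry", "code": 0, "error": "TIMEOUT"},
]
print("\n".join(summarize_failed_jobs("TrainDEAllSignalRegions__1__False_8b0c23d80f", failed)))
```

Grouping on the full `(status, code, error)` tuple keeps jobs with the same error but a different exit code on separate lines, as in the last line of the proposed output.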

riga commented 1 year ago

Sounds good, and this should be easy to achieve.