Job launchers to release previous jobs on startup

waikato-ufdl / ufdl-backend

User-Friendly Deep Learning (UFDL) - backend system.

Apache License 2.0


Job launchers to release previous jobs on startup #85

Closed fracpete closed 4 years ago

fracpete commented 4 years ago

Job launchers may crash (e.g. on a server reboot) before they can finish a job and notify the backend accordingly.
A dangling job can prevent a worker node from acquiring/executing new jobs when it becomes available again.
On startup, job launchers should be able to release any previous job.

csterling commented 4 years ago

If a node crashes before it finishes a job, it still owns the job, so on startup it can query for its current job and then finalise/reset it.

fracpete commented 4 years ago

That could work. I presume I would use job.list(...) with an appropriate filter to locate the jobs for the node and then call job.reset_job(...) on each. What would the filter expression look like?

csterling commented 4 years ago

Do an exact filter on the node field with the pk of the node, i.e.:

{
    "expressions": [
        {
            "type": "exact",
            "field": "node",
            "value": 1
        }
    ]
}
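For illustration, the startup step discussed above could be sketched roughly as follows. This is not the actual implementation: `list_jobs` and `reset_job` are hypothetical callables standing in for job.list(...) and job.reset_job(...), whose real signatures are not shown in this thread, and the sketch assumes each returned job is a dict with a "pk" key.

```python
from typing import Any, Callable, Dict, Iterable, List


def node_filter(node_pk: int) -> Dict[str, Any]:
    """Build the exact-match filter on the node field shown above."""
    return {
        "expressions": [
            {"type": "exact", "field": "node", "value": node_pk},
        ]
    }


def release_previous_jobs(
    node_pk: int,
    list_jobs: Callable[[Dict[str, Any]], Iterable[Dict[str, Any]]],
    reset_job: Callable[[int], Any],
) -> List[int]:
    """Reset every job the backend still lists for this node.

    list_jobs and reset_job are hypothetical stand-ins for the client's
    job.list(...) and job.reset_job(...); job records are assumed to be
    dicts carrying a "pk" key.
    """
    released = []
    for job in list_jobs(node_filter(node_pk)):
        reset_job(job["pk"])
        released.append(job["pk"])
    return released
```

A launcher would call `release_previous_jobs(...)` once on startup, before asking for new work. The thread does not say whether the node filter also matches jobs that have already finished, so a real launcher may need an additional check before resetting.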

fracpete commented 4 years ago

Implemented.