This PR makes the following changes:

The `ExecutionController` `_verifyExecution()` function now marks the execution status as failed if the current status is a `running` status (`recovering`, `running`, `failing`, `paused`, `stopping`). We can't start from a running state, so the job will be stopped, but previously the status was not being updated. The assumption is that one of these statuses means the execution controller pod has crashed and k8s is restarting a new pod. We would never get into this situation in native clustering, as there are checks before the execution controller process is forked.
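As a rough illustration (not the actual Teraslice code), here is a minimal TypeScript sketch of the kind of guard `_verifyExecution()` now applies; the `RUNNING_STATUSES` constant, `ExecutionRecord` type, and `markFailed` helper are hypothetical names for this example, and only the status values come from this PR:

```typescript
// Statuses that indicate the execution was mid-flight when the previous
// execution controller pod died; the values come from this PR description.
const RUNNING_STATUSES = [
    'recovering', 'running', 'failing', 'paused', 'stopping',
] as const;

// Illustrative shape of an execution record; the real store holds more fields.
interface ExecutionRecord {
    ex_id: string;
    status: string;
}

// Hypothetical helper standing in for the store update that sets the
// execution status to "failed" and records why.
async function markFailed(ex: ExecutionRecord, reason: string): Promise<void> {
    ex.status = 'failed';
    console.error(`execution ${ex.ex_id} marked as failed: ${reason}`);
}

// A freshly started execution controller pod should never find its execution
// already in a running status; if it does, the previous pod most likely
// crashed, so fail the execution instead of leaving the stale status behind.
async function verifyExecution(ex: ExecutionRecord): Promise<boolean> {
    if ((RUNNING_STATUSES as readonly string[]).includes(ex.status)) {
        await markFailed(
            ex,
            `cannot start from the running status "${ex.status}"; ` +
            'the previous execution controller likely crashed',
        );
        return false;
    }
    return true;
}
```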
The `ExecutionService` `_finishExecution()` function also marks the execution status as failed if the execution is in a `running` status. It looks like this function is designed to run after an execution controller error and shutdown, so it's safe to assume a `running` status means there was an error and the execution controller shut down before the status was updated.
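In the same spirit, a minimal sketch of the second change, reusing the illustrative `RUNNING_STATUSES`, `ExecutionRecord`, and `markFailed` definitions from the sketch above:

```typescript
// _finishExecution() runs after the execution controller has errored and
// shut down, so an execution still showing a running status was never
// updated on the way down and should be failed here.
async function finishExecution(ex: ExecutionRecord): Promise<void> {
    if ((RUNNING_STATUSES as readonly string[]).includes(ex.status)) {
        await markFailed(
            ex,
            `execution controller shut down while the status was still "${ex.status}"`,
        );
    }
}
```

Ref: #2673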