Maybe add `job.reset()` after line 27 of sync_actions.py?
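A sketch of what that change might look like, assuming the stop action around that point in sync_actions.py already interrupts the job through a handle named `job` (paraphrased, not the actual source):

```python
# Paraphrased sketch of the proposed fix in the stop action of
# sync_actions.py. interrupt() stands in for whatever the stop path
# already does; the addition is the reset() call afterwards.
def stop_job(job):
    job.interrupt()  # existing behavior: revoke the enqueued/active tasks
    job.reset()      # proposed: clear the job's Redis state so done() can return True
```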
Hi Alan, could you have a quick look at this one? What do you think about the proposed solution?
Sorry for the delay...
At a glance, I think the proposed solution sounds reasonable. I would need to do some more reading about `interrupt`, but calling `cleanup` after stopping a job seems to just make sense.
Sorry for the very long delay in this...
I have confirmed that the proposed fix does indeed cause the `start` subcommand to return. There's no indication that the job was stopped in any way - it just exits with an error code of 0. If `--progress` is set, the task count is set to nothing and the progress bar fills up once the job is stopped:
```
[Elapsed Time: 0:00:08] |##############################| (Time: 0:00:08) count: ------ tasks: ------ failures: ------ retries: ------
```
I think this is better than waiting forever (requiring the script to be terminated).
If we wanted to communicate that the job was stopped, we would need to add another Redis query to see that there are tasks but that the job is gone. I'm not sure how much value that adds.
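For what it's worth, that extra check might look roughly like this; the key names and the exists-style queries are assumptions about how the job state is laid out in Redis, not the project's actual scheme:

```python
import redis

r = redis.Redis()  # assumed connection settings

def job_was_stopped(job_name):
    # Hypothetical heuristic: task state was recorded for the job, but
    # the job's own key is gone, suggesting it was stopped rather than
    # having run to completion.
    tasks_remain = r.exists(f"{job_name}:tasks")  # illustrative key name
    job_gone = not r.exists(job_name)
    return bool(tasks_remain) and job_gone
```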
> There's no indication that the job was stopped in any way - it just exits with an error code of 0.
Do you know if the task still runs after the subcommand returns control?
I'm not really sure.
As far as I can tell, all "active" tasks are stopped, which means that `revoke` is called with `terminate=True` on the "count" handle, which is labeled as "queued tasks". Then, the main task is stopped, which means `revoke` is called with `terminate=False` on the job itself. Any tasks which were enqueued but never executed are - I think - scooped up in the revocation of the "active" tasks.
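In Celery terms (which this project builds on), the pattern described above would look something like the following; the variable names are illustrative, not the project's actual code:

```python
from celery import Celery

app = Celery("sync", broker="redis://localhost:6379/0")  # assumed broker URL

def stop_tasks(queued_task_ids, main_task_id):
    # Revoke every enqueued/"active" task; terminate=True also kills
    # tasks a worker has already started executing.
    for task_id in queued_task_ids:
        app.control.revoke(task_id, terminate=True)
    # Revoke the main job task without terminating it mid-execution.
    app.control.revoke(main_task_id, terminate=False)
```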
Fully understanding how this works is probably important. The other clue here is what `reset` is actually doing that `terminate` and `cleanup` are not doing:

https://github.com/irods/irods_capability_automated_ingest/blob/dfbe556e2a1f74b7b0c59bf1210e212a1d86b9fe/irods_capability_automated_ingest/sync_job.py#L64-L69
- `count_handle` appears to be the list of enqueued tasks.
- `dequeue_handle` refers to the list of tasks which have been completed.
- `tasks_handle` is actually the task count. Why `count_handle` is not the task count, I do not know.
- `failures_handle` is a count of tasks which have been executed but failed.
- `retries_handle` is a count of tasks which have been executed, failed, and retried.
`done()` is checking to see whether the `tasks_handle` count is 0 or `None`. `interrupt` revokes all of the enqueued tasks, but never resets the `tasks_handle` count. Maybe... it should just do that...?
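For reference, the check described above amounts to something like this (paraphrased; `get_value` is an assumed accessor on the handle):

```python
# Paraphrased sketch of done(): the job counts as finished when the
# task count is absent or zero. Since interrupt revokes the tasks but
# never clears this counter, done() can never become True after a stop.
def done(self):
    task_count = self.tasks_handle().get_value()
    return task_count is None or int(task_count) == 0
```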
Then again, we probably want to reset everything else for the job, too, so calling `reset` is, in the end, the most straightforward approach.

That still does not solve the issue of communicating that the job was stopped, but I think that would require writing that down in Redis (a "stopped jobs" list, perhaps). Seems like a step beyond what is being requested in this issue.
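If someone did want that, a minimal sketch of the idea, with a purely hypothetical key name:

```python
import redis

r = redis.Redis()  # assumed connection settings

STOPPED_JOBS_KEY = "automated_ingest:stopped_jobs"  # hypothetical key

def mark_stopped(job_name):
    # Record the stop so a monitoring process can distinguish
    # "stopped" from "finished".
    r.sadd(STOPPED_JOBS_KEY, job_name)

def was_stopped(job_name):
    return r.sismember(STOPPED_JOBS_KEY, job_name)
```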
Processes monitoring jobs which are stopped will now exit cleanly and print a warning message. Please open new issues for specific improvements if desired.
When a synchronous sync job is started, the command doesn't exit when the corresponding job is stopped through the CLI. The start job command keeps waiting for `job.done()` forever. Because of this issue it's not possible to interrupt a synchronous sync job started from a cronjob. The cronjob keeps running forever, unless the process is killed.

The root cause seems to be that the `sync_job.done` function checks the `tasks` key of the job, but this key is not removed by the `stop_job` action.
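The hang can be pictured as a polling loop of roughly this shape (a sketch of the reported behavior, not the project's actual code):

```python
import time

def wait_for_job(job):
    # The start subcommand effectively spins here. Because stop_job
    # revokes the tasks without removing the job's tasks key, done()
    # never returns True and the loop never exits.
    while not job.done():
        time.sleep(1)
```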