irods / irods_capability_automated_ingest


Synchronous start job command doesn't exit when the job is stopped with the cli #210

Closed rmoreas closed 3 minutes ago

rmoreas commented 1 year ago

When a synchronous sync job is started, the command doesn't exit when the corresponding job is stopped through the CLI. The start job command keeps waiting for job.done() forever.

Because of this issue, it's not possible to interrupt a synchronous sync job started from a cron job. The cron job keeps running forever unless the process is killed.

The root cause seems to be that the sync_job.done function checks the tasks key of the job, but this key is not removed by the stop_job action.
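To make the failure mode concrete, here is a minimal sketch of the kind of wait loop the synchronous start path effectively runs; the function and attribute names are assumptions for illustration, not the actual code:

```python
import time

def wait_for_synchronous_job(job):
    # done() effectively checks the job's Redis-backed task counter.
    # If stop_job revokes the tasks but never clears that counter,
    # done() stays False and this loop never returns.
    while not job.done():
        time.sleep(1)
```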

rmoreas commented 1 year ago

Maybe add job.reset() after line 27 of sync_actions.py?
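For illustration, a rough sketch of what that change might look like in the stop action; the function shape and names here are assumptions, not the actual sync_actions.py code:

```python
def stop_job(job):
    job.interrupt()  # existing behavior: revoke the job's Celery tasks
    job.reset()      # proposed addition: clear the Redis counters so done() can return True
```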

rmoreas commented 1 year ago

Hi Alan, could you have a quick look at this one? What do you think about the proposed solution?

alanking commented 1 year ago

Sorry for the delay...

At a glance, I think the proposed solution sounds reasonable. I would need to do some more reading about interrupt, but calling cleanup after stopping a job seems to make sense.

alanking commented 3 weeks ago

Sorry for the very long delay in this...

I have confirmed that the proposed fix does indeed cause the start subcommand to return. There's no indication that the job was stopped in any way - it just exits with an error code of 0. If --progress is set, the task count is set to nothing and the progress bar fills up once the job is stopped:

[Elapsed Time: 0:00:08] |##############################| (Time:  0:00:08) count: ------ tasks: ------ failures: ------ retries: ------

I think this is better than waiting forever (requiring the script to be terminated).

If we wanted to communicate that the job was stopped, we would need to add another Redis query to see that there are tasks but that the job is gone. I'm not sure how much value that adds.
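For reference, a hedged sketch of what such a check could look like; the key names and the job registry are assumptions, not the tool's actual Redis layout:

```python
import redis

r = redis.Redis()

def job_was_stopped(job_name):
    # Tasks were recorded for the job, but the job itself is no longer registered.
    tasks_remaining = r.get(f"{job_name}/tasks")
    job_registered = r.sismember("jobs", job_name)
    return tasks_remaining is not None and not job_registered
```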

korydraughn commented 3 weeks ago

There's no indication that the job was stopped in any way - it just exits with an error code of 0.

Do you know if the task still runs after the subcommand returns control?

alanking commented 3 weeks ago

I'm not really sure.

As far as I can tell, all "active" tasks are stopped, which means that revoke is called with terminate=True on the "count" handle, which is labeled as "queued tasks". Then, the main task is stopped, which means revoke is called with terminate=False on the job itself. Any tasks which were enqueued but never executed are - I think - scooped up in the revocation of the "active" tasks.
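As a standalone illustration of the two revocation modes described above (the app, broker URL, and task IDs below are placeholders, not the ingest tool's actual code):

```python
from celery import Celery

app = Celery("ingest", broker="redis://localhost:6379/0")

active_task_ids = ["task-id-1", "task-id-2"]  # hypothetical IDs of queued/active tasks
main_task_id = "main-task-id"                 # hypothetical ID of the main job task

# terminate=True also signals workers that are already executing these tasks.
app.control.revoke(active_task_ids, terminate=True)

# terminate=False only prevents the task from starting; a task already running
# is left alone.
app.control.revoke(main_task_id, terminate=False)
```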

Fully understanding how this works is probably important. The other clue here is what reset is actually doing that terminate and cleanup are not doing: https://github.com/irods/irods_capability_automated_ingest/blob/dfbe556e2a1f74b7b0c59bf1210e212a1d86b9fe/irods_capability_automated_ingest/sync_job.py#L64-L69

count_handle appears to be the list of enqueued tasks. dequeue_handle refers to the list of tasks which have been completed. tasks_handle is actually the task count. Why count_handle is not the task count, I do not know. failures_handle is a count of tasks which have been executed but failed. retries_handle is a count of tasks which have been executed, failed, and retried.
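To keep the terminology straight, here is a hedged sketch of how those handles could map onto Redis keys; the key names and value types are assumptions for illustration only:

```python
import redis

r = redis.Redis()
job = "myjob"

enqueued   = r.llen(f"{job}/count")     # count_handle: list of enqueued tasks
completed  = r.llen(f"{job}/dequeue")   # dequeue_handle: list of completed tasks
task_count = r.get(f"{job}/tasks")      # tasks_handle: the actual task counter
failures   = r.get(f"{job}/failures")   # failures_handle: executed-but-failed count
retries    = r.get(f"{job}/retries")    # retries_handle: failed-and-retried count
```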

done() is checking to see whether the tasks_handle count is 0 or None. interrupt revokes all of the enqueued tasks, but never resets the tasks_handle count. Maybe... it should just do that...?
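A minimal sketch of that idea, written against plain Redis rather than the actual sync_job class; the key name and function shapes are assumptions:

```python
import redis

r = redis.Redis()

def job_done(job_name):
    # Mirrors the done() check described above: no counter, or a counter of 0.
    count = r.get(f"{job_name}/tasks")
    return count is None or int(count) == 0

def interrupt_job(job_name, revoke_all_tasks):
    revoke_all_tasks()             # existing behavior: revoke enqueued/active tasks
    r.delete(f"{job_name}/tasks")  # the missing step: clear the counter that done() polls
```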

alanking commented 3 weeks ago

Then again, we probably want to reset everything else for the job too, so calling reset is, in the end, the most straightforward approach.

That still does not solve the issue of communicating that the job was stopped, but I think that would require writing that down in Redis (a "stopped jobs" list, perhaps). Seems like a step beyond what is being requested in this issue.
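A rough sketch of that idea, with assumed key names, just to show the shape of it:

```python
import redis

r = redis.Redis()

def mark_stopped(job_name):
    # Called by the stop action so monitors can tell "stopped" from "finished".
    r.sadd("stopped_jobs", job_name)

def exit_reason(job_name):
    return "stopped" if r.sismember("stopped_jobs", job_name) else "finished"
```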

alanking commented 3 minutes ago

Processes monitoring jobs which are stopped will now exit cleanly and print a warning message. Please open new issues for specific improvements if desired.