sul-dlss / argo

The administrative discovery interface for Stanford's Digital Object Registry

preservation-ingest-initiated, publish-complete, shelve-complete workflow steps should be resettable #1789

Closed andrewjbtw closed 4 years ago

andrewjbtw commented 4 years ago

Describe the bug

The preservation-ingest-initiated, publish-complete, and shelve-complete steps cannot be re-run from Argo. If you set them to "waiting" they will remain waiting forever. The only way to have these steps re-run is to re-run the immediately preceding step. This usually has to be done from the rails console or with a rake task on workflow-server-rails.
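For reference, the console reset in question looks roughly like the sketch below. This is a minimal illustration only, assuming workflow-server-rails has a WorkflowStep ActiveRecord model with druid, workflow, process, version, and status columns; the exact model and column names may differ, so treat this as illustrative rather than a recipe.

```ruby
# Minimal sketch (not the actual procedure): re-run preservation-ingest-initiated
# by setting the immediately preceding step back to "waiting" from the
# workflow-server-rails console. Model and column names are assumptions.
step = WorkflowStep.find_by(
  druid: 'druid:bb123cd4567',      # placeholder druid
  workflow: 'accessionWF',
  process: 'sdr-ingest-transfer',  # the step just before preservation-ingest-initiated
  version: 1                       # the object's current version
)
step.update(status: 'waiting')     # re-running this step is what re-triggers the stuck one
```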

User impact

For general users, this is a confusing and frustrating situation because they can use Argo to "re-run" or set one of these steps to waiting, but nothing will happen.

For repository administrators, this is a frustrating situation because resetting these steps from workflow-server-rails can be a time-consuming process. Also, if a user has previously reset one of these steps, it can be difficult to track down the issue that led them to reset it, because the act of resetting clears the error message.

Additional context

In a previous discussion a few months ago, it was thought that there was no way to reset Sidekiq jobs from Argo. But technical metadata also uses Sidekiq and is resettable from Argo, so this seems possible to implement. See Justin Littman's comment below for background on technical-metadata: https://github.com/sul-dlss/argo/issues/1789#issuecomment-617730236


Previous description (retained in case we can't make the steps resettable from Argo again)

Describe the bug

The sdr-ingest-transfer and preservation-ingest-initiated steps in the accessionWF do not report status correctly to Argo. As far as I can tell, they display either "waiting" or "completed" but not "queued" or "started" in the Argo interface.

This makes it hard to tell when work is being done for these steps.

I also cannot find their jobs on the resque list even when I know one of these steps must be doing work. I've spent time watching druids go through the robots console and I see steps before and after these two steps, but never these two steps.

The dor-services Sidekiq console does show jobs, which I assume include the jobs these steps run.

User impact

For long-running jobs, it's difficult to tell if work is still happening at these steps. This has led me to reset some jobs that I thought had died but which were still running. I suspect that eventually led to some errors I've been seeing with bag validation on this one giant druid, because more than one job was working on it at the same time, even after the bag manifest was created.

This is the specific druid I was working on when I found this issue: https://argo.stanford.edu/view/druid:hn264yt9694

That druid includes a 1.3 TB tar file, and was failing to complete some steps back when the sweeper was catching it at 12 hours. After the new "started" status was created, I reset the druid and it seemed to be working at the earlier steps in assemblyWF and accessionWF, but then always looked like it had stalled out at "waiting" at preservation-ingest-initiated.

Since the steps never showed "queued" or "started", I thought I needed to reset it to start it up again, so I did that a few times after waiting a few hours between each reset. I only realized this morning that the resets were actually starting new background jobs. As of this moment there are five different PreserveJobs for this druid on Sidekiq.

Expected behavior

Argo should accurately report when a job is still running on a druid, so that accessioners know they should keep waiting for it to finish.

Screenshots

Currently, Argo shows that preservation-ingest-initiated has been at "waiting" for 12 hours, which matches when I reset the accessionWF last night:

Screen Shot 2020-01-30 at 12 09 32 PM

Sidekiq shows five jobs for this druid at the moment:

Screen Shot 2020-01-30 at 12 08 40 PM

Additional context

I'm not sure which step does which task, but I believe these two steps combine to:

  1. Verify checksums in the contentMetadata
  2. Create the bag to be sent to preservation
  3. Trigger the preservationIngestWF and tell the preservation-robots to start running.

For large sets of data these jobs will run for an hour or more, so we really need to be able to see the status during this time.
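As a rough illustration of item 1 above (not the actual dor-services-app code), checksum verification amounts to re-digesting every file listed in contentMetadata and comparing against the recorded values, which is why these steps take so long on multi-terabyte content:

```ruby
# Illustrative sketch only: re-verify MD5 fixity for files listed in contentMetadata.
# The real implementation lives in dor-services-app and differs in structure and detail.
require 'digest'
require 'nokogiri'

def verify_checksums(content_metadata_xml, base_dir)
  doc = Nokogiri::XML(content_metadata_xml)
  doc.xpath('//file').map do |file_node|
    path     = File.join(base_dir, file_node['id'])
    recorded = file_node.at_xpath("checksum[@type='md5']")&.text
    computed = Digest::MD5.file(path).hexdigest # slow for a 1.3 TB tar file
    [file_node['id'], recorded == computed]
  end
end
```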

jcoyne commented 4 years ago

@andrewjbtw because these steps are skip-queue, they never become queued. See the documentation in https://github.com/sul-dlss/workflow-server-rails/#workflow-service

andrewjbtw commented 4 years ago

That's fine - my main concern is understanding that work is being done at these steps. They did report status differently previously.

I think the 1.3TB druid will eventually work once it's down to just one job, since it escapes the sweeper now.

jcoyne commented 4 years ago

Something is definitely wrong with the multiple Preserve jobs for the same druid though. That should never happen. It looks like that sdr-ingest-transfer step has been started multiple times. I think if we pushed something like this https://github.com/sul-dlss/argo/pull/1788 (waiting for your review) to production, situations like this would be easier to debug.

ndushay commented 4 years ago

@jcoyne Andrew couldn't see what was happening, so he restarted the step(s) multiple times.

I have a relatively simple proposal that would make it more obvious that bag creation was in progress at the Argo/workflow level:

  1. separate the dor-services-app calls for "create deposit bag" and "start preservation workflow"
  2. have the accessionWF step that creates the deposit bag be distinct from the step that starts the preservation workflow.

Work for the above would be:

Clearly, the above wouldn't affect the status reported, but it would make it more obvious that preservationIngestWF hadn't yet been notified.

Reporting the status more accurately is a different matter. Both may be worthy.

ndushay commented 4 years ago

@justinlittman how hard would it be to display "started" in the Argo interface (to distinguish 'queued'/waiting from an it-is-now-running status)?

andrewjbtw commented 4 years ago

Right, I manually restarted sdr-ingest-transfer multiple times on that one because it never looked like it was actually doing anything. I knew the step should take hours, so Argo reporting it as completed and then moving on almost immediately was very confusing. I checked /dor/export to see if the bag was being created and I checked the robots console to see if a job was running.

The events service will help, but as long as Argo reports workflow step status the way it does now, those statuses should be accurate, or at least not misleading. Even if it just says something like "started" for hours while the background job runs.

justinlittman commented 4 years ago

So I would assume/hope that it would display whatever the status is, and that "started" doesn't need additional coding.

I think the problem is that some steps are handled by dor-services-app. While they may seem like workflow steps, they are actually events. Justin and I had some discussion about these when "started" was added. @jcoyne want to step in here?

ndushay commented 4 years ago


Ndushay> do we have an easy way to identify those steps that will wait for things that happen in the background??
JCoyne> not in the UI
Andrew> this might be the only one at the moment - for shelve and publish I know to wait for shelve-complete and publish-complete
Andrew> there may be other steps like this but they don't take as long so I've never noticed

ndushay commented 4 years ago

Noted in the Production Priorities spreadsheet as "Hold" for @vivnwong to set up a meeting. Noting here that Andrew was okay with waiting for Vivian to do this.

ndushay commented 4 years ago

Possibly could use the same approach as that used by techMd for interactions ... in Argo? Not sure. Briefly discussed Mon 4/20 in planning.

Andrew would like to be able to reset steps in Argo without a lot of extra steps.

Andrew would ALSO like to include publish-complete and shelve-complete, but they don't error.

justinlittman commented 4 years ago

The approach used in technical metadata is:

  1. Robot starts technical-metadata step.
  2. Robot sets technical-metadata workflow status to started: https://github.com/sul-dlss/lyber-core/blob/master/lib/lyber_core/robot.rb#L53
  3. Robot calls technical metadata service.
  4. Robot returns noop, which means technical-metadata workflow status is not changed: https://github.com/sul-dlss/common-accessioning/blob/master/lib/robots/dor_repo/accession/technical_metadata.rb#L30 (The workflow status for a step is normally set to completed after a robot is done.)
  5. Technical metadata service does its work.
  6. Technical metadata service sets technical-metadata workflow status to completed when done: https://github.com/sul-dlss/technical-metadata-service/blob/master/app/jobs/technical_metadata_workflow_job.rb#L24

Seems like the same approach could be taken for publish, shelve, and preserve.
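A rough sketch of what that pattern might look like for preserve is below. The class structure, the preserve call, and the exact client signatures are assumptions extrapolated from the linked technical-metadata code, not the actual common-accessioning or dor-services-app implementation.

```ruby
# Robot side (sketch): kick off the asynchronous preservation work and return a
# noop ReturnState, so the workflow step stays "started" instead of being marked
# "completed" by lyber-core when perform returns.
def perform(druid)
  Dor::Services::Client.object(druid).preserve # assumption: async call that enqueues PreserveJob
  LyberCore::Robot::ReturnState.new(status: 'noop',
                                    note: 'PreserveJob will set this step to completed')
end

# Service side (sketch): at the end of the background job that actually does the
# work, mark the workflow step completed, as technical-metadata-service does.
workflow_client.update_status(druid: druid,
                              workflow: 'accessionWF',
                              process: 'sdr-ingest-transfer',
                              status: 'completed')
```

Publish and shelve would follow the same shape: the robot returns noop after handing off the work, and whichever service finishes it flips publish-complete or shelve-complete to completed.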