phac-nml / irida-next

IRIDA Next
https://phac-nml.github.io/irida-next/
Apache License 2.0
8 stars 3 forks source link

Workflow executions: Requeue jobs #546

Closed JeffreyThiessen closed 5 months ago

JeffreyThiessen commented 5 months ago

What does this PR do and why?

Describe in detail what your merge request does and why. Fixes #513

  1. Adds retry and error rescue to jobs that connect to ga4gh wes. Details below.

  2. Fixes a race condition bug where a job that a user cancelled sometimes finishes and completes(writes to samples) instead of cancelling.

  3. Created a new set of helper functions ActiveJobTestHelpers to allow mocking Faraday based API integrations for active job tests.

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other pull requests.

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

You will need to have Sapporo running and connected to test

The following 3 jobs reach out to ga4gh wes.

  1. WorkflowExecutionStatusJob
  2. WorkflowExecutionSubmissionJob
  3. WorkflowExecutionCancelationJob

If ga4gh wes is unreachable or returns an error code, it must be handled so that the workflow execution doesn't get stuck in whatever state it was in when the error occurred.

For ConnectionError type

This error type should always be reattempted.

You can simply turn off your Sapporo instance at the different stages of the workflow execution.

Verify that the job is reattempted continually.

For APIExceptionError types

These error types should be attempted 5 times before being put into the error state

You can make workflow_execution services throwto test the job re-queuing by adding a line to throw the exception. For example:

# app/services/workflow_executions/submission_service.rb#execute
def execute
      return false unless @workflow_execution.prepared?

      raise Integrations::ApiExceptions::BadRequestError, 'my thrown exception'

      run = @wes_client.run_workflow(**@workflow_execution.as_wes_params)
...

WorkflowExecutionCancelationJob

WorkflowExecutionCancelationJob is a special case because ga4gh wes can return 401 and/or 403 errors when a cancel request is sent to a completed run (Sapporo always replies with 200).

In this case, we want to mark those runs as canceled and not move on to completion steps. Other errors should still behave as above.

PR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

github-actions[bot] commented 5 months ago

Simplecov Report

Covered Threshold
92.16% 90%