microsoft / ga4gh-tes

C# implementation of the GA4GH TES API; provides distributed batch task execution on Microsoft Azure
MIT License
34 stars 27 forks source link

Tes compliance: support multiple executors #427

Open uniqueg opened 1 year ago

uniqueg commented 1 year ago

Problem:

As per the current TES specification, multiple executors can be specified in a single request:

components:
  ...
  schemas:
    ...
    tesTask:
      ...
      properties:
        ...
        executors:
          type: array
          description: |-
            An array of executors to be run. Each of the executors will run one
            at a time sequentially. Each executor is a different command that
            will be run, and each can utilize a different docker image. But each of
            the executors will see the same mapped inputs and volumes that are declared
            in the parent CreateTask message.

            Execution stops on the first error.
          items:
            $ref: '#/components/schemas/tesExecutor'

Since TES v1.1, the behavior for the case in which one of multiple executors fails can further be modified with the ignore_error property of the tesExecutor schema:

components:
  ...
  schemas:
    ...
    tesExecutor:
      ...
      properties:
        ...
        ignore_error:
          type: boolean
          description: |-
            Default behavior of running an array of executors is that execution
            stops on the first error. If `ignore_error` is `True`, then the
            runner will record error exit codes, but will continue on to the next
            tesExecutor.

As discussed with @MattMcL4475, multiple executors are currently not supported in TES on Azure.

However, some client implementations such as workflow engines may rely on this feature, e.g., for setup and teardown steps or for scheduling two or more closely related tasks with similar requirements to minimize overheads.

Solution:

  1. Implement support for multiple executors as described.
  2. Implement the ignore_error property of the tesExecutor schema as described.

Describe alternatives you've considered N/A

Code dependencies Not sure.

Additional context Appropriate tests for this behavior are currently unavailable (see https://github.com/ga4gh/compliance-tests-ga4gh-tes/issues/2).

MattMcL4475 commented 1 month ago

@adamnovak would implementing this issue be a significant help to implementing TES with Toil? We are doing a round of issue prioritization and your input on this would be helpful, thanks!

adamnovak commented 1 month ago

I think that having access to multiple executors would be a big help for allowing Toil to efficiently implement some parts of WDL, such as the glob() function. I think if we could run some Toil code against the same volumes as the user container, and use TES's ability to set up a wildcard match for outputs like output_files/*/*, it would be possible to fully implement WDL without uploading any unnecessary data to shared storage. We might still be able to achieve this without multiple executors, but it would involve either implementing a lot of WDL in Bash or else somehow injecting enough Python to run Toil into the user container.

Without being able to run Toil code against the same volumes, for WDL tasks where the output files need to be dynamically determined, we'd need to upload everything that potentially could be an output to shared storage and then go through it to identify the actual outputs.

uniqueg commented 1 month ago

@MattMcL4475: Please also note that, as part of the core TES specs, TES compliant clients will rely on multiple executors to be supported. While so far we are making use of multiple executors to support individual server-side solutions (and so are not affected by TES-on-Azure's non-compliance), there is at least one project that we are working on (supporting Crypt4GH files in TES) that uses a middleware approach relying on multiple executors. But since we believe that the middleware-executor pattern is powerful for supercharging any TES implementation with individual functionalities, I think there will be more of those use cases that TES-on-Azure, without support for multiple executors, would miss out on.

MattMcL4475 commented 1 month ago

Thank you @adamnovak and @uniqueg for the feedback, we've made this a top priority and we'll have it implemented and released in 5.4.4 this week. We also noticed that with this change, the repo will pass 100% of the TES 1.1 compliance tests (currently it passes 100% of TES 1.0)! So it will be great to get this done.

Please keep the feedback coming if you see any high-priority features/bugs, thank you!