include-dcc / DMC_v3_tasks

Issues for DMC v3 project board

Automate scheduling, triggering and monitoring of workflows on CAVATICA #29

Open ByroneCole-SageBionetworks opened 1 year ago

ByroneCole-SageBionetworks commented 1 year ago
thomasyu888 commented 1 year ago

We reached out to CHOP (Eric, Allison, Yuankun) with our questions about the CHOP BixOps. Here is the link to the Slack thread: https://teaminclude.slack.com/archives/C03K4BHD3QC/p1673973750582959

thomasyu888 commented 1 year ago

On the datahops call on 3/23/2023, we stated that Sage demonstrated the ability to launch and monitor workflows on Cavatica using Orca, so we are calling this particular portion done.

The rest of the bullet points are BONUS, but we will continue to try to tackle them.

thomasyu888 commented 1 year ago

For the V3 tech plan, I am including a draft SOP for automating data processing, intended for V4 and the future of this project.

Data Processing with Orca

The genomic data processing for INCLUDE occurs within the Cavatica platform, which is developed and maintained by Velsera. The CHOP team has developed many bioinformatics operations (BixOps) pipelines that capture all the steps before and after processing on Cavatica. The Sage Bionetworks team is responsible for learning this process and automating as much of it as possible. The bulk of the work is in learning the BixOps and then attempting to automate each step.

This is a high-level flow chart of the steps required to leverage Orca for genomic data processing in INCLUDE. (Lucid link)

[Image: flow chart of the steps for genomic data processing with Orca]

At a high level, the process follows the steps below. Depending on the complexity and completeness of a given BixOps workflow, we estimate that fully automating it, if deemed feasible, will take around a quarter.

  1. Learn the CHOP BixOps of any particular workflow and manually trigger the processing.
    1. Manually set up Cavatica project with KF app and reference files
    2. Manually upload and prepare dataset for processing
    3. Process the dataset with given KF App
    4. Set up delivery Cavatica project and populate with workflow input / output
  2. Attempt to automate the process by stringing the steps together in a Python script
  3. If automation is possible, the steps might look like:
    1. Notify CHOP of new dataset
    2. Each step in the CHOP BixOps
    3. Execute workflow on subset of data
    4. Validate output
    5. Execute workflow on production dataset
    6. Validate output
    7. Notify collaborators
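
As a rough illustration of "stringing together steps via Python script", here is a minimal sketch of how the automated steps above could be chained. It does not use Orca or the Cavatica API: every function name here (`notify`, `prepare_project`, `execute_workflow`, `validate_output`, `run_pipeline`, `build_steps`) is hypothetical, and each step is a placeholder that only records what it would have done. The point is the ordering, and that a failed validation on the subset stops the run before the production dataset is touched.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Step = Tuple[str, Callable[[Context], None]]


class ValidationError(Exception):
    """Raised when a workflow run does not produce the expected output."""


def notify(recipient: str) -> Callable[[Context], None]:
    # Placeholder: a real step might post to Slack or send an email.
    def step(ctx: Context) -> None:
        ctx.setdefault("notified", []).append(recipient)
    return step


def prepare_project(ctx: Context) -> None:
    # Placeholder for the CHOP BixOps steps (project setup, upload, etc.).
    ctx["project_ready"] = True


def execute_workflow(dataset_key: str) -> Callable[[Context], None]:
    # Placeholder: a real step would launch a task on Cavatica and wait for it.
    def step(ctx: Context) -> None:
        ctx["outputs"] = [f"{name}.processed" for name in ctx[dataset_key]]
    return step


def validate_output(dataset_key: str) -> Callable[[Context], None]:
    # Assumed check: one output per input. Real validation would be richer.
    def step(ctx: Context) -> None:
        if len(ctx.get("outputs", [])) != len(ctx[dataset_key]):
            raise ValidationError(f"unexpected output count for {dataset_key}")
    return step


def build_steps() -> List[Step]:
    return [
        ("notify CHOP of new dataset", notify("CHOP")),
        ("run CHOP BixOps steps", prepare_project),
        ("execute workflow on subset", execute_workflow("subset")),
        ("validate subset output", validate_output("subset")),
        ("execute workflow on production dataset", execute_workflow("production")),
        ("validate production output", validate_output("production")),
        ("notify collaborators", notify("collaborators")),
    ]


def run_pipeline(steps: List[Step], ctx: Context) -> List[str]:
    """Run steps in order; any exception stops the run at that step."""
    completed: List[str] = []
    for name, func in steps:
        func(ctx)
        completed.append(name)
    return completed
```

Calling `run_pipeline(build_steps(), {"subset": [...], "production": [...]})` returns the names of the steps that completed. Keeping each step as a plain callable would let the placeholders be swapped for real Cavatica calls one at a time as each BixOps step is learned.
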
briandoconnor commented 1 year ago

Only comment, think about capacity... given our funding, how many "Execute Orca recipe on entire dataset" can we do in a given quarter?

briandoconnor commented 1 year ago

Other thing... think about adding a flow step for "trigger notification of CHOP"... and adding status back to DFA as that's deployed for INCLUDE