populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Extract 'Alignment' and 'Genotype' to be separate independent pipelines #783

Open michael-harper opened 4 weeks ago

michael-harper commented 4 weeks ago

See the context in the background of the scoping document.

Summary

This pull request proposes separating the alignment and genotyping stages into two distinct workflows within the production pipelines. The changes aim to improve flexibility and control, and to make it easier to keep future changes aligned with industry best practices, particularly those established by Illumina's DRAGEN hardware. Version control, once introduced, will also be easier to implement when alignment and genotyping are separated from downstream workflows.

Proposed Changes

  1. Separation of Pipelines:
  2. Pipeline Starting Points:

    • Detection of CRAM and gVCF Files: Adjust the pipelines to emit clear, informative error messages when expected CRAM or gVCF files are missing, and allow the genotyping pipeline to be triggered manually as needed.
    • Resource Dependencies: Replace stage dependencies with resource dependencies. The genotyping pipeline requires a CPG-processed CRAM in Metamist before it can run, and the alignment pipeline requires either FASTQ or CRAM files registered in Metamist.
  3. Repository Structure:

    • Current Limitations: The existing structure makes it hard to navigate and understand independent pipelines.
    • Proposed Structure: Consider separate repositories for the production pipelines API and the actual pipelines, with clear folder structures for individual pipelines and shared resources.
    • This PR will provide an interim folder structure for the alignment and genotyping pipelines.
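
To make the resource-dependency idea above concrete, here is a minimal sketch of how each pipeline could check its own inputs before starting, rather than relying on an upstream stage having run. Everything in it is an assumption for illustration: `SequencingGroupResources`, `MissingResourceError`, `check_alignment_inputs` and `check_genotyping_inputs` are hypothetical names, and the snapshot is a plain in-memory object, not a call into the real Metamist API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class MissingResourceError(Exception):
    """Raised when a pipeline's required input resources are not registered."""


@dataclass
class SequencingGroupResources:
    """Hypothetical snapshot of what is registered for one sequencing group.

    In a real pipeline these values would be looked up in Metamist; here they
    are plain fields so the check logic can be shown in isolation.
    """

    sg_id: str
    fastq_paths: List[str] = field(default_factory=list)
    source_cram_path: Optional[str] = None  # CRAM supplied as raw input
    cpg_cram_path: Optional[str] = None  # CRAM produced by CPG alignment


def check_alignment_inputs(res: SequencingGroupResources) -> None:
    """Alignment needs either FASTQ files or a source CRAM."""
    if not res.fastq_paths and res.source_cram_path is None:
        raise MissingResourceError(
            f"{res.sg_id}: cannot start alignment, no FASTQ or CRAM files "
            "are registered for this sequencing group."
        )


def check_genotyping_inputs(res: SequencingGroupResources) -> None:
    """Genotyping needs a CPG-processed CRAM, however it was produced."""
    if res.cpg_cram_path is None:
        raise MissingResourceError(
            f"{res.sg_id}: cannot start genotyping, no CPG-processed CRAM "
            "is registered. Run the alignment pipeline first."
        )
```

With checks like these, the genotyping pipeline can be triggered manually for any sequencing group that already has a CPG-processed CRAM, and a missing input fails early with a message naming the sequencing group and the resource that is absent.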

Considerations