See the context here in the background of the scoping document
Summary
This pull request proposes separating the alignment and genotyping stages into two distinct workflows within the production pipelines. The changes aim to improve flexibility and control, and to let future changes reach parity with industry best practices, particularly those established by Illumina's DRAGEN hardware. Separating alignment and genotyping from downstream workflows should also make version control easier to implement once it is introduced.
Proposed Changes
Separation of Pipelines:
Alignment and Genotyping: Split into standalone pipelines to prevent versioning conflicts and ensure consistent inputs.
Modularity: Allows independent updates and optimisations for each workflow without mutual disruption.
Pipeline Starting Points:
Detection of CRAM and gVCF Files: Adjust the pipelines to emit clear and informative error messages when required data is missing, and allow the genotyping pipeline to be triggered manually as needed.
Resource Dependencies: Replace stage dependencies with resource dependencies. The genotyping pipeline requires a CPG-processed CRAM in Metamist before it can run, and the alignment pipeline requires either FASTQ or CRAM files registered in Metamist.
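The resource-dependency check described above could look roughly like the sketch below. It is illustrative only and is not the code in this PR: `SequencingGroup`, `MissingInputError`, and both check functions are hypothetical names, and the records are plain in-memory objects rather than the result of a real Metamist query.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class MissingInputError(RuntimeError):
    """Raised when a pipeline's required input resources are not registered."""


@dataclass
class SequencingGroup:
    # Hypothetical record; in practice these fields would be looked up in Metamist.
    id: str
    source_files: List[str] = field(default_factory=list)  # registered FASTQ or CRAM inputs
    processed_cram: Optional[str] = None  # CPG-processed CRAM, if one exists


def check_alignment_inputs(groups: List[SequencingGroup]) -> None:
    """Fail early unless every sequencing group has FASTQ or CRAM inputs registered."""
    missing = [g.id for g in groups if not g.source_files]
    if missing:
        raise MissingInputError(
            "Cannot start alignment: no FASTQ or CRAM registered for "
            + ", ".join(missing)
        )


def check_genotyping_inputs(groups: List[SequencingGroup]) -> None:
    """Fail early unless every sequencing group has a CPG-processed CRAM."""
    missing = [g.id for g in groups if not g.processed_cram]
    if missing:
        raise MissingInputError(
            "Cannot start genotyping: no CPG-processed CRAM for "
            + ", ".join(missing)
            + ". Run the alignment pipeline for these sequencing groups first."
        )
```

Checking the resource rather than the upstream stage means each pipeline can report exactly which sequencing groups block it, instead of silently scheduling the missing stage.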
Repository Structure:
Current Limitations: The existing structure is not conducive to navigating and understanding independent pipelines.
Proposed Structure: Consider separate repositories for the production pipelines API and the pipelines themselves, with clear folder structures for individual pipelines and shared resources. This PR implements an interim folder structure for the alignment and genotyping pipelines.
Considerations
Integration with Custom Cohorts: Ensure that updates to cohorts with additional samples or sequencing groups are managed without disrupting the pipeline.
User Responsibilities: Users must ensure that samples in custom cohorts have the required gVCF files before running the pipeline.
Further Discussion: Topics such as repository restructuring and shared resource versioning require further discussion.
Version control: Yet to be defined, but this separation should help future efforts in this area.
Breaking continuity: Production-pipelines is designed to automatically trigger stages when their inputs do not exist. Extracting the alignment and genotyping pipelines from downstream workflows breaks this continuity. Users will need to manually trigger the alignment and/or genotyping pipelines so that all sequencing groups have the correct inputs for downstream analysis. Although this runs contrary to the design logic of production-pipelines, it keeps the pipelines separate for future version control efforts and prevents erroneous pipeline runs that lack the correct inputs for all samples.
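Because upstream runs now have to be triggered by hand, a small planning helper could tell users which sequencing groups in a cohort still need alignment or genotyping. The sketch below is an assumption about how that might work, not part of this PR: `plan_upstream_runs` and the `cram`/`gvcf` keys are hypothetical, and the cohort is a plain dictionary standing in for whatever Metamist returns.

```python
from typing import Dict, List


def plan_upstream_runs(cohort: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Report which sequencing groups need alignment and/or genotyping.

    `cohort` maps a sequencing group ID to the files already available for it,
    using the hypothetical keys "cram" and "gvcf".
    """
    needs_alignment: List[str] = []
    needs_genotyping: List[str] = []
    for sg_id, files in sorted(cohort.items()):
        if files.get("gvcf"):
            continue  # already has the input downstream workflows expect
        if not files.get("cram"):
            needs_alignment.append(sg_id)  # no CRAM yet, so align first
        needs_genotyping.append(sg_id)  # no gVCF, so genotyping is required
    return {"alignment": needs_alignment, "genotyping": needs_genotyping}


cohort = {
    "SG_A": {"cram": "a.cram", "gvcf": "a.g.vcf.gz"},
    "SG_B": {"cram": "b.cram"},
    "SG_C": {},
}
plan = plan_upstream_runs(cohort)
# plan == {"alignment": ["SG_C"], "genotyping": ["SG_B", "SG_C"]}
```

Running a check like this before a downstream workflow makes the manual step explicit: anything listed under "alignment" must be aligned and then genotyped, and anything listed only under "genotyping" just needs a gVCF.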