Open vivbak opened 3 years ago
Current thinking:
Potential Pre-reqs: Manifest status to have 3 additional options - 'processed', 're-process', 'processing'
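A rough sketch of how the status options could be modelled, in case it helps the discussion. The three new values come from the line above and 'uploaded' from the steps below. The transition table (including 'processed' -> 're-process') and the helper name `advance` are assumptions, not something decided in this issue.

```python
from enum import Enum


class Status(str, Enum):
    UPLOADED = "uploaded"        # already referenced in the pre-QC steps
    PROCESSING = "processing"    # proposed
    PROCESSED = "processed"      # proposed
    RE_PROCESS = "re-process"    # proposed


# Assumed legal transitions; adjust once the workflow is agreed.
ALLOWED = {
    Status.UPLOADED: {Status.PROCESSING},
    Status.RE_PROCESS: {Status.PROCESSING},
    Status.PROCESSING: {Status.PROCESSED},
    Status.PROCESSED: {Status.RE_PROCESS},
}


def advance(current, new):
    """Return the new status, or raise if the change is not in ALLOWED."""
    current, new = Status(current), Status(new)
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal status change: {current.value} -> {new.value}")
    return new
```

With this, `advance("uploaded", "processing")` succeeds while `advance("processed", "processing")` raises, which would catch a sample being picked up twice.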
Input:
Output:
{mt}: MatrixTable for new samples to be added to
csv: CSV file of new samples to be added/updated
Steps (pre-QC)
Connect to Manifest
Record state of manifest -> {current_snapshot}
Identify changed/new samples (samples with status 'uploaded' or 're-process') -> {updated_samples}
Change status of each sample in {updated_samples} to 'processing' (?)
For each sample in {updated_samples}, determine the MT predecessor (i.e. the most recent MT that does not contain the sample) -> {mt_options}
Calculate the min of {mt_options} -> {mt}
Calculate the diff of {current_snapshot} and {mt} -> {to_be_processed}
Map the samples in {to_be_processed} to their locations in the GCS bucket -> csv
Return {mt} and csv as inputs to the rest of the joint calling workflow.
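The pre-QC steps above can be sketched in plain Python to check the logic before wiring it up to the real manifest and Hail. Everything here is a stand-in: the manifest is a dict rather than the actual table, MTs are represented as integer versions mapped to sets of sample IDs, and the names `plan_pre_qc`, `predecessor` and the example bucket paths are hypothetical. It also assumes MT versions are cumulative (each one contains all samples of the previous one) and reads "min" as the earliest version.

```python
import copy
import csv
import io

PENDING = {"uploaded", "re-process"}


def predecessor(sample_id, mt_versions):
    """Most recent MT version that does not contain the sample, else None."""
    candidates = [v for v, samples in mt_versions.items() if sample_id not in samples]
    return max(candidates) if candidates else None


def plan_pre_qc(manifest, mt_versions):
    """Return (mt_version, csv_text) for the joint calling workflow.

    manifest:    {sample_id: {"status": str, "path": str}}  (stand-in)
    mt_versions: {version: set of sample_ids}               (stand-in)
    """
    # Record the state of the manifest before touching it.
    snapshot = copy.deepcopy(manifest)

    # Identify changed/new samples and mark them as in progress.
    updated = sorted(s for s, row in snapshot.items() if row["status"] in PENDING)
    if not updated:
        return None, ""
    for s in updated:
        manifest[s]["status"] = "processing"

    # One MT predecessor per sample, then take the earliest of them.
    mt_options = [predecessor(s, mt_versions) for s in updated]
    # Assumption: no usable predecessor means starting from an empty MT.
    mt = None if None in mt_options else min(mt_options)
    base = mt_versions[mt] if mt is not None else set()

    # Everything in the snapshot that the chosen MT does not already hold.
    to_be_processed = sorted(set(snapshot) - base)

    # Map samples to their bucket locations.
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sample_id", "gcs_path"])
    for s in to_be_processed:
        writer.writerow([s, snapshot[s]["path"]])
    return mt, buf.getvalue()


# Hypothetical example data.
manifest = {
    "S1": {"status": "processed", "path": "gs://example-upload/S1.g.vcf.gz"},
    "S2": {"status": "re-process", "path": "gs://example-upload/S2.g.vcf.gz"},
    "S3": {"status": "processed", "path": "gs://example-upload/S3.g.vcf.gz"},
    "S4": {"status": "uploaded", "path": "gs://example-upload/S4.g.vcf.gz"},
}
mt_versions = {1: {"S1"}, 2: {"S1", "S2"}, 3: {"S1", "S2", "S3"}}

mt, csv_text = plan_pre_qc(manifest, mt_versions)
print(mt)
print(csv_text)
```

Note that in the example S3 ends up in the CSV even though its status did not change, because it was added after the MT chosen as the starting point. Whether that is the intended behaviour of the diff step is worth confirming.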
Steps (Post-QC)
WIP Data Model https://lucid.app/lucidchart/invitations/accept/f0f24fd0-44d0-43a3-9119-9d5c68b1631e
Background
Sequencing providers, using a service account, will upload a range of files into the gs://cpg-$STACK-upload bucket. These files include CRAM files, gVCF files, etc.* These files will need to be processed into appropriate buckets for further downstream analysis & archival storage.
WIP: https://lucid.app/lucidchart/invitations/accept/8f56b7e5-6be5-45f2-a2fc-518d48ce23ab
Functional Requirements
The upload processor pipeline should:
[Outdated] 2nd March
Version & move appropriate gVCF files into the gs://cpg-$STACK-main bucket.
Version & move all other files (e.g. CRAM files) into gs://cpg-$STACK-archive.
Validate QC run completion & perform a subsequent 'Clean Up' of the gs://cpg-$STACK-main bucket.
Update 3rd March
Inputs:
$STACK Airtable Table
QC Outputs & Exit Status**
Trigger: Run within a batch workflow, manually triggered.
Current Questions:
*Confirmation of all of the input files + organization, i.e. one folder per sample?
**Exploration into how QC outputs will impact the upload processor pipeline. How should that information feed back in?