populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Modernise all CramQC output paths and record them in Metamist #987

Open EddieLF opened 2 weeks ago

EddieLF commented 2 weeks ago

The output paths for the files created by the CramQC stage of the Seqr Loader and Large Cohort pipelines are built by a function which is used by the stage's expected_outputs method.

The function creates an output path for each of the many CramQC outputs, which are then aggregated in the CramMultiQC stage. Importantly, the function and the related classes Qc and QcOut contain keys which are referenced by the CramMultiQC stage and are required by the MultiQC report aggregation tool.

The outputs for each QC metric and sequencing group are written to paths of the form: gs://cpg-dataset-main/qc/<qc_metric>/CPGXXXXX.<qc_metric>
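As a rough illustration of this layout, the sketch below builds the expected paths for one sequencing group. The names `QC_OUTPUTS` and `current_qc_paths`, and the dictionary keys, are hypothetical stand-ins and are not the real Qc/QcOut definitions in the repository; only the directory names and file suffixes are taken from the paths quoted in this issue.

```python
from __future__ import annotations

# Hypothetical sketch of the current layout; not the pipeline's actual code.
# Each illustrative key maps to (directory name, file suffix) as quoted in this issue.
QC_OUTPUTS = {
    "verify_bamid": ("verify_bamid", ".verify-bamid.selfSM"),
    "samtools_stats": ("samtools_stats", ".samtools-stats"),
    "alignment_summary_metrics": ("alignment_summary_metrics", ".alignment_summary_metrics"),
    "base_distribution_by_cycle_metrics": (
        "base_distribution_by_cycle_metrics",
        ".base_distribution_by_cycle_metrics",
    ),
    "insert_size_metrics": ("insert_size_metrics", ".insert_size_metrics"),
    "quality_by_cycle_metrics": ("quality_by_cycle_metrics", ".quality_by_cycle_metrics"),
    "quality_yield_metrics": ("quality_yield_metrics", ".quality_yield_metrics"),
    "picard_wgs_metrics": ("picard_wgs_metrics", ".picard-wgs-metrics"),
}


def current_qc_paths(dataset_bucket: str, sg_id: str) -> dict[str, str]:
    """Return key -> path of the form <bucket>/qc/<directory>/<sg_id><suffix>."""
    return {
        key: f"{dataset_bucket}/qc/{directory}/{sg_id}{suffix}"
        for key, (directory, suffix) in QC_OUTPUTS.items()
    }


paths = current_qc_paths("gs://cpg-dataset-main", "CPGXXXXX")
for key, path in paths.items():
    print(key, path)
```

Each output lands in its own directory, so one sequencing group's QC files are spread over as many directories as there are metrics.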

Suggestions

  1. Record these QC files in Metamist. Currently, these CramQC output files exist in the bucket but are not recorded in Metamist. When they are aggregated by the MultiQC stage, the qc_functions path-generating function is used to check whether the files exist for each sequencing group at the expected bucket path. If Metamist is to be the source of truth, then we should log these outputs there and check for their existence there.

  2. Remove the <qc_metric> from the bucket path prefix. Since the metric name is already in the file extension, I don't think it needs to be in the path prefix as well.

Perhaps the prefix could instead be "cramqc" or something similar to indicate which stage the output came from, as is standard across most of the other stage outputs.

The benefit of this is that it would be easier to find and move QC files as required (e.g. in circumstances like this).

Otherwise, we have to visit 8 separate directories to collect all the QC files:

gs://cpg-dataset-main/qc/verify_bamid/CPGXXXXX.verify-bamid.selfSM
gs://cpg-dataset-main/qc/samtools_stats/CPGXXXXX.samtools-stats
gs://cpg-dataset-main/qc/alignment_summary_metrics/CPGXXXXX.alignment_summary_metrics
gs://cpg-dataset-main/qc/base_distribution_by_cycle_metrics/CPGXXXXX.base_distribution_by_cycle_metrics
gs://cpg-dataset-main/qc/insert_size_metrics/CPGXXXXX.insert_size_metrics
gs://cpg-dataset-main/qc/quality_by_cycle_metrics/CPGXXXXX.quality_by_cycle_metrics
gs://cpg-dataset-main/qc/quality_yield_metrics/CPGXXXXX.quality_yield_metrics
gs://cpg-dataset-main/qc/picard_wgs_metrics/CPGXXXXX.picard-wgs-metrics

Note that we should be careful to maintain the functionality of these outputs and their "keys" as defined by the QcOut class, so as to not impact the MultiQC reports, which rely on the keys to construct the report.
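To make suggestion 2 concrete, here is a minimal sketch of how the old and new locations could be paired up while leaving the keys untouched. Everything named here (`QC_OUTPUTS`, `plan_moves`, the key names, and the exact `cramqc` prefix placement) is an assumption for illustration, not the repository's real code or an agreed layout.

```python
from __future__ import annotations

# Hypothetical sketch; key names are illustrative, not the real QcOut keys.
# Each key maps to (current directory name, file suffix) as quoted in this issue.
QC_OUTPUTS = {
    "verify_bamid": ("verify_bamid", ".verify-bamid.selfSM"),
    "samtools_stats": ("samtools_stats", ".samtools-stats"),
    "alignment_summary_metrics": ("alignment_summary_metrics", ".alignment_summary_metrics"),
    "base_distribution_by_cycle_metrics": (
        "base_distribution_by_cycle_metrics",
        ".base_distribution_by_cycle_metrics",
    ),
    "insert_size_metrics": ("insert_size_metrics", ".insert_size_metrics"),
    "quality_by_cycle_metrics": ("quality_by_cycle_metrics", ".quality_by_cycle_metrics"),
    "quality_yield_metrics": ("quality_yield_metrics", ".quality_yield_metrics"),
    "picard_wgs_metrics": ("picard_wgs_metrics", ".picard-wgs-metrics"),
}


def plan_moves(
    dataset_bucket: str, sg_id: str, stage_prefix: str = "cramqc"
) -> dict[str, tuple[str, str]]:
    """Return key -> (old path, new path); the keys themselves are unchanged."""
    moves = {}
    for key, (directory, suffix) in QC_OUTPUTS.items():
        old = f"{dataset_bucket}/qc/{directory}/{sg_id}{suffix}"
        # Assumed new layout: a single stage-level directory, metric kept in the suffix.
        new = f"{dataset_bucket}/{stage_prefix}/{sg_id}{suffix}"
        moves[key] = (old, new)
    return moves


moves = plan_moves("gs://cpg-dataset-main", "CPGXXXXX")
for key, (old, new) in moves.items():
    print(f"{key}: {old} -> {new}")
```

Because every suffix is distinct, the files for one sequencing group do not collide when placed in a single directory, and anything that looks outputs up by key would see the same set of keys as before.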

SamBryen commented 2 weeks ago

It would also be really useful to be able to add comments for the various QC errors that are in Metamist, e.g. "emailed collaborator about this error xx/xx/xxxx, they will re-sequence the sample".