The output paths for the files created by the CramQC stage of the Seqr Loader and Large Cohort pipelines are generated by a function which is used by the stage's expected_outputs method.
The function creates an output path for each of the many CramQC outputs, which are then aggregated in the CramMultiQC stage. Importantly, the function and the related classes Qc and QcOut contain keys which are referenced by the CramMultiQC stage and are required by the MultiQC report aggregation tool.
The outputs for each QC metric and sequencing group are written to paths of the form:
gs://cpg-dataset-main/qc/<qc_metric>/CPGXXXXX.<qc_metric>
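As a rough illustration of the path format above, a path-generating function along these lines would return one path per metric, keyed by metric name. This is a minimal sketch, not the pipeline's actual code: the function name qc_output_paths and the metric names are placeholders, and the real function works with the Qc and QcOut classes, which are not reproduced here.

```python
# Placeholder metric names, for illustration only; the real CramQC stage
# defines its own set of metrics.
EXAMPLE_METRICS = ("metric_a", "metric_b")


def qc_output_paths(
    bucket: str,
    sequencing_group_id: str,
    metrics: tuple[str, ...] = EXAMPLE_METRICS,
) -> dict[str, str]:
    """Return {metric: path} following <bucket>/qc/<metric>/<sg_id>.<metric>."""
    return {
        metric: f"{bucket}/qc/{metric}/{sequencing_group_id}.{metric}"
        for metric in metrics
    }


paths = qc_output_paths("gs://cpg-dataset-main", "CPGXXXXX")
print(paths["metric_a"])
# prints: gs://cpg-dataset-main/qc/metric_a/CPGXXXXX.metric_a
```

Each metric lands in its own directory under qc/, so the number of directories grows with the number of metrics.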
Suggestions
Record these QC files in Metamist. Currently, these CramQC output files exist in the bucket but are not recorded in Metamist. When they are aggregated by the MultiQC stage, the qc_functions path-generating function is used to check whether the files exist for each sequencing group at the expected bucket path. If Metamist is to be the source of truth, then we should log these outputs there, and check for their existence there.
Remove the <qc_metric> from the bucket path prefix. Since the metric name is already in the file extension, I don't think it needs to be in the path prefix as well.
Perhaps the prefix could instead be "cramqc" or something similar to indicate which stage the output came from, as is standard across most of the other stage outputs.
The benefit of this is that it would be easier to find and move QC files as required (e.g. in circumstances like this).
Otherwise, we have to visit 8 separate directories to collect all the QC files.
Note that we should be careful to maintain the functionality of these outputs and their "keys" as defined by the QcOut class, so as to not impact the MultiQC reports, which rely on the keys to construct the report.
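A sketch of what the proposed layout could look like, keeping the dictionary keys unchanged so that anything which looks outputs up by key is unaffected. Everything here is an assumption for illustration: the function names, the placeholder metric names, and the choice to put "cramqc" in place of <qc_metric> under qc/ (the prefix could equally sit elsewhere in the bucket).

```python
from posixpath import dirname

# Placeholder metric names, for illustration only.
EXAMPLE_METRICS = ("metric_a", "metric_b")


def current_paths(bucket: str, sg_id: str, metrics=EXAMPLE_METRICS) -> dict[str, str]:
    """Current layout: one directory per metric."""
    return {m: f"{bucket}/qc/{m}/{sg_id}.{m}" for m in metrics}


def proposed_paths(
    bucket: str, sg_id: str, metrics=EXAMPLE_METRICS, prefix: str = "cramqc"
) -> dict[str, str]:
    """Proposed layout: a single stage-named directory; the metric stays in the extension."""
    return {m: f"{bucket}/qc/{prefix}/{sg_id}.{m}" for m in metrics}


current = current_paths("gs://cpg-dataset-main", "CPGXXXXX")
proposed = proposed_paths("gs://cpg-dataset-main", "CPGXXXXX")

# The keys are identical in both layouts; only the path values change.
assert current.keys() == proposed.keys()

# Count how many directories must be visited to collect one sequencing group's files.
current_dirs = {dirname(p) for p in current.values()}
proposed_dirs = {dirname(p) for p in proposed.values()}
print(len(current_dirs), len(proposed_dirs))
```

With the two placeholder metrics this prints "2 1"; in the current layout the first number equals the number of metrics, while in the proposed layout it stays at one.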
It would also be really useful to be able to add comments for various QC errors that are in Metamist, e.g. "emailed collaborator about this error xx/xx/xxxx, they will re-sequence the sample".