Closed evanbiederstedt closed 5 years ago
Will need to make roslin-qc-related scripts:
- cdna contam
- hotspots in normals
- cutadapt summary
- minor contam
- gcbias metrics
- insert size histogram
- hsmetrics
- markdups metrics
I will take point on this. It can be a lot of work.
It doesn't look like these are assay-specific, but I might be wrong. The *.py scripts Allan created look solid; we'll see if these do the trick. (Pray for Allan)
Hotspots in normals can't be done because it needs a pairing file (which we could try to hack in) and a fillout file (which we are not making).
We won't need cutadapt summary either, as we aren't doing clipping.
markdups metrics is covered here: https://github.com/mskcc/vaporware/issues/390
So we don't need that either. hsmetrics is covered here: https://github.com/mskcc/vaporware/issues/389
I think the only things we need therefore are:
- cdna contam
- minor contam
- gcbias metrics
- insert size histogram
with the caveat that I'm not sure what @kpjonsson thinks about the hotspots in normals issue
EDIT: and the other caveat that I'm pretty sure the Roslin scripts will work for us...I don't think there's anything assay-specific here.
minor contam needs a fingerprint summary file. ROSLIN generates it with https://github.com/mskcc/roslin-qc/blob/master/analyze_fingerprint.py, which requires a pairing file and a grouping file. I am still not sure if we can easily pass around a pairing file, but I know for sure we're not making a grouping file.
Unless there's another way to make a fingerprint summary, we can't do minor contam with this method.
gcbias metrics requires making hsmetrics files upstream.
I'm seeing a lot of missing processes that will need to be implemented, mostly calls to Picard. These will have to be added for gcbias metrics and insert size histogram.
It will need to be a process like what's implemented in https://github.com/mskcc/roslin-variant/blob/2.5.x/setup/cwl/modules/sample/gather-metrics-sample.cwl
> EDIT: and the other caveat that I'm pretty sure the Roslin scripts will work for us...I don't think there's anything assay-specific here.
It's not about "assay-specific" - it's about whether or not we even have the files the scripts are expecting.
We'll need to add processes which call GATK4 (which uses Picard):
It looks like we need:
> It's not about "assay-specific" - it's about whether or not we even have the files the scripts are expecting.
Precisely, which depends on the assay. That being said, I think the scripts @allanbolipata has will work on these outputs.
https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.1.0/picard_analysis_CollectGcBiasMetrics.php
https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.7.0/picard_analysis_CollectAlignmentSummaryMetrics.php
https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_coverage_DepthOfCoverage.php
Fun fun fun
Some comments, partially rehashing what has already been said:
cdna contam
I don't understand the [roslin-qc](https://github.com/mskcc/roslin-qc/blob/master/create_cdna_contam.py) script. ~Is @timosong the right person to ask about this?~
According to @timosong this script is currently only run on Delly output from IMPACT/Hemepact, not exomes. It "looks for deletion events that only occur at splice sites [...]. We’re assuming that it is a low chance of happening as an actual mutation, and tag that as possible cDNA contamination".
hotspots in normals
I think this is a DMP-ported QC metric that should not be essential for this pipeline. However, it could be done fairly easily by spiking the hotspots into the positions that Conpair genotypes, or by genotyping them separately.
minor contam
Also a DMP-ported QC metric that we should incorporate. This is estimated from minor allele frequencies at heterozygous SNPs. We should be able to get this out of Conpair. We should also be able to derive major contam from Conpair output; this is based on the fraction of heterozygous SNPs in a sample. Note that, I think, both of these metrics are based on a predefined list of fingerprinting SNPs; maybe Roslin uses some version of this.
gcbias metrics
Probably sufficient with Alfred output?
insert size histogram
Probably sufficient with Alfred output?
hsmetrics
Not sure which numbers from this output are used that are not in the Alfred output.
markdups metrics
Probably sufficient with Alfred/GATK MarkDuplicates output?
FYI: There's example output from Alfred here: https://gear.embl.de/alfred
RE: cdna contam
Let's not use this.
RE: gcbias metrics, insert size histogram
This is definitely within Alfred:
> In addition to standard QC metrics such as GC bias, base composition, insert size and sequencing coverage distributions it supports haplotype-aware and allele-specific feature counting and feature annotation.
Apologies; I took Barry too literally. So we won't be using Picard for this.
RE: hsmetrics
The way I currently understand this is that there are two relevant Picard tools:
For WES data, there is https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.5.1/picard_analysis_directed_CollectHsMetrics.php
For WGS data, there is https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.6.0/picard_analysis_CollectWgsMetrics.php
However, for WGS, I don't see any metrics not given by Alfred. For WES, I don't think Alfred does this, so I think we'll need to use CollectHsMetrics from GATK4 (which now bundles Picard) only for WES inputs. We can use a when statement.
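As a rough illustration of the WES-only gating described above, here is a minimal Python sketch. It is not the pipeline's actual process definition, which would use a workflow-level `when` condition. The function name `hs_metrics_command`, the assay labels and the file names are hypothetical, and the command assumes the `gatk` wrapper exposes Picard's CollectHsMetrics with its `--INPUT`, `--OUTPUT`, `--BAIT_INTERVALS` and `--TARGET_INTERVALS` arguments.

```python
from typing import List, Optional


def hs_metrics_command(assay: str, bam: str, out: str,
                       baits: str, targets: str) -> Optional[List[str]]:
    """Build a CollectHsMetrics command for WES input; return None otherwise.

    The assay labels ("wes", "exome") are assumptions made for this sketch.
    """
    # Only hybrid-capture (exome) data gets CollectHsMetrics; WGS is skipped
    # because Alfred already reports the relevant coverage metrics.
    if assay.lower() not in {"wes", "exome"}:
        return None
    return [
        "gatk", "CollectHsMetrics",
        "--INPUT", bam,
        "--OUTPUT", out,
        "--BAIT_INTERVALS", baits,
        "--TARGET_INTERVALS", targets,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(hs_metrics_command("wes", "tumor.bam", "tumor.hs_metrics.txt",
                             "baits.interval_list", "targets.interval_list"))
    print(hs_metrics_command("wgs", "tumor.bam", "tumor.hs_metrics.txt",
                             "baits.interval_list", "targets.interval_list"))
```

Returning `None` for non-WES input mirrors a skipped process: downstream aggregation then just has no hsmetrics file for that sample.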
RE: markdups metrics
> Probably sufficient with Alfred/GATK MarkDuplicates output?
Yes, we are already doing the exact same thing with MarkDuplicates.
RE: minor contam
> Also a DMP-ported QC metric that we should incorporate. This is estimated from minor allele frequencies at heterozygous SNPs. We should be able to get this out of Conpair. We should also be able to derive major contam from Conpair output; this is based on the fraction of heterozygous SNPs in a sample. Note that, I think, both of these metrics are based on a predefined list of fingerprinting SNPs; maybe Roslin uses some version of this.
I haven't looked closely at the Conpair outputs, or what Roslin does.
My idea is to write an Rmarkdown script (or something similar) that aggregates QC data across samples and produces graphical and tabular output.
> My idea is to write an Rmarkdown script (or something similar) that aggregates QC data across samples and produces graphical and tabular output.
That's what I have in mind. Every "per patient" stats can be a tsv (a tabular txt works). Nifty.
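To make the aggregation idea concrete, here is a minimal sketch, written in Python rather than Rmarkdown, of collapsing per-sample QC values into one tab-separated table. The function name `aggregate_qc`, the metric names and the sample IDs are all hypothetical; a real version would read the Alfred, MarkDuplicates and Conpair outputs instead of an in-memory dict.

```python
import csv
import io
from typing import Dict


def aggregate_qc(per_sample: Dict[str, Dict[str, float]]) -> str:
    """Return a TSV with one row per sample and one column per metric."""
    # Union of all metric names, sorted so the column order is stable.
    metrics = sorted({name for values in per_sample.values() for name in values})
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample"] + metrics)
    for sample in sorted(per_sample):
        values = per_sample[sample]
        # Metrics missing for a sample are written as "NA".
        writer.writerow([sample] + [values.get(name, "NA") for name in metrics])
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(aggregate_qc({
        "sample_B": {"mean_coverage": 80.5},
        "sample_A": {"mean_coverage": 100.0, "dup_rate": 0.1},
    }))
```

A flat table like this can then feed whatever produces the graphical output.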
Re: the fingerprinting SNPs, from the Conpair paper:
Selection of informative genomic markers (GRCh37/hg19) (Supplement, p. 1-2)
The selected 7387 markers meet the following criteria:
- SNVs (easier to genotype from sequencing data)
- exonic (to allow comparison of exome and WGS samples)
- located on autosomes (to have estimates that are consistent across both sexes)
- minor allele frequency (MAF) ≥ 40%, estimated across all populations in the 1000 Genomes Project (Consortium, 2012), phase 3 dataset
- linkage disequilibrium (LD) between any two markers < 0.8
From my memory, this reasoning is similar to that for the selection of fingerprinting sites from the DMP.
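The per-site criteria quoted above can be sketched as a simple filter. This is an illustration only, not Conpair's code: the `Marker` record and `passes_site_criteria` are hypothetical names, and the pairwise LD < 0.8 criterion is deliberately left out because it needs LD values between markers rather than per-site annotations.

```python
from dataclasses import dataclass

# Autosomes only, so estimates are consistent across both sexes.
AUTOSOMES = {str(n) for n in range(1, 23)}


@dataclass
class Marker:
    chrom: str       # chromosome name without a "chr" prefix (assumption)
    pos: int         # 1-based position
    is_snv: bool     # single-nucleotide variant
    is_exonic: bool  # falls within an exon
    maf: float       # minor allele frequency, between 0 and 0.5


def passes_site_criteria(marker: Marker, min_maf: float = 0.4) -> bool:
    """Check the SNV, exonic, autosomal and MAF >= 40% criteria."""
    return (
        marker.is_snv
        and marker.is_exonic
        and marker.chrom in AUTOSOMES
        and marker.maf >= min_maf
    )


if __name__ == "__main__":
    # Made-up candidate sites, for illustration only.
    candidates = [
        Marker("1", 1000, True, True, 0.45),
        Marker("X", 2000, True, True, 0.48),  # fails: not an autosome
        Marker("2", 3000, True, True, 0.30),  # fails: MAF below 40%
    ]
    print([m.pos for m in candidates if passes_site_criteria(m)])
```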
From Ahmet:
we have a list of 1024 SNPs specifically tiled in the IMPACT panel that were selected to be MAF > 50% by Mike years ago (similar to conpair but without the restrictions of them being exonic since they are captured). they’re randomly selected across the chromosomes and help with copy number analysis too
minor contam is actually calculated at homozygous sites in the normal, and we look for presence of alternate alleles in the tumor at these sites. the alternate allele presence should not be more than sequencing error which is around 1%. If a case has on average > 2% alternate alleles, then it’s considered to have minor contamination and we adjust the mutation filters accordingly
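A minimal sketch of the calculation described above, assuming we already have tumor read counts at sites that are homozygous in the normal. The function name `minor_contamination` and the input layout are hypothetical, and this is not the DMP implementation; only the averaging and the 2% cutoff come from the description.

```python
from typing import Iterable, Tuple


def minor_contamination(tumor_counts: Iterable[Tuple[int, int]],
                        threshold: float = 0.02) -> Tuple[float, bool]:
    """Average the alternate-allele fraction across sites and flag the sample.

    tumor_counts holds (expected_allele_depth, alternate_allele_depth) pairs
    for the tumor at sites that are homozygous in the matched normal.
    """
    fractions = []
    for expected, alternate in tumor_counts:
        depth = expected + alternate
        if depth == 0:
            continue  # no coverage at this site, nothing to measure
        fractions.append(alternate / depth)
    if not fractions:
        raise ValueError("no covered homozygous sites to evaluate")
    mean_fraction = sum(fractions) / len(fractions)
    # More than 2% alternate alleles on average is treated as minor contamination.
    return mean_fraction, mean_fraction > threshold


if __name__ == "__main__":
    # Made-up read counts, for illustration only.
    print(minor_contamination([(99, 1), (95, 5)]))
    print(minor_contamination([(99, 1), (100, 0)]))
```

Averaging per-site fractions, rather than pooling reads, keeps one deeply covered site from dominating the estimate; whether the DMP does it this way is not stated here.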
So I think @allanbolipata need not worry. Now that we've walked through this, I think this is basically all done. Naturally, we'll have to wrangle the outputs into something pretty, but I can do that with @kpjonsson.
I'll add the process CollectHsMetrics for GATK4; I already have a branch going.
OK, perfect. I'll remove ROSLIN-QC stuff as it's become clear it's actually not needed.
I'll make a process that will serve as a placeholder and put in the input files that I think we'll need. Then we can continue adding from there.
Sounds like a good start.
Note here:
> There's example output from Alfred here: https://gear.embl.de/alfred
I like this layout better than two ggplot2 plots squashed within a single page.
I believe this issue served its purpose.
"""All run-level QC we do now post-pipeline (historical IMPACT/Roslin/lab)"""
What would this look like, @nikhil @allanbolipata?