QC: Aggregated QC

evanbiederstedt commented 5 years ago

"""All run-level QC we do now post-pipeline (historical IMPACT/Roslin/lab)"""

What would this look like? @nikhil @allanbolipata

allanbolipata commented 5 years ago

Will need to make:

A container of roslin-qc-related scripts

An independent process that will perform what we can of

cdna contam
hotspots in normals
cutadapt summary
minor contam
gcbias metrics
insert size histogram
hsmetrics
markdups metrics

I will take point on this. It can be a lot of work.

evanbiederstedt commented 5 years ago

It doesn't look like these are assay-specific, but I might be wrong. The *py scripts Allan created look solid; we'll see if these do the trick. (Pray for Allan)

allanbolipata commented 5 years ago

Hotspots in normals can't be done because it needs a pairing file (which we could try to hack in) and a fillout file (which we are not making)

evanbiederstedt commented 5 years ago

We won't need cutadapt summary either, as we aren't doing clipping.

evanbiederstedt commented 5 years ago

markdups metrics is covered here: https://github.com/mskcc/vaporware/issues/390

So we don't need that either. hsmetrics is covered here: https://github.com/mskcc/vaporware/issues/389

I think the only thing we need therefore is:

cdna contam
minor contam
gcbias metrics
insert size histogram

with the caveat that I'm not sure what @kpjonsson thinks about the hotspots normal issue

EDIT: and the other caveat that I'm pretty sure the Roslin scripts will work for us...I don't think there's anything assay-specific here.

allanbolipata commented 5 years ago

minor contam needs a fingerprint summary file. ROSLIN generates it with https://github.com/mskcc/roslin-qc/blob/master/analyze_fingerprint.py, which requires a pairing file and a grouping file. I am still not sure if we can easily pass around a pairing file, but I know for sure we're not making a grouping file.

Unless there's another way to make a fingerprint summary, we can't do minor contam with this method.

allanbolipata commented 5 years ago

gcbias metrics requires making hsmetrics files upstream.

allanbolipata commented 5 years ago

I'm seeing a lot of missing processes that will need to be implemented, mostly calls to Picard. These will have to be added for gcbias metrics and insert size histogram.

It will need to be a process like what's implemented in https://github.com/mskcc/roslin-variant/blob/2.5.x/setup/cwl/modules/sample/gather-metrics-sample.cwl

allanbolipata commented 5 years ago

> EDIT: and the other caveat that I'm pretty sure the Roslin scripts will work for us...I don't think there's anything assay-specific here.

It's not about "assay-specific" - it's about whether or not we even have the files the scripts are expecting.

evanbiederstedt commented 5 years ago

We'll need to add processes that call GATK4 (which uses Picard):

It looks like we need:

CollectAlignmentSummaryMetrics
DepthOfCoverage
CollectInsertSizeMetrics
CollectGcBiasMetrics

> It's not about "assay-specific" - it's about whether or not we even have the files the scripts are expecting.

Precisely, which depends on the assay. That being said, I think the scripts @allanbolipata has will work on these outputs.

https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.1.0/picard_analysis_CollectGcBiasMetrics.php
https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.7.0/picard_analysis_CollectAlignmentSummaryMetrics.php
https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_coverage_DepthOfCoverage.php

Fun fun fun

kpjonsson commented 5 years ago

Some comments, partially rehashing what has already been said:

cdna contam: I don't understand the [roslin-qc](https://github.com/mskcc/roslin-qc/blob/master/create_cdna_contam.py) script. ~Is @timosong the right person to ask about this?~ According to @timosong, this script is currently only run on Delly output from IMPACT/Hemepact, not exomes. It "looks for deletion events that only occur at splice sites [...]. We’re assuming that it is a low chance of happening as an actual mutation, and tag that as possible cDNA contamination".
hotspots in normals: I think this is a DMP-ported QC metric that, in my opinion, should not be essential for this pipeline. However, this could be done fairly easily by spiking hotspots into the positions that Conpair genotypes, or by genotyping them separately.
minor contam: Also a DMP-ported QC metric that we should incorporate. This is estimated from minor allele frequencies at heterozygous SNPs. We should be able to get this out of Conpair. We should also be able to derive major contam from Conpair output; this is based on the fraction of heterozygous SNPs in a sample. Note that, I think, both of these metrics are based on a predefined list of fingerprinting SNPs; maybe Roslin uses some version of this.
gcbias metrics: Probably sufficient with Alfred output?
insert size histogram: Probably sufficient with Alfred output?
hsmetrics: Not sure which numbers from this output are used that are not in the Alfred output.
markdups metrics: Probably sufficient with Alfred/GATK MarkDuplicates output?

FYI: There's example output from Alfred here: https://gear.embl.de/alfred

evanbiederstedt commented 5 years ago

RE: cdna contam

Let's not use this.

RE: gcbias metrics, insert size histogram

This is definitely within Alfred:

> In addition to standard QC metrics such as GC bias, base composition, insert size and sequencing coverage distributions it supports haplotype-aware and allele-specific feature counting and feature annotation.

Apologies; I took Barry too literally. So we won't be using Picard for this.

RE: hsmetrics

The way I currently understand this is that there are two Picard functions relevant:

For WES data, there is https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.5.1/picard_analysis_directed_CollectHsMetrics.php
For WGS data, there is https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.6.0/picard_analysis_CollectWgsMetrics.php

However, for WGS, I don't see any metrics not given by Alfred. For WES, I don't think Alfred covers this, so I think we'll need to use CollectHsMetrics from GATK4 (which now includes Picard) only for WES inputs. We can use a when statement.
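
The WES-only gating described here can be sketched in Python. This is only an illustration of the conditional logic, not the pipeline's actual process definition: the function name `hs_metrics_command` and the assay labels "exome" and "genome" are hypothetical, and the file paths in the usage note are placeholders.

```python
def hs_metrics_command(assay, bam, output, bait_intervals, target_intervals):
    """Return a CollectHsMetrics command line for exome inputs, else None.

    The assay labels ("exome", "genome") are hypothetical names chosen for
    this sketch; they mirror the idea of a `when` guard on the process.
    """
    if assay != "exome":
        # WGS inputs skip this step; the thread expects Alfred to cover them.
        return None
    return [
        "gatk", "CollectHsMetrics",
        "--INPUT", bam,
        "--OUTPUT", output,
        "--BAIT_INTERVALS", bait_intervals,
        "--TARGET_INTERVALS", target_intervals,
    ]
```

For example, `hs_metrics_command("genome", "s1.bam", "s1.hs.txt", "baits.interval_list", "targets.interval_list")` returns `None`, so no process would be launched for that sample.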

RE: markdups metrics

> Probably sufficient with Alfred/GATK MarkDuplicates output?

Yes, we are already doing the exact same thing with MarkDuplicates.
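
If the MarkDuplicates metrics file is reused for aggregated QC, reading it could look roughly like the sketch below. It assumes the usual Picard metrics layout (a `## METRICS CLASS` line, then a tab-separated header row, then data rows up to the next blank line); the function name `parse_picard_metrics` is hypothetical.

```python
def parse_picard_metrics(text):
    """Parse the metrics table of a Picard-style metrics file into dicts.

    Assumes the table starts on the line after '## METRICS CLASS' and ends
    at the first blank line; any histogram section after it is ignored.
    """
    lines = text.splitlines()
    rows = []
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            if i + 1 >= len(lines):
                break  # truncated file: no header row present
            header = lines[i + 1].split("\t")
            for row in lines[i + 2:]:
                if not row.strip():
                    break  # blank line marks the end of the metrics table
                rows.append(dict(zip(header, row.split("\t"))))
            break
    return rows
```

Values are returned as strings, so any numeric conversion is left to the aggregation step.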

RE: minor contam

> Also a DMP-ported QC metric that we should incorporate. This is estimated from minor allele frequencies at heterozygous SNPs. We should be able to get this out of Conpair. We should also be able to derive major contam from Conpair output; this is based on the fraction of heterozygous SNPs in a sample. Note that, I think, both of these metrics are based on a predefined list of fingerprinting SNPs; maybe Roslin uses some version of this.

I haven't looked closely at the Conpair outputs, or what Roslin does

kpjonsson commented 5 years ago

My idea is to write an Rmarkdown script (or something similar) that aggregates QC data across samples and produces graphical and tabular output.

evanbiederstedt commented 5 years ago

> My idea is to write an Rmarkdown script (or something similar) that aggregates QC data across samples and produces graphical and tabular output.

That's what I have in mind. Each set of "per patient" stats can be a TSV (a tab-delimited txt file works). Nifty.
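
As a rough sketch of that aggregation step (in Python rather than Rmarkdown, purely for illustration), per-sample stats could be merged into one tab-delimited table like this. The function name `aggregate_qc`, the metric names, and the "NA" placeholder for missing values are all assumptions.

```python
import csv
import io


def aggregate_qc(per_sample):
    """Merge {sample: {metric: value}} into a single TSV string.

    Columns are the union of all metric names, sorted for stable output;
    metrics a sample lacks are written as "NA" (an arbitrary choice here).
    """
    metrics = sorted({m for stats in per_sample.values() for m in stats})
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample"] + metrics)
    for sample in sorted(per_sample):
        stats = per_sample[sample]
        writer.writerow([sample] + [stats.get(m, "NA") for m in metrics])
    return buf.getvalue()
```

A table in this shape is straightforward to load into R or a plotting layer for the graphical output.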

kpjonsson commented 5 years ago

Re: the fingerprinting SNPs, from the Conpair paper:

Selection of informative genomic markers (GRCh37/hg19) (Supplement, p. 1-2)

The selected 7387 markers meet the following criteria:

SNVs (easier to genotype from sequencing data)

exonic (to allow comparison of exome and WGS samples)

located on autosomes (to have estimates that are consistent across both sexes)

minor allele frequency (MAF) ≥ 40%, estimated across all populations in the 1000 Genomes Project (Consortium, 2012), phase 3 dataset*

linkage disequilibrium (LD) between any two markers < 0.8

From my memory, this reasoning is similar to that for the selection of fingerprinting sites from the DMP.
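
The per-marker criteria quoted above can be expressed as a simple filter. This is a minimal sketch, not Conpair's code: the `Marker` fields and `is_informative` are hypothetical, and the pairwise LD < 0.8 criterion is left out because it needs LD estimates between markers.

```python
from dataclasses import dataclass

# Autosomes only, so estimates are consistent across sexes.
AUTOSOMES = {str(i) for i in range(1, 23)}


@dataclass
class Marker:
    chrom: str   # chromosome name without a "chr" prefix (assumption)
    pos: int
    ref: str
    alt: str
    exonic: bool
    maf: float   # population minor allele frequency, 0 to 0.5


def is_informative(marker, min_maf=0.40):
    """Apply the SNV, exonic, autosomal and MAF >= 40% criteria."""
    is_snv = len(marker.ref) == 1 and len(marker.alt) == 1
    return (
        is_snv
        and marker.exonic
        and marker.chrom in AUTOSOMES
        and marker.maf >= min_maf
    )
```

LD pruning would then be a second pass over the markers that survive this filter.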

evanbiederstedt commented 5 years ago

From Ahmet:

we have a list of 1024 SNPs specifically tiled in the IMPACT panel that were selected to be MAF > 50% by Mike years ago (similar to conpair but without the restrictions of them being exonic since they are captured). they’re randomly selected across the chromosomes and help with copy number analysis too

minor contam is actually calculated at homozygous sites in the normal, and we look for presence of alternate alleles in the tumor at these sites. the alternate allele presence should not be more than sequencing error which is around 1%. If a case has on average > 2% alternate alleles, then it’s considered to have minor contamination and we adjust the mutation filters accordingly
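
A minimal sketch of the calculation Ahmet describes, assuming per-site allele fractions are already available for the matched normal and tumor. The function name, the input layout, and the 0.1 cutoff for calling a site homozygous in the normal are assumptions; only the 2% threshold comes from the description above.

```python
from statistics import mean


def minor_contamination(sites, hom_cutoff=0.1, threshold=0.02):
    """Estimate minor contamination from (normal_vaf, tumor_vaf) pairs.

    A site counts as homozygous in the normal when its alt-allele fraction
    is <= hom_cutoff (hom-ref) or >= 1 - hom_cutoff (hom-alt); hom_cutoff
    is an assumed value. At those sites the fraction of the unexpected
    allele in the tumor is averaged and compared with the threshold.
    """
    unexpected = []
    for normal_vaf, tumor_vaf in sites:
        if normal_vaf <= hom_cutoff:
            unexpected.append(tumor_vaf)        # alt reads are unexpected
        elif normal_vaf >= 1 - hom_cutoff:
            unexpected.append(1 - tumor_vaf)    # ref reads are unexpected
        # heterozygous sites in the normal are skipped
    if not unexpected:
        return None, False
    average = mean(unexpected)
    return average, average > threshold
```

The function returns both the average and a flag, so a report can show the value alongside whether the case crosses the 2% threshold.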

So I think @allanbolipata need not worry. Now that we've walked through this, I think this is basically all done. Naturally, we'll have to wrangle the outputs into something pretty, but I can do that with @kpjonsson.

I'll add the process CollectHSMetrics for GATK4; I already have a branch going.

allanbolipata commented 5 years ago

OK, perfect. I'll remove ROSLIN-QC stuff as it's become clear it's actually not needed.

I'll make a process that will serve as a placeholder and put in the input files that I think we'll need. Then we can continue adding from there.

kpjonsson commented 5 years ago

Sounds like a good start.

evanbiederstedt commented 5 years ago

Note here:

> There's example output from Alfred here: https://gear.embl.de/alfred

I like this layout better than two ggplot2 plots squashed within a single page.

evanbiederstedt commented 5 years ago

I believe this issue served its purpose.

mskcc / tempo

QC: Aggregated QC #393