mskcc / tempo

CCS research pipeline to process WES and WGS TN pairs
https://cmotempo.netlify.com/

Germline: Create test plan for SNV/indels #238

Closed evanbiederstedt closed 4 years ago

evanbiederstedt commented 5 years ago

GATK4 VariantFiltration

https://software.broadinstitute.org/gatk/documentation/tooldocs/4.beta.3/org_broadinstitute_hellbender_tools_walkers_filters_VariantFiltration.php

I have no other ideas which make sense. @kpjonsson ?

It really does matter what these germline variants are used for imho

evanbiederstedt commented 5 years ago

https://software.broadinstitute.org/gatk/documentation/tooldocs/4.beta.3/org_broadinstitute_hellbender_tools_walkers_variantutils_ValidateVariants.php

ValidateVariants \
   -R ref.fasta \
   -V input.vcf \
   --dbsnp dbsnp.vcf

I guess we could try using the latest dbsnp.vcf

That seems pretty standard, no?

kpjonsson commented 5 years ago

As far as I'm concerned, after variant calling we should remove common variants and filter on some basic thresholds to retain seemingly heterozygous variants (e.g. 0.25< VAF <0.5) called with sufficient support. We can then do some variant annotation as well.
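A minimal sketch of the kind of filter described above, in Python. Only the 0.25 < VAF < 0.5 window comes from this comment; the `Call` record, the population-frequency cutoff (`max_pop_af`), and the depth and read-support minimums are hypothetical placeholders, not values the pipeline is known to use.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One germline call (hypothetical record layout for illustration)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    ref_reads: int   # reads supporting the reference allele
    alt_reads: int   # reads supporting the alternate allele
    pop_af: float    # population allele frequency from some annotation source


def vaf(call):
    """Variant allele fraction: alt reads over total reads (0.0 if no reads)."""
    depth = call.ref_reads + call.alt_reads
    return call.alt_reads / depth if depth else 0.0


def keep(call, max_pop_af=0.01, min_vaf=0.25, max_vaf=0.5,
         min_depth=20, min_alt_reads=5):
    """Return True for rare, well-supported calls inside the VAF window.

    The VAF window is the example from the comment; every other default
    is an assumed placeholder meant to be tuned.
    """
    depth = call.ref_reads + call.alt_reads
    return (
        call.pop_af < max_pop_af
        and depth >= min_depth
        and call.alt_reads >= min_alt_reads
        and min_vaf < vaf(call) < max_vaf
    )
```

Keeping every threshold as a keyword argument makes it easy to try other cutoffs, for example `[c for c in calls if keep(c, max_vaf=0.75)]`.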

This isn't a very experimental task, so I think we can limit any test to something reasonable, like seeing what percentage of signed-out germline variants in the 1700 exomes we can recover. That will also allow us to tinker with thresholds.
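The recovery check could be sketched roughly as follows, assuming both the signed-out variants and the pipeline calls can be reduced to `(chrom, pos, ref, alt)` tuples. The function names and the tuple representation are assumptions made for illustration; real comparisons may also need allele normalization (left-alignment, multi-allelic splitting), which this sketch does not attempt.

```python
def variant_key(chrom, pos, ref, alt):
    """Normalize a variant to a comparable key (drops a leading 'chr')."""
    chrom = str(chrom)
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return (chrom, int(pos), ref.upper(), alt.upper())


def recovery(signed_out, called):
    """Fraction of signed-out variants present in the calls, plus the misses.

    Both arguments are iterables of (chrom, pos, ref, alt) tuples.
    """
    truth = {variant_key(*v) for v in signed_out}
    if not truth:
        raise ValueError("no signed-out variants to compare against")
    found = {variant_key(*v) for v in called}
    missed = sorted(truth - found)
    return len(truth & found) / len(truth), missed
```

Returning the missed variants alongside the fraction makes it easier to see which threshold removed a known variant when tinkering with the filters.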

@cband, feel free to chime in.

evanbiederstedt commented 5 years ago

I guess it depends what our goals are with this.

> As far as I'm concerned, after variant calling we should remove common variants and filter on some basic thresholds to retain seemingly heterozygous variants (e.g. 0.25< VAF <0.5) called with sufficient support. We can then do some variant annotation as well.

I'm not sure if this is part of a pipeline or an analysis. I guess the most conservative thing to do would be to annotate the common variants rather than remove them. Is there a potential analysis whereby we would want the common variants?

kpjonsson commented 5 years ago

In general no, with the exception of founder mutations that are common within certain subpopulations (e.g. Ashkenazi-Jewish BRCA mutations).

cband commented 5 years ago

For most of the stuff we do, common variants have limited utility and it would be fine to filter them out. But since these are WES/WGS, I am wondering if we would be better off having the pipeline output a clean set of common variants, so that it can facilitate potential analyses that we cannot think of now: for example, high-resolution ancestry estimation, or specific analyses associating common variants with different phenotypes. In a perfect world, we would want to capture all inherited variants of good quality (that is, excluding CH and other systematic variants) among the sequenced regions for every sample. I know filtering could be challenging, so is there a possibility of deferring this responsibility to the downstream analyst, at least for now? Refining the filters would need somebody with experience working on a project with the common variants (just a thought).

kpjonsson commented 5 years ago

I think that's reasonable. My idea for this (and all other types of output data) is, as Barry alludes to, to allow "power users" to choose more detailed output. That would, in this case, be all variants rather than a condensed list.

cband commented 5 years ago

@kpjonsson I think I misunderstood the initial discussion. From your recent comment, we are actually advocating for the same thing :)

evanbiederstedt commented 5 years ago

> My idea for this (and all other types of output data) is, as Barry alludes to, to allow "power users" to choose more detailed output. That would, in this case, be all variants rather than a condensed list.

Would you output both (A) all variants and (B) condensed list? Or a single annotated output?

cband commented 5 years ago

See @kpjonsson's comment.

evanbiederstedt commented 5 years ago

Here are the GIAB samples: HG002, HG003, HG004 (Ashkenazi Trio), HG006, HG007 (Han Chinese)

All of the GIAB FASTQs for germline work are here: /juno/work/taylorlab/cmopipeline/GIAB_samples/fastqs

All of the BAMs are here: /juno/work/taylorlab/cmopipeline/GIAB_samples/bams

All of the VCFs/BEDs with truth are here: /juno/work/taylorlab/cmopipeline/GIAB_samples/

Justin Zook: "The easiest way to compare is to use the GA4GH benchmarking tool on precisionFDA, but you can also use hap.py on the command line". I'll share details in person, if desired.
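For the command-line route, a hap.py comparison against a GIAB truth set might be assembled as in the sketch below. It only builds the argument list and does not run anything. The file names are hypothetical placeholders (the actual truth VCF, confident-region BED, and reference names under the directories above are not given here), and the assumed invocation is the basic `hap.py TRUTH QUERY -f BED -r REF -o PREFIX` form.

```python
def happy_command(truth_vcf, query_vcf, confident_bed, reference, out_prefix):
    """Build a hap.py argument list comparing query calls to a truth set.

    -f restricts the comparison to the confident regions, -r gives the
    reference FASTA, and -o sets the prefix for hap.py's output files.
    """
    return [
        "hap.py", truth_vcf, query_vcf,
        "-f", confident_bed,
        "-r", reference,
        "-o", out_prefix,
    ]


# Hypothetical file names, for illustration only.
cmd = happy_command(
    "HG002_truth.vcf.gz",
    "HG002_pipeline.vcf.gz",
    "HG002_confident.bed",
    "ref.fasta",
    "HG002_benchmark",
)
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where hap.py is installed.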

gongyixiao commented 4 years ago

We now produce detailed output for germline mutation calls.