lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
127 stars 32 forks source link

Panel of Normal VCF file and Mutect2 POPAF #117

Closed npatel22526 closed 4 years ago

npatel22526 commented 4 years ago

Hello,

I've been searching through all the threads for choosing correct VCF file. I have 23 Process matched controls and 3 Tumor samples , where I want to investigate allele specific copy number , LOH and Purity. I am using PureCN 1.17.1 (Also tried PureCN 1.16)

Using the 23 process matched normal I ran CNVkit and generated necessary seg and cnr files

Then, I created Panel of Normal with Mutect2 using GATK's most recent version 4.1.4.1, using that I detected Somatic variants in each of the Tumor samples. I followed GATK tutorial here

These will be my VCFs to be used with PureCN.R correct ? (FYI A quick note for others who might get an error, the VCF created by mutect2 will cause error, unless you split and remove all sites with alternate allele "*")

As per PureCN's vignettes, it is suggested to create a mappingbiasfile using Panel of Normal , but I can not use the VCF file created using Mutect2's Panel of Normal as the version does not output any AD field in the VCF. So I ran haplotype caller to create the Panel of Normal to be used in NormalDB step.

Everything went fine till the last step and when I looked the log file (attached )I found below line

vcf.file has no DB info field for membership in germline databases. Found and used valid population allele frequency > 0.001000

The population allle frequency is the POPAF field of of the Mutect2 VCF file ,when I looked at the VCF file I found that POPAF is always very high (almost always > 1) and that is because it's -log10 of the AF found in normal.

INFO=

So my question is, dose PureCN is applying the filter based on log value or raw AF ? Is it OK to create panel of normal VCF using haplotype caller ?

Best, Nick SampleA1.PureCN.log

lima1 commented 4 years ago

Hi Nick,

I'll try to run the tutorial on some of my test samples. It should read VCFs generated by M2. Give me 2-3 days.

For POPAF filter, it assumes raw. I think there is a way to change the cutoff, but maybe not in the command line interface.

Markus

npatel22526 commented 4 years ago

Hello Markus, Thanks for quick reply. Just to confirm you are testing if NormalDB reads the VCF from Mutect2 panel of normal ?
Regarding POPAF , I see that there is an option to specify the field , I might re-annotate the file with appropriate frequency and that should resolve it without changing the core code .

Best, Nick

lima1 commented 4 years ago

Yes, I think so... let me know if not.

Sent from my iPhone

On Mar 18, 2020, at 12:04 PM, Nihir notifications@github.com wrote:

 Hello Markus, Thanks for quick reply. Just to confirm you are testing if NormalDB reads the VCF from Mutect2 panel of normal ? Regarding POPAF , I see that there is an option to specify the field , I might re-annotate the file with appropriate frequency and that should resolve it without changing the core code .

Best, Nick

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

npatel22526 commented 4 years ago

Hello Markus, I checked the code of NormalDB.R and it seems that it looks for "AD" in the genotype field of VCF that is consistent with the error I am getting. So the control VCF generated via Mutect2 Panel of Normal with this tutorial may not work. Meanwhile, can you comment if VCF generated with Haplotype Caller would have sever effect if used as normal panel VCF to remove mapping biases.

Best, Nick

lima1 commented 4 years ago

Hi Nick,

Yes, the CreateSomaticPanelOfNormals output is unfortunately not sufficient. I think there might be a way of reading the GenomicsDBImport directly and extract all the information we need. I will look into this.

For now, I would recommend using GATK3 to merge the normal VCFs into a multi-sample VCF (see the FAQ in the main vignette). That SHOULD work, as long as the AD genotype field survives the merging.

Did you figure out a good way of calling Mutect2 to get similar results as with Mutect1?

I just started looking at yesterday, but this seems to remove way too many calls:

 gatk Mutect2 -R $REFERENCE \
        -I $BAM \
        -O $OUT/$SAMPLEID/M2/${SAMPLEID}_bwa_mutect2_unfiltered.vcf.gz \
        --germline-resource somatic-hg38_af-only-gnomad.hg38_contigs.vcf.gz \
        --genotype-pon-sites true \
        -pon $GATK4_PON \
        --f1r2-tar-gz $OUT/$SAMPLEID/M2/f1r2.tar.gz 

    gatk LearnReadOrientationModel -I $OUT/$SAMPLEID/M2/f1r2.tar.gz \
        -O $OUT/$SAMPLEID/M2/read-orientation-model.tar.gz

    gatk FilterMutectCalls -R $REFERENCE \
        -V $OUT/$SAMPLEID/M2/${SAMPLEID}_bwa_mutect2_unfiltered.vcf.gz \
        --ob-priors $OUT/$SAMPLEID/M2/read-orientation-model.tar.gz \
        -O $OUT/$SAMPLEID/M2/${SAMPLEID}_bwa_mutect2.vcf.gz
npatel22526 commented 4 years ago

Thanks for quick response. So it seems HC is not advised or tested. I will try merging the VCF with GATK3.

Regarding Mutect2, yes it filters out lot of calls. To remedy that use the the flag "-genotype-germline-sites" in the first step, it's experimental but seems to add quite a lot of variants back into VCF. https://gatk.broadinstitute.org/hc/en-us/articles/360041416112-Mutect2#--genotype-germline-sites

Thanks again for all your help. Best, Nick

lima1 commented 4 years ago

Yes, it's probably best to use M2 for both tumor and PoN. If you are only interested in copy number and LOH, using HC for both tumor and PoN should work in theory. If not, please feel free to add some details and I'll try to reproduce.

Thanks, will try the flag and keep you posted here.

npatel22526 commented 4 years ago

Hello Markus,

Updating on my efforts thought would be helpful to others. The goal here is to create a multi-sample VCF from control samples using M2, hence, the first thought would come to mind would be to follow this tutorial to create panel of normal using Mutect2. But issues as stated above is that the final PON vcf is not compatible. Neither, merging intermediate VCFs using GATK3 solves the problem as FMT fields becomes unreliable. I tried BCFtools as well but not confident if either VCF should be used in downstream analysis.

A potential solution would be to create a multiple sample VCF as we would create a tumor VCF ? The command could be like this

 gatk Mutect2 -R $REFERENCE \
        -genotype-germline-sites true \
        -I $BAM_Sample1 \
        -I $BAM_Sample2 \
        -I $BAM_Sample3 \
        -I $BAM_Sample4 \
        -O $OUT/PON.vcf.gz \

Please let me know your thoughts on this. Unfortunately, I don't have any gold standard in hand so I can't validate the results but if you could test it, that would a simple solution for many who are ready to switch to M2.

Best, Nick

lima1 commented 4 years ago

Unreliable? We only need the AD field of real calls (this step is mostly to adjust biased allelic fractions of germline SNPs - we hope that artifacts are already filtered or labelled).

But I'll try to reproduce and compare against M1 this week. Interesting idea with M2 directly, I'll have a look and compare that one too.

lima1 commented 4 years ago

The POPAF issue should be fixed now.

npatel22526 commented 4 years ago

OK great thanks for the update. For now above strategizes with updated POPAF are working well, so closing the issue.

Best, Nick

lima1 commented 4 years ago

@npatel22526 there is now experimental support for GenomicsDB. Requires the https://github.com/nalinigans/GenomicsDB-R package. Simply replace the VCF with the GenomicsDB path (e.g. pon_db/) in NormalDB.R --normal_panel.