lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0

GDC GATK version incompatible with PureCN? #259

Closed by ghost 1 year ago

ghost commented 1 year ago

I am trying to use the VCF files generated from WES data in TCGA (https://portal.gdc.cancer.gov/repository). I tried running PureCN on the MuTect2 VCFs (controlled access), but this resulted in an error:

Error: Segmentation and VCF do not overlap.

I noticed in your documentation that MuTect2 from GATK versions earlier than 4.1.7 will not work with PureCN. The GDC pipeline website shows that the MuTect2 VCFs were generated with GATK 4.0.4.0 (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/).

Is this an issue with GDC VCFs?

java -Djava.io.tmpdir=/tmp/job_tmp_3 -d64 -jar -Xmx3G -XX:+UseSerialGC \
/bin/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar \
Mutect2 \
-R GRCh38.d1.vd1.fa \
-L chr4:1-190214555 \ # Specify chromosome
-I Tumor_Sample_Alignment.bam \
-O 3.mt2.vcf \
-tumor \ # From step 4
--af-of-alleles-not-in-resource 2.5e-06 \
--germline-resource af-only-gnomad.hg38.vcf.gz \ # Germline reference from gnomad
-pon gatk4_mutect2_4136_pon.vcf.gz # New panel of normal created by 4136 TCGA curated normal samples, using GATK4

lima1 commented 1 year ago

Hi,

you would need VCFs that contain germline SNPs. This looks like a tumor-only run, so you might be fine, but I am not sure whether they remove the variants that match the germline resource. The VCF should be cleaned of artifacts, but otherwise all variants, germline and somatic, should be present. The vignettes list the parameters you need to retain the germline sites.
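One rough way to see whether germline sites survived in a given VCF is to tally the FILTER column and look at which labels appear. The sketch below is only an illustration: the function name `tally_filters` is hypothetical, and the filter labels in the example records are made up for the demonstration; the labels a real GDC file carries depend on the GATK version and should be read from the file itself.

```python
from collections import Counter


def tally_filters(vcf_lines):
    """Count how often each FILTER label occurs in the data records of a VCF."""
    counts = Counter()
    for line in vcf_lines:
        # Skip blank lines and all header lines ("##..." and "#CHROM...").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # not a complete VCF record
        # Column 7 (index 6) is FILTER; multiple labels are ';'-separated.
        for label in fields[6].split(";"):
            counts[label] += 1
    return counts


# Illustrative records only; the filter labels here are assumptions.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr4\t1000\t.\tA\tG\t.\tPASS\t.\n",
    "chr4\t2000\t.\tC\tT\t.\tgermline_risk\t.\n",
    "chr4\t3000\t.\tG\tA\t.\tgermline_risk;panel_of_normals\t.\n",
]

for label, n in sorted(tally_filters(example).items()):
    print(label, n)
```

For a compressed file, the same function can be fed from `gzip.open(path, "rt")`. If no germline-related label shows up and only a small number of records pass, that would be consistent with the germline sites having been removed upstream, although it does not prove it.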

GATK also cleaned up the VCF specs quite a bit between 4.0 and 4.1.7. I can't remember why I added that disclaimer, but the fields for population allele frequencies, base quality scores etc. changed and are probably not parsed correctly. You might be able to make it work now, though, since most VCF field names are configurable.
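To find out which field names a particular VCF actually declares, the header can be scanned for INFO and FORMAT IDs and compared against the names you expect. This is a minimal sketch: `header_field_ids` and `missing_fields` are hypothetical helpers, and the IDs in the example (`POP_AF`, `POPAF`, `AD`) are placeholders chosen for illustration, not a confirmed list of what either GATK version writes or what PureCN reads.

```python
import re

# Matches header lines such as: ##INFO=<ID=XYZ,Number=1,...>
HEADER_RE = re.compile(r"^##(INFO|FORMAT)=<ID=([^,>]+)")


def header_field_ids(vcf_lines):
    """Collect the INFO and FORMAT IDs declared in a VCF header."""
    found = {"INFO": set(), "FORMAT": set()}
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM line
        match = HEADER_RE.match(line)
        if match:
            kind, field_id = match.groups()
            found[kind].add(field_id)
    return found


def missing_fields(vcf_lines, expected):
    """Return, per kind, the expected IDs that the header does not declare."""
    found = header_field_ids(vcf_lines)
    return {
        kind: sorted(set(ids) - found.get(kind, set()))
        for kind, ids in expected.items()
    }


# Illustrative header; the field IDs are assumptions for this example.
example_header = [
    "##fileformat=VCFv4.2\n",
    '##INFO=<ID=POP_AF,Number=A,Type=Float,Description="example">\n',
    '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="example">\n',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tTUMOR\n",
]

print(missing_fields(example_header, {"INFO": ["POPAF"], "FORMAT": ["AD"]}))
```

Any ID reported as missing is a candidate for remapping through the configurable field names, or a sign that the file would need to be regenerated with a newer GATK.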

Markus