lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
127 stars 32 forks source link

Mutect2 VCF error #79

Closed TylerMclaughlin closed 5 years ago

TylerMclaughlin commented 5 years ago

Hi, Markus and PureCN devs! Could use some help getting around a new bug.

So I'm trying to run PureCN for a tumor sample without a matched normal. My VCF comes from Mutect2 and I am using a process-matched normal coverage database 'agilent_capture_kit.rds ' generated by PureCN, as well as 'baits_hg38_intervals.txt' also generated by PureCN.

Here's the command I am running, derived from the quick-start tutorial:

# Without a matched normal (minimal test run)
Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \
--tumor $OUT_REF/${SAMPLEID}_tumor-ready_coverage_loess.txt \
--sampleid $SAMPLEID \
--vcf '/data/${SAMPLEID}_tumor/${SAMPLEID}_tumor-mutect2.vcf.gz' \
--normaldb $OUT_REF/agilent_capture_kit.rds \
--intervals $OUT_REF/baits_hg38_intervals.txt \
--genome hg38

And here's the error message:

Error in which(as.numeric(geno(vcf)$MBQ[, tumor.id.in.vcf]) >= min.base.quality) : 
  (list) object cannot be coerced to type 'double'
Calls: runAbsoluteCN ... filterVcfMuTect2 -> filterVcfBasic -> .filterVcfByBQ -> which
In addition: Warning message:
In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
Execution halted

I appreciate any help!

lima1 commented 5 years ago

Hi Tyler, which version of GATK4 are you using? This happens also without --normaldb?

TylerMclaughlin commented 5 years ago

Hi Markus. I'm using GATK4. When I omit the --normaldb flag, I get the following:

Error in .getNormalCoverage(normal.coverage.file) : 
  Need either normalDB or normal.coverage.file
Execution halted
lima1 commented 5 years ago

Ah, sorry, forget about --normaldb, I was thinking of --normal_panel.

Which exact version of GATK4? Did they maybe recently remove the MBQ field from the VCF? If there is a MBQ field, can you paste the complete content of Sampleid.log?

TylerMclaughlin commented 5 years ago

the exact GATK version should be 4.0.8.1, via "gatk-package-4.0.8.1-local.jar". The reason I'm unsure is I also have a '-4.0.7.0-local.jar'. Both come with the bcbio installation and the default configuration for MuTect2 is just specified as "GATK 4"

I definitely have an MBQ field. Hmmm..... there's no log file for mutect2. I have a stats file. would that work?

lima1 commented 5 years ago

Sorry, I mean the PureCN log-file.

TylerMclaughlin commented 5 years ago

Sure!

INFO [2019-04-24 19:01:03] ------------------------------------------------------------
INFO [2019-04-24 19:01:03] PureCN 1.12.2
INFO [2019-04-24 19:01:03] ------------------------------------------------------------
INFO [2019-04-24 19:01:03] Arguments: -tumor.coverage.file /unitynfs1/projects/mclaurt/m15-891-project/copy_number_analysis/tumor_only/purecn/output/normal_coverage/13004_tumor-ready_coverage_loess.txt -seg.file  -vcf.file /unitynfs1/projects/mclaurt/m15-891-project/using_bcbio_templates/dna_calls/tumor_only/13004/results/13004_tumor/13004_tumor-mutect2.vcf.gz -genome hg38 -sex ? -args.setPriorVcf 6 -args.setMappingBiasVcf NULL -args.segmentation NULL,NULL,0.005,NULL -sampleid 13004 -min.ploidy 1 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file /unitynfs1/projects/mclaurt/m15-891-project/copy_number_analysis/tumor_only/purecn/output/normal_coverage/baits_hg38_intervals.txt -max.segments 300 -plot.cnv TRUE -DB.info.flag DB -POPAF.info.field POP_AF -model beta -post.optimize FALSE -BPPARAM  -log.file /unitynfs1/projects/mclaurt/m15-891-project/copy_number_analysis/tumor_only/purecn/output/13004/13004.log -normal.coverage.file <data> -normalDB <data> -args.filterVcf <data> -fun.segmentation <data> -test.num.copy <data> -test.purity <data> -speedup.heuristics <data>
INFO [2019-04-24 19:01:03] Loading coverage files...
WARN [2019-04-24 19:01:08] Multiple chromosomes with very low coverage: chr1_KI270766v1_alt,chr15_KI270850v1_alt,chr17_KI270909v1_alt,chr19_KI270938v1_alt,chr22_KI270879v1_alt,chr7_KI270803v1_alt
INFO [2019-04-24 19:01:11] Mean target coverages: 163X (tumor) 148X (normal).
INFO [2019-04-24 19:01:12] Mean coverages: chrX: 86.37, chrY: 114.91, chr1-22: 113.83.
INFO [2019-04-24 19:01:12] Mean coverages: chrX: 81.43, chrY: 113.69, chr1-22: 109.96.
WARN [2019-04-24 19:01:17] tumor.coverage.file and interval.file do not align.
INFO [2019-04-24 19:01:38] Removing 2443 intervals with missing log.ratio.
INFO [2019-04-24 19:01:39] Removing 11 low/high GC targets.
INFO [2019-04-24 19:01:39] Removing 10332 intervals excluded in normalDB.
INFO [2019-04-24 19:01:39] Removing 1 intervals with low total coverage in normal (< 150.00 reads).
INFO [2019-04-24 19:01:39] normalDB provided. Setting minimum coverage for segmentation to 0.0015X.
INFO [2019-04-24 19:01:39] Removing 9644 low coverage (< 0.0015X) intervals.
INFO [2019-04-24 19:01:40] Using 248489 intervals (225643 on-target, 22846 off-target).
INFO [2019-04-24 19:01:40] Ratio of mean on-target vs. off-target read counts: 0.37
INFO [2019-04-24 19:01:40] Mean off-target bin size: 91181
INFO [2019-04-24 19:01:41] AT/GC dropout: 1.04 (tumor), 1.03 (normal). 
INFO [2019-04-24 19:01:41] Loading VCF...
INFO [2019-04-24 19:04:09] Found 793439 variants in VCF file.
INFO [2019-04-24 19:04:23] Removing 12196 triallelic sites.
WARN [2019-04-24 19:04:43] vcf.file has no DB or POP_AF info field for membership in germline databases. Guessing it based on available dbSNP ID.
INFO [2019-04-24 19:04:48] 220315 (28.2%) variants annotated as likely germline (DB INFO flag).
INFO [2019-04-24 19:05:50] 13004_tumor is tumor in VCF file.
INFO [2019-04-24 19:06:16] 847 homozygous and 202 heterozygous variants on chrX.
INFO [2019-04-24 19:06:16] Sex from VCF: M (Fisher's p-value: < 0.0001, odds-ratio: 10.04).
INFO [2019-04-24 19:06:16] Detected MuTect2 VCF.
INFO [2019-04-24 19:06:32] Removing 641523 MuTect2 calls due to blacklisted failure reasons.
INFO [2019-04-24 19:07:05] Initial testing for significant sample cross-contamination: maybe
INFO [2019-04-24 19:07:09] Removing 101587 variants with AF < 0.030 or AF >= 0.970 or less than 3 supporting reads or depth < 15.
TylerMclaughlin commented 5 years ago

Alright, I tried removing all the indels from the VCF with bcftools. I also tried keeping only monoallelic alternate alleles. Unfortunately neither method worked.

Here's an example of an entry in the input VCF:

chr1    602107  rs144672821     C       T       .       bad_haplotype;clustered_events;LowPriority      ClippingRankSum=-1.593;DP=9;ECNT=3;FS=0;MQ=45.32;MQ0=0;MQRankSum=1.221;POP_AF=5e-08;P_CONTAM=0;P_GERMLINE=-2.348;REF_BASES=CCCTCTGCAGCGTCCATGCCG;ReadPosRankSum=1.221;TLOD=34.05; EPR=lowseverity 
 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB 
   0/1:1,8:0.889:9:1,7:0,1:37,37:297,142:47:58:0|1:602067_A_G:0.879,0.889,0.889:0.027,0.027,0.947

The columns are CHROM . POS . ID . REF . ALT . QUAL . FILTER . INFO (shortened for readability on GitHub) . FORMAT . TUMOR_SAMPLE

lima1 commented 5 years ago

Thanks, yes, PureCN should correctly deal with indels and multi-allelic sites. And yes, the log-file looks good (except for the low coverage warnings, make sure to remove alt-contigs from the input BED file, but this should not affect this filtering function - I'll add a warning to IntervalFile.R).

@andrewrech pushed a few fixes for Mutect2, maybe the current GitHub works? If not, I'll try to reproduce in the coming days.

TylerMclaughlin commented 5 years ago

Thanks for the update, Markus.

Using version 1.13.22, which I installed using BiocManager::install("lima1/PureCN") still returns the same exact error. The log file confirms that this is the version that is actually being used. Is this the latest version?

lima1 commented 5 years ago

Yes, thanks, that's the latest version. I'll try to reproduce with latest GATK4 and will get back to you soon.

andrewrech commented 5 years ago

Thank you Markus, I am happy to test this, may take me some time

On Apr 29, 2019, at 04:54, M. Riester notifications@github.com wrote:

Yes, thanks, that's the latest version. I'll try to reproduce with latest GATK4 and will get back to you soon.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

TylerMclaughlin commented 5 years ago

I can safely say that if I use a FreeBayes .vcf instead of a Mutect2 .vcf (both generated using bcbio), I progress beyond this step without issues. PureCN is currently evaluating 19 purity/ploidy combinations via grid search.

lima1 commented 5 years ago

Great, feel free to send a screenshot of the BAF-plot (Sampleid.pdf) and log file output by email. I can usually quickly point to problems or possible improvements.

Thorough testing of GATK4.1 is now high on my priority list for the next release, now that they added some cfDNA specific filters.

Not sure if bcbio still supports Mutect1, but if it's easy for you to run, I would suggest running it at least for testing purposes.

TylerMclaughlin commented 5 years ago

Great! I will send figures and a note to your email address. Thanks for offering this!!

lima1 commented 5 years ago

Closing for now, GATK4/Mutect2 progress documented in https://github.com/lima1/PureCN/issues/6