lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
125 stars 32 forks source link

Error: VCF either contains no germline variants or variants are not properly #357

Closed ShvartsmanIrina closed 3 months ago

ShvartsmanIrina commented 4 months ago

Hello lima, I'm trying to run PureCN with NormalDB, but I get Error: INFO [2024-03-11 20:23:58] PureCN 2.8.1 INFO [2024-03-11 20:23:58] ------------------------------------------------------------ INFO [2024-03-11 20:23:58] Arguments: -tumor.coverage.file /cromwell-executions/PurecnTumor/a9de9374-76c9-44cd-a3c6-e68e5fe130a8/call-PureCN/inputs/-285599999/102_S544_coverage_loess.txt.gz -log.ratio -seg.file -vcf.file /cromwell-executions/PurecnTumor/a9de9374-76c9-44cd-a3c6-e68e5fe130a8/call-PureCN/inputs/1991514784/102_S544.vcf -genome hg38 -sex ? -args.setPriorVcf 6 -args.setMappingBiasVcf /cromwell-executions/PurecnTumor/a9de9374-76c9-44cd-a3c6-e68e5fe130a8/call-PureCN/inputs/1662463198/mapping_bias_hg38.rds -args.filterIntervals 100,0.05 -args.segmentation 0.005,NULL, -sampleid 102_S544 -min.ploidy 1.4 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file /cromwell-executions/PurecnTumor/a9de9374-76c9-44cd-a3c6-e68e5fe130a8/call-PureCN/inputs/1241338070/baits_hg38_intervals.txt -min.logr.sdev 0.15 -max.segments 300 -plot.cnv TRUE -vcf.field.prefix PureCN. -cosmic.vcf.file -DB.info.flag DB -POPAF.info.field POP_AF -Cosmic.CNT.info.field Cosmic.CNT -model beta -post.optimize TRUE -BPPARAM -log.file ./102_S544.log -normal.coverage.file -normalDB -args.filterVcf -fun.segmentation -test.num.copy -test.purity -speedup.heuristics INFO [2024-03-11 20:23:58] Loading coverage files... INFO [2024-03-11 20:24:02] Mean target coverages: 6X (tumor) 6X (normal). INFO [2024-03-11 20:24:04] Mean coverages: chrX: 1.68, chrY: 0.95, chr1-22: 1.73. INFO [2024-03-11 20:24:04] Mean coverages: chrX: 2.30, chrY: 0.95, chr1-22: 1.55. INFO [2024-03-11 20:24:14] Removing 129536 intervals with missing log.ratio. INFO [2024-03-11 20:24:14] Removing 100 low/high GC targets. INFO [2024-03-11 20:24:15] Removing 43878 intervals excluded in normalDB. INFO [2024-03-11 20:24:15] Removing 9905 intervals with low total coverage in normal (< 150.00 reads). INFO [2024-03-11 20:24:15] normalDB provided. Setting minimum coverage for segmentation to 0.0015X. INFO [2024-03-11 20:24:15] Removing 17731 low count (< 100 total reads) intervals. INFO [2024-03-11 20:24:15] Removing 1383 low coverage (< 0.0015X) intervals. INFO [2024-03-11 20:24:15] Using 42217 intervals (23238 on-target, 18979 off-target). INFO [2024-03-11 20:24:15] Ratio of mean on-target vs. off-target read counts: 0.11 INFO [2024-03-11 20:24:15] Mean off-target bin size: 99902 INFO [2024-03-11 20:24:15] AT/GC dropout: 0.83 (tumor), 0.79 (normal), 1.04 (coverage log-ratio). INFO [2024-03-11 20:24:15] High GC-bias in normal and/or tumor coverage. INFO [2024-03-11 20:24:15] Loading VCF... INFO [2024-03-11 20:24:29] Found 593360 variants in VCF file. INFO [2024-03-11 20:24:31] Removing 4120 triallelic sites. INFO [2024-03-11 20:24:32] Maximum of POPAP INFO is > 1, assuming -log10 scaled values WARN [2024-03-11 20:24:32] vcf.file has no DB info field for membership in germline databases. Found and used valid population allele frequency > 0.001000 instead. INFO [2024-03-11 20:24:33] 0 (0.0%) variants annotated as likely germline (DB INFO flag). FATAL [2024-03-11 20:24:33] VCF either contains no germline variants or variants are not properly

FATAL [2024-03-11 20:24:33] annotated.

FATAL [2024-03-11 20:24:33]

FATAL [2024-03-11 20:24:33] This is most likely a user error due to invalid input data or

FATAL [2024-03-11 20:24:33] parameters (PureCN 2.8.1).

I'm using Muteck2 both for building NormalDB and tumor sample: gatk Mutect2 \ -R ~{ref_fasta} \ -I ~{bam} \ -O ~{BAM_pre}.vcf \ --genotype-germline-sites true \ --genotype-pon-sites true \ --interval-padding 50

my tumor.vcf is:

fileformat=VCFv4.2

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

GATKCommandLine=<ID=Mutect2,CommandLine="Mutect2 --genotype-pon-sites true --genotype-germline-sites true --output 102_S544.vcf --interval-padding 50 --input /cromwell-executions/Mutect2/34dc964c-ae1e-4d3a-9d62-fd1164a46541/call-M1/inputs/-1728826228/102_S544.bam --reference /cromwell-executions/Mutect2/34dc964c-ae1e-4d3a-9d62-fd1164a46541/call-M1/inputs/1253363328/hg38.fa --f1r2-median-mq 50 --f1r2-min-bq 20 --f1r2-max-depth 200 --flow-likelihood-parallel-threads 0 --flow-likelihood-optimized-comp false --trim-to-haplotype true --exact-matching false --flow-use-t0-tag false --flow-probability-threshold 0.003 --flow-remove-non-single-base-pair-indels false --flow-remove-one-zero-probs false --flow-quantization-bins 121 --flow-fill-empty-bins-value 0.001 --flow-symmetric-indel-probs false --flow-report-insertion-or-deletion false --flow-disallow-probs-larger-than-call false --flow-lump-probs false --flow-retain-max-n-probs-base-format false --flow-probability-scaling-factor 10 --flow-order-cycle-length 4 --keep-boundary-flows false --af-of-alleles-not-in-resource -1.0 --mitochondria-mode false --mutect3-training-mode false --mutect3-ref-downsample 10 --mutect3-alt-downsample 20 --mutect3-non-artifact-ratio 20 --tumor-lod-to-emit 3.0 --initial-tumor-lod 2.0 --pcr-snv-qual 40 --pcr-indel-qual 40 --base-qual-correction-factor 5 --max-population-af 0.01 --downsampling-stride 1 --callable-depth 10 --max-suspicious-reads-per-alignment-start 0 --normal-lod 2.2 --ignore-itr-artifacts false --gvcf-lod-band -2.5 --gvcf-lod-band -2.0 --gvcf-lod-band -1.5 --gvcf-lod-band -1.0 --gvcf-lod-band -0.5 --gvcf-lod-band 0.0 --gvcf-lod-band 0.5 --gvcf-lod-band 1.0 --minimum-allele-fraction 0.0 --independent-mates false --flow-mode NONE --disable-adaptive-pruning false --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --max-num-haplotypes-in-population 128 --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --pruning-seeding-lod-threshold 9.210340371976184 --max-unpruned-variants 100 --linked-de-bruijn-graph false --disable-artificial-haplotype-recovery false --enable-legacy-graph-cycle-detection false --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --num-matching-bases-in-dangling-end-to-recover -1 --error-correction-log-odds -Infinity --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --dragstr-het-hom-ratio 2 --dont-use-dragstr-pair-hmm-scores false --pair-hmm-gap-continuation-penalty 10 --expected-mismatch-rate-for-read-disqualification 0.02 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --disable-symmetric-hmm-normalizing false --disable-cap-base-qualities-to-map-quality false --enable-dynamic-read-disqualification-for-genotyping false --dynamic-read-disqualification-threshold 1.0 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --flow-hmm-engine-min-indel-adjust 6 --flow-hmm-engine-flat-insertion-penatly 45 --flow-hmm-engine-flat-deletion-penatly 45 --pileup-detection false --use-pdhmm false --use-pdhmm-overlap-optimization false --make-determined-haps-from-pd-code false --print-pileupcalling-status false --fallback-gga-if-pdhmm-fails true --pileup-detection-enable-indel-pileup-calling false --pileup-detection-active-region-phred-threshold 0.0 --num-artificial-haplotypes-to-add-per-allele 5 --artifical-haplotype-filtering-kmer-size 10 --pileup-detection-snp-alt-threshold 0.1 --pileup-detection-indel-alt-threshold 0.1 --pileup-detection-absolute-alt-depth 0.0 --pileup-detection-snp-adjacent-to-assembled-indel-range 5 --pileup-detection-snp-basequality-filter 12 --pileup-detection-bad-read-tolerance 0.0 --pileup-detection-proper-pair-read-badness true --pileup-detection-edit-distance-read-badness-threshold 0.08 --pileup-detection-chimeric-read-badness true --pileup-detection-template-mean-badness-threshold 0.0 --pileup-detection-template-std-badness-threshold 0.0 --pileup-detection-filter-assembly-alt-bad-read-tolerance 0.0 --pileup-detection-edit-distance-read-badness-for-assembly-filtering-threshold 0.12 --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --override-fragment-softclip-check false --min-base-quality-score 10 --smith-waterman FASTEST_AVAILABLE --emit-ref-confidence NONE --max-mnp-distance 1 --force-call-filtered-alleles false --reference-model-deletion-quality 30 --soft-clip-low-quality-ends false --allele-informative-reads-overlap-margin 2 --smith-waterman-dangling-end-match-value 25 --smith-waterman-dangling-end-mismatch-penalty -50 --smith-waterman-dangling-end-gap-open-penalty -110 --smith-waterman-dangling-end-gap-extend-penalty -6 --smith-waterman-haplotype-to-reference-match-value 200 --smith-waterman-haplotype-to-reference-mismatch-penalty -150 --smith-waterman-haplotype-to-reference-gap-open-penalty -260 --smith-waterman-haplotype-to-reference-gap-extend-penalty -11 --smith-waterman-read-to-haplotype-match-value 10 --smith-waterman-read-to-haplotype-mismatch-penalty -15 --smith-waterman-read-to-haplotype-gap-open-penalty -30 --smith-waterman-read-to-haplotype-gap-extend-penalty -5 --flow-assembly-collapse-hmer-size 0 --flow-assembly-collapse-partial-mode false --flow-filter-alleles false --flow-filter-alleles-qual-threshold 30.0 --flow-filter-alleles-sor-threshold 3.0 --flow-filter-lone-alleles false --flow-filter-alleles-debug-graphs false --min-assembly-region-size 50 --max-assembly-region-size 300 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --assembly-region-padding 100 --padding-around-indels 75 --padding-around-snps 20 --padding-around-strs 75 --max-extension-into-assembly-region-padding-legacy 25 --max-reads-per-alignment-start 50 --enable-legacy-assembly-region-trimming false --interval-set-rule UNION --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --max-variants-per-shard 0 --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --minimum-mapping-quality 20 --max-read-length 2147483647 --min-read-length 30 --disable-tool-default-annotations false --enable-all-annotations false",Version="4.5.0.0",Date="March 11, 2024 at 1:44:25 PM GMT">

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

MutectVersion=2.2

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

....

filtering_status=Warning: unfiltered Mutect 2 calls. Please run FilterMutectCalls to remove false positives.

source=Mutect2

tumor_sample=102_S544

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 102_S544

chr1 13273 . G C . . AS_SB_TABLE=18,17|11,10;DP=60;ECNT=1;MBQ=40,40;MFRL=195,214;MMQ=27,24;MPOS=27;POPAF=7.30;TLOD=48.28 GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:35,21:0.407:56:22,14:8,7:31,21:18,17,11,10 chr1 13417 . C CGAGA . . AS_SB_TABLE=0,7|0,4;DP=14;ECNT=1;MBQ=40,40;MFRL=201,210;MMQ=22,24;MPOS=31;POPAF=7.30;RPA=2,4;RU=GA;STR;TLOD=14.22 GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:7,4:0.377:11:1,0:1,0:7,4:0,7,0,4 chr1 17385 . G A . . AS_SB_TABLE=9,0|3,0;DP=12;ECNT=1;MBQ=40,40;MFRL=219,180;MMQ=34,34;MPOS=47;POPAF=7.30;TLOD=8.67 GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:9,3:0.286:12:2,0:7,3:9,3:9,0,3,0 chr1 69511 . A G . . AS_SB_TABLE=0,0|24,14;DP=39;ECNT=1;MBQ=0,40;MFRL=0,195;MMQ=60,48;MPOS=20;POPAF=7.30;TLOD=140.01 GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:0,38:0.972:38:0,18:0,14:0,33:0,0,24,14 chr1 121009 . C T . . AS_SB_TABLE=0,0|0,0;DP=3;ECNT=1;MBQ=40,40;MFRL=180,233;MMQ=29,60;MPOS=25;POPAF=7.30;TLOD=3.65 GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:1,1:0.500:2:0,0:1,1:1,1:0,1,0,1

I would really appreciate your help!

lima1 commented 4 months ago

Hi @ShvartsmanIrina , PureCN is not made for low coverage data. Anyways, looks like you didn't use gnomAD in the Mutect call. I recommend following https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2 as closely as possible (including the PoN filtering).

ShvartsmanIrina commented 4 months ago

Dear, @lima1, thank you very much for the answer! I solved this problem by adding flags in the vcf results of normal samples by FilterMutectCalls and, accordingly, when calling the tumor sample, I added the --germline-resource flag. Regarding coverage, I am validating the workflow on samples that are not entirely suitable, but I intend to run it on tumor only ovarian cancer samples, when the final result should be an HRD score. If we're talking about coverage - what is the minimum coverage you recommend?

lima1 commented 4 months ago

Depends what you want to do. Low coverage makes the variant classification very difficult, so don't expect reliable calls there. The SNP allelic fractions will also have a high standard deviation, so it won't help the segmentation a lot. We run panels with > 1000X coverage, so I don't have a lot of experience with low coverage data. 80X is probably a good minimum for what we promise in the paper. You can also try something like ichorCNA which is designed for very low coverage.

ShvartsmanIrina commented 3 months ago

Thank you, @lima1 !