lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
125 stars 32 forks source link

error with vcf from gatk mutect2 + FilterMutectCalls (v4.2.0.0) on custom reference sequence #167

Closed Just08 closed 3 years ago

Just08 commented 3 years ago

Hi,

I'm trying to use PureCN 1.20.0 from conda (I installed the missing package r-optparse from conda). My vcf file is the output of gatk FilterMutectCalls on gatk Mutect2 (--genotype-germline-sites true) output. I skip the Normal.DB step because the pon from mutect2 is empty.

I use the recommended CNVkit usage without --snpblacklist and --mappingbiasfile because i am working with a custom reference sequence :

# Export the segmentation in DNAcopy format
cnvkit.py export seg $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.cns \
    -o $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.seg

# Run PureCN by providing the *.cnr and *.seg files 
Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID  \
    --sampleid $SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.cnr \
    --segfile $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.seg \
    --vcf ${SAMPLEID}_mutect.vcf \
    --statsfile ${SAMPLEID}_mutect_stats.txt \
    --genome hg38 \
    --funsegmentation Hclust \
    --force --postoptimize --seed 123

but i have this error :

INFO [2021-03-19 19:08:43] ------------------------------------------------------------
INFO [2021-03-19 19:08:43] PureCN 1.20.0
INFO [2021-03-19 19:08:43] ------------------------------------------------------------
INFO [2021-03-19 19:08:43] Arguments: -normal.coverage.file  -tumor.coverage.file result/3_cnvkit/llc151721nmlsr.cnr -log.ratio  -seg.file result/3_cnvkit/llc151721nmlsr.seg -vcf.file result/2_mutect2/llc151721nmlsr_filtered.vcf -normalDB  -genome hg38 -sex ? -args.setPriorVcf 6 -args.setMappingBiasVcf NULL -args.segmentation 0.005,NULL -sampleid llc151721nmlsr -min.ploidy 1.4 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file  -max.segments 300 -plot.cnv TRUE -vcf.field.prefix PureCN. -DB.info.flag DB -POPAF.info.field POP_AF -model beta -post.optimize TRUE -BPPARAM  -log.file result/4_purecn/llc151721nmlsr.log -args.filterVcf <data> -fun.segmentation <data> -test.num.copy <data> -test.purity <data> -speedup.heuristics <data>
INFO [2021-03-19 19:08:43] Loading coverage files...
INFO [2021-03-19 19:08:45] Provided log2-ratio looks too noisy, using segmentation only.
WARN [2021-03-19 19:08:45] Expecting numeric chromosome names in seg.file, assuming file is properly sorted.
WARN [2021-03-19 19:08:45] Allosome coverage missing, cannot determine sex.
WARN [2021-03-19 19:08:45] Allosome coverage missing, cannot determine sex.
INFO [2021-03-19 19:08:45] Using 12 intervals (12 on-target, 0 off-target).
INFO [2021-03-19 19:08:45] No off-target intervals. If this is hybrid-capture data, consider adding them.
INFO [2021-03-19 19:08:45] Loading VCF...
INFO [2021-03-19 19:08:48] Found 39 variants in VCF file.
INFO [2021-03-19 19:08:48] Removing 1 triallelic sites.
INFO [2021-03-19 19:08:49] Maximum of POPAP INFO is > 1, assuming -log10 scaled values
WARN [2021-03-19 19:08:49] vcf.file has no DB info field for membership in germline databases. Found and used valid population allele frequency > 0.001000 instead.
INFO [2021-03-19 19:08:49] 0 (0.0%) variants annotated as likely germline (DB INFO flag).
FATAL [2021-03-19 19:08:49] VCF either contains no germline variants or variants are not properly 

FATAL [2021-03-19 19:08:49] annotated. 

FATAL [2021-03-19 19:08:49]  

FATAL [2021-03-19 19:08:49] This is most likely a user error due to invalid input data or 

FATAL [2021-03-19 19:08:49] parameters (PureCN 1.20.0). 

header + one record of the vcf from gatk FilterMutectCalls (gatk version 4.2.0.0) annotated as germline in FILTER fields:

##fileformat=VCFv4.2
##FILTER=<ID=FAIL,Description="Fail the site if all alleles fail but for different reasons.">
##FILTER=<ID=PASS,Description="Site contains at least one allele that passes filters">
##FILTER=<ID=base_qual,Description="alt median base quality">
##FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">
##FILTER=<ID=contamination,Description="contamination">
##FILTER=<ID=duplicate,Description="evidence for alt allele is overrepresented by apparent duplicates">
##FILTER=<ID=fragment,Description="abs(ref - alt) median fragment length">
##FILTER=<ID=germline,Description="Evidence indicates this site is germline, not somatic">
##FILTER=<ID=haplotype,Description="Variant near filtered variant on same haplotype.">
##FILTER=<ID=low_allele_frac,Description="Allele fraction is below specified threshold">
##FILTER=<ID=map_qual,Description="ref - alt median mapping quality">
##FILTER=<ID=multiallelic,Description="Site filtered because too many alt alleles pass tumor LOD">
##FILTER=<ID=n_ratio,Description="Ratio of N to alt exceeds specified ratio">
##FILTER=<ID=normal_artifact,Description="artifact_in_normal">
##FILTER=<ID=orientation,Description="orientation bias detected by the orientation bias mixture model">
##FILTER=<ID=panel_of_normals,Description="Blacklisted site in panel of normals">
##FILTER=<ID=position,Description="median distance of alt variants from end of reads">
##FILTER=<ID=possible_numt,Description="Allele depth is below expected coverage of NuMT in autosome">
##FILTER=<ID=slippage,Description="Site filtered due to contraction of short tandem repeat region">
##FILTER=<ID=strand_bias,Description="Evidence for alt allele comes from one read direction only">
##FILTER=<ID=strict_strand,Description="Evidence for alt allele is not represented in both directions">
##FILTER=<ID=weak_evidence,Description="Mutation does not meet likelihood threshold">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=FilterMutectCalls,CommandLine="FilterMutectCalls --output result/2_mutect2/llc151721nmlsr_filtered.vcf.gz --stats merged_lsr.stats --variant result/2_mutect2/llc151721nmlsr.vcf.gz --reference data/ighh/IgHHuman_smu.fasta --threshold-strategy OPTIMAL_F_SCORE --f-score-beta 1.0 --false-discovery-rate 0.05 --initial-threshold 0.1 --mitochondria-mode false --max-events-in-region 2 --max-alt-allele-count 1 --unique-alt-read-count 0 --min-median-mapping-quality 30 --min-median-base-quality 20 --max-median-fragment-length-difference 10000 --min-median-read-position 1 --max-n-ratio Infinity --min-reads-per-strand 0 --min-allele-fraction 0.0 --contamination-estimate 0.0 --log-snv-prior -13.815510557964275 --log-indel-prior -16.11809565095832 --log-artifact-prior -2.302585092994046 --normal-p-value-threshold 0.001 --min-slippage-length 8 --pcr-slippage-rate 0.1 --distance-on-haplotype 100 --long-indel-length 5 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays  --disable-tool-default-read-filters false",Version="4.2.0.0",Date="March 19, 2021 2:03:28 PM CET">
##GATKCommandLine=<ID=Mutect2,CommandLine="Mutect2 --panel-of-normals /scratch/jpollet/LSR/result/2_mutect2/pon_lsr.vcf.gz --genotype-germline-sites true --output result/2_mutect2/llc151721nmlsr.vcf.gz --input result/1_bwa/llc151721nmlsr_bbtrim.sorted.rg.bam --reference data/ighh/IgHHuman_smu.fasta --f1r2-median-mq 50 --f1r2-min-bq 20 --f1r2-max-depth 200 --genotype-pon-sites false --af-of-alleles-not-in-resource -1.0 --mitochondria-mode false --tumor-lod-to-emit 3.0 --initial-tumor-lod 2.0 --pcr-snv-qual 40 --pcr-indel-qual 40 --max-population-af 0.01 --downsampling-stride 1 --callable-depth 10 --max-suspicious-reads-per-alignment-start 0 --normal-lod 2.2 --ignore-itr-artifacts false --gvcf-lod-band -2.5 --gvcf-lod-band -2.0 --gvcf-lod-band -1.5 --gvcf-lod-band -1.0 --gvcf-lod-band -0.5 --gvcf-lod-band 0.0 --gvcf-lod-band 0.5 --gvcf-lod-band 1.0 --minimum-allele-fraction 0.0 --independent-mates false --disable-adaptive-pruning false --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --max-num-haplotypes-in-population 128 --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --pruning-seeding-lod-threshold 9.210340371976184 --max-unpruned-variants 100 --linked-de-bruijn-graph false --disable-artificial-haplotype-recovery false --enable-legacy-graph-cycle-detection false --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --num-matching-bases-in-dangling-end-to-recover -1 --error-correction-log-odds -Infinity --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --base-quality-score-threshold 18 --dragstr-het-hom-ratio 2 --dont-use-dragstr-pair-hmm-scores false --pair-hmm-gap-continuation-penalty 10 --expected-mismatch-rate-for-read-disqualification 0.02 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --disable-symmetric-hmm-normalizing false --disable-cap-base-qualities-to-map-quality false --enable-dynamic-read-disqualification-for-genotyping false --dynamic-read-disqualification-threshold 1.0 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --min-base-quality-score 10 --smith-waterman JAVA --emit-ref-confidence NONE --max-mnp-distance 1 --force-call-filtered-alleles false --soft-clip-low-quality-ends false --allele-informative-reads-overlap-margin 2 --min-assembly-region-size 50 --max-assembly-region-size 300 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --assembly-region-padding 100 --padding-around-indels 75 --padding-around-snps 20 --padding-around-strs 75 --max-extension-into-assembly-region-padding-legacy 25 --max-reads-per-alignment-start 50 --enable-legacy-assembly-region-trimming false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays  --disable-tool-default-read-filters false --max-read-length 2147483647 --min-read-length 30 --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false",Version="4.2.0.0",Date="March 19, 2021 1:59:18 PM CET">
##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Filter status for each allele, as assessed by ApplyVQSR. Note that the VCF filter field will reflect the most lenient/sensitive status across all alleles.">
##INFO=<ID=AS_SB_TABLE,Number=1,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests. Includes the reference and alleles separated by |.">
##INFO=<ID=AS_UNIQ_ALT_READ_COUNT,Number=A,Type=Integer,Description="Number of reads with unique start and mate end positions for each alt at a variant site">
##INFO=<ID=CONTQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to contamination">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=ECNT,Number=1,Type=Integer,Description="Number of events in this haplotype">
##INFO=<ID=GERMQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not germline variants">
##INFO=<ID=MBQ,Number=R,Type=Integer,Description="median base quality">
##INFO=<ID=MFRL,Number=R,Type=Integer,Description="median fragment length">
##INFO=<ID=MMQ,Number=R,Type=Integer,Description="median mapping quality">
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="median distance from end of read">
##INFO=<ID=NALOD,Number=A,Type=Float,Description="Negative log 10 odds of artifact in normal with same allele fraction as tumor">
##INFO=<ID=NCount,Number=1,Type=Integer,Description="Count of N bases in the pileup">
##INFO=<ID=NLOD,Number=A,Type=Float,Description="Normal log 10 likelihood ratio of diploid het or hom alt genotypes">
##INFO=<ID=OCM,Number=1,Type=Integer,Description="Number of alt reads whose original alignment doesn't match the current contig.">
##INFO=<ID=PON,Number=0,Type=Flag,Description="site found in panel of normals">
##INFO=<ID=POPAF,Number=A,Type=Float,Description="negative log 10 population allele frequencies of alt alleles">
##INFO=<ID=ROQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to read orientation artifact">
##INFO=<ID=RPA,Number=R,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
##INFO=<ID=SEQQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not sequencing errors">
##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat">
##INFO=<ID=STRANDQ,Number=1,Type=Integer,Description="Phred-scaled quality of strand bias artifact">
##INFO=<ID=STRQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles in STRs are not polymerase slippage errors">
##INFO=<ID=TLOD,Number=A,Type=Float,Description="Log 10 likelihood ratio score of variant existing versus not existing">
##MutectVersion=2.2
##contig=<ID=smu,length=4529>
##filtering_status=These calls have been filtered by FilterMutectCalls to label false positives with a list of failed filters and true positives with PASS.
##source=FilterMutectCalls
##source=Mutect2
##tumor_sample=llc151721nmlsr
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  llc151721nmlsr
smu 2686    .   C   T   .   clustered_events;germline   AS_FilterStatus=SITE;AS_SB_TABLE=51,52|54,45;DP=249;ECNT=3;GERMQ=5;MBQ=24,23;MFRL=0,0;MMQ=60,60;MPOS=20;POPAF=7.30;TLOD=188.78  GT:AD:AF:DP:F1R2:F2R1:SB    0/1:103,99:0.483:202:51,2:47,53:51,52,54,45

seg file :

ID  chrom   loc.start   loc.end num.mark    seg.mean
llc151721nmlsr  smu 267 4529    12  -0.131475

cnr file :

chromosome  start   end gene    depth   log2    weight
smu 266 532 SMU 253.376 0.747959    0.190888
smu 532 799 SMU 181.64  0.169248    0.0342295
smu 799 1065    SMU 26.6992 -1.17278    0.0001
smu 1598    1864    SMU 15.3346 2.51879 0.0119792
smu 1864    2131    SMU 3.58052 -3.18721    0.0217088
smu 2397    2664    SMU 169.476 -0.169248   0.0001
smu 2930    3196    SMU 299.229 3.34499 0.156584
smu 3196    3463    SMU 67.7154 3.87191 0.0075058
smu 3463    3729    SMU 0.195489    -6.19039    0.0976755
smu 3729    3996    SMU 1.22472 -1.94301    0.0001
smu 3996    4262    SMU 3133.14 -1.19368    0.235508
smu 4262    4529    SMU 2858.29 0.350737    0.257705

Thanks in advance for your help.

lima1 commented 3 years ago

Hi,

non-human data is not really tested and the data you have unfortunately does not look like it can work. PureCN assumes a complete genome with multiple somatic copy number alterations. These SCNAs also need to cover multiple germline SNPs.

The amount of data in your log file is about 1-2 orders of magnitude lower than what PureCN expects.

Best, Markus