error with vcf from gatk mutect2 + FilterMutectCalls (v4.2.0.0) on custom reference sequence #167

I'm trying to use PureCN 1.20.0 from conda (I also installed the missing package r-optparse from conda). My VCF file is the output of gatk FilterMutectCalls run on the output of gatk Mutect2 (--genotype-germline-sites true). I skip the NormalDB step because the panel of normals (PoN) from Mutect2 is empty.

I use the recommended CNVkit usage, without --snpblacklist and --mappingbiasfile because I am working with a custom reference sequence:

# Export the segmentation in DNAcopy format
cnvkit.py export seg $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.cns \
    -o $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.seg

# Run PureCN by providing the *.cnr and *.seg files 
Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID  \
    --sampleid $SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.cnr \
    --segfile $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.seg \
    --vcf ${SAMPLEID}_mutect.vcf \
    --statsfile ${SAMPLEID}_mutect_stats.txt \
    --genome hg38 \
    --funsegmentation Hclust \
    --force --postoptimize --seed 123

But I get this error:

INFO [2021-03-19 19:08:43] ------------------------------------------------------------
INFO [2021-03-19 19:08:43] PureCN 1.20.0
INFO [2021-03-19 19:08:43] ------------------------------------------------------------
INFO [2021-03-19 19:08:43] Arguments: -normal.coverage.file  -tumor.coverage.file result/3_cnvkit/llc151721nmlsr.cnr -log.ratio  -seg.file result/3_cnvkit/llc151721nmlsr.seg -vcf.file result/2_mutect2/llc151721nmlsr_filtered.vcf -normalDB  -genome hg38 -sex ? -args.setPriorVcf 6 -args.setMappingBiasVcf NULL -args.segmentation 0.005,NULL -sampleid llc151721nmlsr -min.ploidy 1.4 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file  -max.segments 300 -plot.cnv TRUE -vcf.field.prefix PureCN. DB POP_AF -model beta -post.optimize TRUE -BPPARAM  -log.file result/4_purecn/llc151721nmlsr.log -args.filterVcf <data> -fun.segmentation <data> -test.num.copy <data> -test.purity <data> -speedup.heuristics <data>
INFO [2021-03-19 19:08:43] Loading coverage files...
INFO [2021-03-19 19:08:45] Provided log2-ratio looks too noisy, using segmentation only.
WARN [2021-03-19 19:08:45] Expecting numeric chromosome names in seg.file, assuming file is properly sorted.
WARN [2021-03-19 19:08:45] Allosome coverage missing, cannot determine sex.
WARN [2021-03-19 19:08:45] Allosome coverage missing, cannot determine sex.
INFO [2021-03-19 19:08:45] Using 12 intervals (12 on-target, 0 off-target).
INFO [2021-03-19 19:08:45] No off-target intervals. If this is hybrid-capture data, consider adding them.
INFO [2021-03-19 19:08:45] Loading VCF...
INFO [2021-03-19 19:08:48] Found 39 variants in VCF file.
INFO [2021-03-19 19:08:48] Removing 1 triallelic sites.
INFO [2021-03-19 19:08:49] Maximum of POPAP INFO is > 1, assuming -log10 scaled values
WARN [2021-03-19 19:08:49] vcf.file has no DB info field for membership in germline databases. Found and used valid population allele frequency > 0.001000 instead.
INFO [2021-03-19 19:08:49] 0 (0.0%) variants annotated as likely germline (DB INFO flag).
FATAL [2021-03-19 19:08:49] VCF either contains no germline variants or variants are not properly 
FATAL [2021-03-19 19:08:49] annotated. 
FATAL [2021-03-19 19:08:49]  
FATAL [2021-03-19 19:08:49] This is most likely a user error due to invalid input data or 
FATAL [2021-03-19 19:08:49] parameters (PureCN 1.20.0). 

Header + one record of the VCF from gatk FilterMutectCalls (v4.2.0.0), annotated as germline in the FILTER field:

smu 2686    .   C   T   .   clustered_events;germline   AS_FilterStatus=SITE;AS_SB_TABLE=51,52|54,45;DP=249;ECNT=3;GERMQ=5;MBQ=24,23;MFRL=0,0;MMQ=60,60;MPOS=20;POPAF=7.30;TLOD=188.78  GT:AD:AF:DP:F1R2:F2R1:SB    0/1:103,99:0.483:202:51,2:47,53:51,52,54,45

seg file:

ID  chrom   loc.start   loc.end num.mark    seg.mean
llc151721nmlsr  smu 267 4529    12  -0.131475

cnr file:

chromosome  start   end gene    depth   log2    weight
smu 266 532 SMU 253.376 0.747959    0.190888
smu 532 799 SMU 181.64  0.169248    0.0342295
smu 799 1065    SMU 26.6992 -1.17278    0.0001
smu 1598    1864    SMU 15.3346 2.51879 0.0119792
smu 1864    2131    SMU 3.58052 -3.18721    0.0217088
smu 2397    2664    SMU 169.476 -0.169248   0.0001
smu 2930    3196    SMU 299.229 3.34499 0.156584
smu 3196    3463    SMU 67.7154 3.87191 0.0075058
smu 3463    3729    SMU 0.195489    -6.19039    0.0976755
smu 3729    3996    SMU 1.22472 -1.94301    0.0001
smu 3996    4262    SMU 3133.14 -1.19368    0.235508
smu 4262    4529    SMU 2858.29 0.350737    0.257705

Thanks in advance for your help.

Non-human data is not really tested, and unfortunately the data you have does not look like it can work. PureCN assumes a complete genome with multiple somatic copy number alterations (SCNAs). These SCNAs also need to cover multiple germline SNPs.
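
On the germline annotation itself: judging from the log, the POPAF values are treated as -log10 scaled, and a population allele frequency > 0.001 is used in place of the missing DB flag. The record you posted has POPAF=7.30, which is far below that cutoff once converted. A minimal sketch of the arithmetic, assuming this reading of the log is right; the function names are hypothetical and this is not PureCN's actual code:

```shell
# Hypothetical helpers, not part of PureCN or GATK.

# Convert a -log10 scaled POPAF value to a linear allele frequency.
popaf_to_af() {
    awk -v p="$1" 'BEGIN { printf "%.3g\n", 10 ^ (-p) }'
}

# Succeed (exit status 0) if the linear frequency exceeds the threshold.
# The default of 0.001 is the cutoff mentioned in the log.
looks_germline() {
    awk -v p="$1" -v t="${2:-0.001}" 'BEGIN { exit !(10 ^ (-p) > t) }'
}

# POPAF value taken from the posted VCF record.
popaf_to_af 7.30
if looks_germline 7.30; then
    echo "above the 0.001 cutoff"
else
    echo "below the 0.001 cutoff"
fi
```

If every variant in the VCF carries a similarly high POPAF value, none of them would pass this cutoff, which would be consistent with the "0 (0.0%) variants annotated as likely germline" line in the log.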

The amount of data in your log file is about 1-2 orders of magnitude lower than what PureCN expects.
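
A quick way to gauge the amount of data before running PureCN is to count the intervals in the .cnr file and the variant records in the VCF, which correspond to the "Using 12 intervals" and "Found 39 variants" lines in your log. A small sketch; the function names are hypothetical and the file names in the usage comment are placeholders:

```shell
# Hypothetical helpers for a quick size check of the PureCN inputs.

# Number of intervals in a CNVkit .cnr file:
# every non-empty line after the header line.
count_cnr_intervals() {
    awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}

# Number of variant records in an uncompressed VCF:
# every non-empty line that does not start with '#'.
count_vcf_records() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# Usage (placeholder file names):
#   count_cnr_intervals sample_cnvkit.cnr
#   count_vcf_records sample_filtered.vcf
```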

Best, Markus