user error due to invalid input data or parameters

paulhtyang commented 2 years ago

Describe the issue I am currently running the PureCN 2.3.0 with an input VCF file from Mutect2 (GATK v4.2.6.1), and the job was halted due to an error message below.

To Reproduce Copy and paste your complete command line arguments from PureCN.R. If possible and potentially relevant, also copy the output of NormalDB.R and Coverage.R.

Rscript /data/anaconda3/envs/DNA-seq/lib/R/library/PureCN/extdata/PureCN.R --out /data/projects/BC_Tan/GDC/PureCN/output --tumor /data/projects/BC_Tan/GDC/PureCN/TN21-146507_fastqtosam.interleaved_coverage_loess.txt.gz --sampleid TN21-146507 --vcf /data/projects/BC_Tan/GDC/TN21-146507.realignment-filtered.vcf --normaldb /data/projects/BC_Tan/GDC/normal/normalDB_agilent_v6_hg38.rds --intervals /data/projects/BC_Tan/GDC/PureCN/baits_hg38_intervals.txt --mapping-bias-file /data/projects/BC_Tan/GDC/normal/mapping_bias_agilent_v6_hg38.rds --fun-segmentation PSCBS --genome hg38 --force --post-optimize --seed 123

Expected behavior A clear and concise description of what you expected to happen.

Log file Please copy and paste the log file (Sampleid.log) of a representative example [1] "/data/projects/BC_Tan/GDC/PureCN" INFO [2022-05-11 10:25:19] Loading PureCN 2.3.0... INFO [2022-05-11 10:25:25] Allosome coverage very low, cannot determine sex. INFO [2022-05-11 10:25:25] Sample sex: NA INFO [2022-05-11 10:25:26] ------------------------------------------------------------ INFO [2022-05-11 10:25:26] PureCN 2.3.0 INFO [2022-05-11 10:25:26] ------------------------------------------------------------ INFO [2022-05-11 10:25:26] Arguments: -tumor.coverage.file /data/projects/BC_Tan/GDC/PureCN/TN21-146507_fastqtosam.interleaved_coverage_l oess.txt.gz -log.ratio -seg.file -vcf.file /data/projects/BC_Tan/GDC/TN21-146507.realignment-filtered.vcf -genome hg38 -sex ? -args.set PriorVcf 6 -args.setMappingBiasVcf /data/projects/BC_Tan/GDC/normal/mapping_bias_agilent_v6_hg38.rds -args.filterIntervals 100,0.05 -args .segmentation 0.005,NULL, -sampleid TN21-146507 -min.ploidy 1.4 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ra tio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file /data/projects/BC_Tan/GDC/PureCN/baits_hg38_intervals.txt -min.lo gr.sdev 0.15 -max.segments 300 -plot.cnv TRUE -vcf.field.prefix PureCN. -cosmic.vcf.file -DB.info.flag DB -POPAF.info.field POP_AF -Cosm ic.CNT.info.field Cosmic.CNT -model beta -post.optimize TRUE -BPPARAM -log.file /data/projects/BC_Tan/GDC/PureCN/output.log -normal.cove rage.file -normalDB -args.filterVcf -fun.segmentation -test.num.copy -test.purity -speedup.heur istics INFO [2022-05-11 10:25:26] Loading coverage files... INFO [2022-05-11 10:25:30] Mean target coverages: 20X (tumor) 1X (normal). WARN [2022-05-11 10:25:30] Large difference in coverage of tumor and normal. WARN [2022-05-11 10:25:31] Large potential mis-calibration of on- and off-target log2 ratios: 0.67 INFO [2022-05-11 10:25:31] Allosome coverage very low, cannot determine sex. INFO [2022-05-11 10:25:31] Mean coverages: chrX: 1.42, chrY: 0.03, chr1-22: 1.10. WARN [2022-05-11 10:25:31] Sex tumor/normal mismatch: tumor = INFO [2022-05-11 10:25:46] Removing 138713 intervals with missing log.ratio. INFO [2022-05-11 10:25:46] Removing 63 low/high GC targets. INFO [2022-05-11 10:25:47] Removing 8735 intervals excluded in normalDB. INFO [2022-05-11 10:25:47] Removing 60282 intervals with low total coverage in normal (< 150.00 reads). INFO [2022-05-11 10:25:47] normalDB provided. Setting minimum coverage for segmentation to 0.0015X. INFO [2022-05-11 10:25:47] Removing 5800 low count (< 100 total reads) intervals. INFO [2022-05-11 10:25:47] Removing 862 low coverage (< 0.0015X) intervals. INFO [2022-05-11 10:25:47] Using 14329 intervals (0 on-target, 14329 off-target). INFO [2022-05-11 10:25:47] Ratio of mean on-target vs. off-target read counts: NaN INFO [2022-05-11 10:25:47] Mean off-target bin size: 129092 INFO [2022-05-11 10:25:47] Loading VCF... INFO [2022-05-11 10:25:48] Found 3249 variants in VCF file. INFO [2022-05-11 10:25:48] Maximum of POPAP INFO is > 1, assuming -log10 scaled values WARN [2022-05-11 10:25:48] vcf.file has no DB info field for membership in germline databases. Found and used valid population allele fre quency > 0.001000 instead. INFO [2022-05-11 10:25:49] 0 (0.0%) variants annotated as likely germline (DB INFO flag). FATAL [2022-05-11 10:25:49] VCF either contains no germline variants or variants are not properly

FATAL [2022-05-11 10:25:49] annotated.

FATAL [2022-05-11 10:25:49]

FATAL [2022-05-11 10:25:49] This is most likely a user error due to invalid input data or

FATAL [2022-05-11 10:25:49] parameters (PureCN 2.3.0).

Error: VCF either contains no germline variants or variants are not properly annotated.

This is most likely a user error due to invalid input data or parameters (PureCN 2.3.0). In addition: Warning messages: 1: In (function (fmt, ...) : 3 arguments not used by format 'Sex tumor/normal mismatch: tumor = ' 2: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames Execution halted

B-allele frequency plot Please take a screenshot of the B-allele frequency plot of the maximum likelihood solution (Sampleid.pdf). NA

Session Info Please start R, type sessionInfo() and paste the output. R version 4.1.3 (2022-03-10) Platform: x86_64-conda-linux-gnu (64-bit) Running under: Ubuntu 14.04.5 LTS

Matrix products: default BLAS/LAPACK: /data/anaconda3/envs/DNA-seq/lib/libopenblasp-r0.3.20.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] compiler_4.1.3

lima1 commented 2 years ago

Hi Paul,

PureCN is mainly developed for high coverage data > 100X typical in diagnostic panel sequencing. 20X is probably not sufficient. I don't know why the normal coverage is 1X - seems like you are using the normal database, and not --normal, for coverage normalization. You can try using it without GC-normalization (using *coverage.txt.gz instead of coverage_loess.txt.gz for everything), but not sure that will make it run through.

Markus

paulhtyang commented 2 years ago

Good morning Markus,

Thanks for your prompt reply. I added more control samples from 1000 genome project, and the issue was resolved. However, I got another error message below. Noted that the tumor BAM file is from targeted sequencing, but the control samples are whole exome BAM files.

INFO [2022-05-12 23:50:19] Loading PureCN 2.3.0... INFO [2022-05-12 23:50:28] Allosome coverage very low, cannot determine sex. INFO [2022-05-12 23:50:28] Sample sex: NA INFO [2022-05-12 23:50:29] ------------------------------------------------------------ INFO [2022-05-12 23:50:29] PureCN 2.3.0 INFO [2022-05-12 23:50:29] ------------------------------------------------------------ INFO [2022-05-12 23:50:29] Arguments: -tumor.coverage.file /data/projects/BC_Tan/GDC/PureCN/TN21-146507_fastqtosam.interleaved_coverage_l oess.txt.gz -log.ratio -seg.file -vcf.file /data/projects/BC_Tan/GDC/TN21-146507.realignment-filtered.vcf -genome hg38 -sex ? -args.set PriorVcf 6 -args.setMappingBiasVcf /data/projects/BC_Tan/GDC/normal/mapping_bias_Twist_Exome_hg38.rds -args.filterIntervals 100,0.05 -arg s.segmentation 0.005,NULL, -sampleid TN21-146507 -min.ploidy 1.4 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.r atio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file /data/projects/BC_Tan/GDC/PureCN/baits_hg38_intervals.txt -min.l ogr.sdev 0.15 -max.segments 300 -plot.cnv TRUE -vcf.field.prefix PureCN. -cosmic.vcf.file -DB.info.flag DB -POPAF.info.field POP_AF -Cos mic.CNT.info.field Cosmic.CNT -model betabin -post.optimize TRUE -BPPARAM -log.file /data/projects/BC_Tan/GDC/PureCN/output.log -normal. coverage.file -normalDB -args.filterVcf -fun.segmentation -test.num.copy -test.purity -speedup. heuristics INFO [2022-05-12 23:50:29] Loading coverage files... INFO [2022-05-12 23:50:33] Mean target coverages: 21X (tumor) 8X (normal). WARN [2022-05-12 23:50:35] Large potential mis-calibration of on- and off-target log2 ratios: 1.61 INFO [2022-05-12 23:50:36] Allosome coverage very low, cannot determine sex. INFO [2022-05-12 23:50:36] Mean coverages: chrX: 3.06, chrY: 0.04, chr1-22: 6.23. WARN [2022-05-12 23:50:36] Sex tumor/normal mismatch: tumor = INFO [2022-05-12 23:50:54] Removing 137065 intervals with missing log.ratio. INFO [2022-05-12 23:50:54] Removing 85 low/high GC targets. INFO [2022-05-12 23:50:55] Removing 13305 intervals excluded in normalDB. INFO [2022-05-12 23:50:55] Removing 54926 intervals with low total coverage in normal (< 150.00 reads). INFO [2022-05-12 23:50:55] normalDB provided. Setting minimum coverage for segmentation to 0.0015X. INFO [2022-05-12 23:50:55] Removing 5824 low count (< 100 total reads) intervals. INFO [2022-05-12 23:50:55] Removing 868 low coverage (< 0.0015X) intervals. INFO [2022-05-12 23:50:55] Using 14408 intervals (6 on-target, 14402 off-target). INFO [2022-05-12 23:50:55] Ratio of mean on-target vs. off-target read counts: 2.31 INFO [2022-05-12 23:50:55] Mean off-target bin size: 129125 Error in h(simpleError(msg, call)) : Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'mean': subscript out of bounds Calls: runAbsoluteCN ... .checkGCBias -> .calcGCmetric -> mean -> .handleSimpleError -> h In addition: Warning message: In (function (fmt, ...) : 3 arguments not used by format 'Sex tumor/normal mismatch: tumor = ' Execution halted

Thanks do much and looking forward to your precious comments, Paul

lima1 commented 2 years ago

Hi Paul,

can you say a bit more about your setup? The pool of normal is essentially used to learn the assay. If your normals are sequenced using different assays, that won't work.

The command lines of NormalDB.R and PureCN.R and an explanation of the input files would be helpful.

Markus

paulhtyang commented 2 years ago

Hello Markus,

Thanks so much for your reply. Yes, the tumor BAM file was generated by sureselectXT_600Genes (Caris targeting sequencing) whiles the panel of normal BAM files (whole exome sequencing, WES) selected from from 1000 genome projects and other normal BAM files. Thus, I used the BED file (Twist_Exome_Target_hg38.bed) to unify or find the common genomic regions for variant calling and CNV detections between tumor (targeted) and normal (WES) samples. Please see the command lines (below) based on the best practice of PurCN document. Feel free to let me know if you need more information to resolve my issue. Thanks again and looking forward to next comments.

lima1 commented 2 years ago

Yes, that unfortunately does not work. Not familiar with the Caris panel, but you can try reaching out to the vendor. They are often helpful and might be able to share a few normal BAM files. It's not ideal, especially if there are significant differences like read lengths, but often good enough. Push the lab to sequence a few normal samples (2-5) in the next 1-3 batches and you're good. See the FAQ for questions about normal samples.

I'm slowly working on using tumors and a very small number of normals for normalization, but no timeline yet.

lima1 / PureCN

user error due to invalid input data or parameters #235

Thanks for your prompt reply. I added more control samples from 1000 genome project, and the issue was resolved. However, I got another error message below. Noted that the tumor BAM file is from targeted sequencing, but the control samples are whole exome BAM files.