Closed paulhtyang closed 2 years ago
Hi Paul,
PureCN is mainly developed for high coverage data > 100X typical in diagnostic panel sequencing. 20X is probably not sufficient. I don't know why the normal coverage is 1X - seems like you are using the normal database, and not --normal, for coverage normalization. You can try using it without GC-normalization (using *coverage.txt.gz instead of coverage_loess.txt.gz for everything), but not sure that will make it run through.
Markus
Good morning Markus,
Thanks do much and looking forward to your precious comments, Paul
Hi Paul,
can you say a bit more about your setup? The pool of normal is essentially used to learn the assay. If your normals are sequenced using different assays, that won't work.
The command lines of NormalDB.R and PureCN.R and an explanation of the input files would be helpful.
Markus
Hello Markus,
Thanks so much for your reply. Yes, the tumor BAM file was generated by sureselectXT_600Genes (Caris targeting sequencing) whiles the panel of normal BAM files (whole exome sequencing, WES) selected from from 1000 genome projects and other normal BAM files. Thus, I used the BED file (Twist_Exome_Target_hg38.bed) to unify or find the common genomic regions for variant calling and CNV detections between tumor (targeted) and normal (WES) samples. Please see the command lines (below) based on the best practice of PurCN document. Feel free to let me know if you need more information to resolve my issue. Thanks again and looking forward to next comments.
Yes, that unfortunately does not work. Not familiar with the Caris panel, but you can try reaching out to the vendor. They are often helpful and might be able to share a few normal BAM files. It's not ideal, especially if there are significant differences like read lengths, but often good enough. Push the lab to sequence a few normal samples (2-5) in the next 1-3 batches and you're good. See the FAQ for questions about normal samples.
I'm slowly working on using tumors and a very small number of normals for normalization, but no timeline yet.
Describe the issue I am currently running the PureCN 2.3.0 with an input VCF file from Mutect2 (GATK v4.2.6.1), and the job was halted due to an error message below.
To Reproduce Copy and paste your complete command line arguments from PureCN.R. If possible and potentially relevant, also copy the output of NormalDB.R and Coverage.R.
Rscript /data/anaconda3/envs/DNA-seq/lib/R/library/PureCN/extdata/PureCN.R --out /data/projects/BC_Tan/GDC/PureCN/output --tumor /data/projects/BC_Tan/GDC/PureCN/TN21-146507_fastqtosam.interleaved_coverage_loess.txt.gz --sampleid TN21-146507 --vcf /data/projects/BC_Tan/GDC/TN21-146507.realignment-filtered.vcf --normaldb /data/projects/BC_Tan/GDC/normal/normalDB_agilent_v6_hg38.rds --intervals /data/projects/BC_Tan/GDC/PureCN/baits_hg38_intervals.txt --mapping-bias-file /data/projects/BC_Tan/GDC/normal/mapping_bias_agilent_v6_hg38.rds --fun-segmentation PSCBS --genome hg38 --force --post-optimize --seed 123
Expected behavior A clear and concise description of what you expected to happen.
Log file Please copy and paste the log file (Sampleid.log) of a representative example [1] "/data/projects/BC_Tan/GDC/PureCN" INFO [2022-05-11 10:25:19] Loading PureCN 2.3.0... INFO [2022-05-11 10:25:25] Allosome coverage very low, cannot determine sex. INFO [2022-05-11 10:25:25] Sample sex: NA INFO [2022-05-11 10:25:26] ------------------------------------------------------------ INFO [2022-05-11 10:25:26] PureCN 2.3.0 INFO [2022-05-11 10:25:26] ------------------------------------------------------------ INFO [2022-05-11 10:25:26] Arguments: -tumor.coverage.file /data/projects/BC_Tan/GDC/PureCN/TN21-146507_fastqtosam.interleaved_coverage_l oess.txt.gz -log.ratio -seg.file -vcf.file /data/projects/BC_Tan/GDC/TN21-146507.realignment-filtered.vcf -genome hg38 -sex ? -args.set PriorVcf 6 -args.setMappingBiasVcf /data/projects/BC_Tan/GDC/normal/mapping_bias_agilent_v6_hg38.rds -args.filterIntervals 100,0.05 -args .segmentation 0.005,NULL, -sampleid TN21-146507 -min.ploidy 1.4 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ra tio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file /data/projects/BC_Tan/GDC/PureCN/baits_hg38_intervals.txt -min.lo gr.sdev 0.15 -max.segments 300 -plot.cnv TRUE -vcf.field.prefix PureCN. -cosmic.vcf.file -DB.info.flag DB -POPAF.info.field POP_AF -Cosm ic.CNT.info.field Cosmic.CNT -model beta -post.optimize TRUE -BPPARAM -log.file /data/projects/BC_Tan/GDC/PureCN/output.log -normal.cove rage.file -normalDB -args.filterVcf -fun.segmentation -test.num.copy -test.purity -speedup.heur istics INFO [2022-05-11 10:25:26] Loading coverage files... INFO [2022-05-11 10:25:30] Mean target coverages: 20X (tumor) 1X (normal). WARN [2022-05-11 10:25:30] Large difference in coverage of tumor and normal. WARN [2022-05-11 10:25:31] Large potential mis-calibration of on- and off-target log2 ratios: 0.67 INFO [2022-05-11 10:25:31] Allosome coverage very low, cannot determine sex. INFO [2022-05-11 10:25:31] Mean coverages: chrX: 1.42, chrY: 0.03, chr1-22: 1.10. WARN [2022-05-11 10:25:31] Sex tumor/normal mismatch: tumor = INFO [2022-05-11 10:25:46] Removing 138713 intervals with missing log.ratio. INFO [2022-05-11 10:25:46] Removing 63 low/high GC targets. INFO [2022-05-11 10:25:47] Removing 8735 intervals excluded in normalDB. INFO [2022-05-11 10:25:47] Removing 60282 intervals with low total coverage in normal (< 150.00 reads). INFO [2022-05-11 10:25:47] normalDB provided. Setting minimum coverage for segmentation to 0.0015X. INFO [2022-05-11 10:25:47] Removing 5800 low count (< 100 total reads) intervals. INFO [2022-05-11 10:25:47] Removing 862 low coverage (< 0.0015X) intervals. INFO [2022-05-11 10:25:47] Using 14329 intervals (0 on-target, 14329 off-target). INFO [2022-05-11 10:25:47] Ratio of mean on-target vs. off-target read counts: NaN INFO [2022-05-11 10:25:47] Mean off-target bin size: 129092 INFO [2022-05-11 10:25:47] Loading VCF... INFO [2022-05-11 10:25:48] Found 3249 variants in VCF file. INFO [2022-05-11 10:25:48] Maximum of POPAP INFO is > 1, assuming -log10 scaled values WARN [2022-05-11 10:25:48] vcf.file has no DB info field for membership in germline databases. Found and used valid population allele fre quency > 0.001000 instead. INFO [2022-05-11 10:25:49] 0 (0.0%) variants annotated as likely germline (DB INFO flag). FATAL [2022-05-11 10:25:49] VCF either contains no germline variants or variants are not properly
FATAL [2022-05-11 10:25:49] annotated.
FATAL [2022-05-11 10:25:49]
FATAL [2022-05-11 10:25:49] This is most likely a user error due to invalid input data or
FATAL [2022-05-11 10:25:49] parameters (PureCN 2.3.0).
Error: VCF either contains no germline variants or variants are not properly annotated.
This is most likely a user error due to invalid input data or parameters (PureCN 2.3.0). In addition: Warning messages: 1: In (function (fmt, ...) : 3 arguments not used by format 'Sex tumor/normal mismatch: tumor = ' 2: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames Execution halted
B-allele frequency plot Please take a screenshot of the B-allele frequency plot of the maximum likelihood solution (Sampleid.pdf). NA
Session Info Please start R, type sessionInfo() and paste the output. R version 4.1.3 (2022-03-10) Platform: x86_64-conda-linux-gnu (64-bit) Running under: Ubuntu 14.04.5 LTS
Matrix products: default BLAS/LAPACK: /data/anaconda3/envs/DNA-seq/lib/libopenblasp-r0.3.20.so
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached): [1] compiler_4.1.3