lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
127 stars 32 forks source link

Error running PureCN.R #279

Closed bosmont closed 1 year ago

bosmont commented 1 year ago

Hi,

I am running PureCN.R script with my WES data. The script runs successfully for most of the samples. But it failed for a few samples with the following error:

INFO [2023-04-29 08:05:55] Loading coverage files... INFO [2023-04-29 08:05:59] Mean target coverages: 220X (tumor) 201X (normal). INFO [2023-04-29 08:06:01] Mean coverages: chrX: 174.03, chrY: 90.76, chr1-22: 197.32. INFO [2023-04-29 08:06:01] Mean coverages: chrX: 125.07, chrY: 90.86, chr1-22: 188.87. INFO [2023-04-29 08:06:17] Removing 5448 intervals with missing log.ratio. INFO [2023-04-29 08:06:17] Removing 21219 intervals excluded in normalDB. INFO [2023-04-29 08:06:17] normalDB provided. Setting minimum coverage for segmentation to 0.0015X. INFO [2023-04-29 08:06:17] Removing 34 low count (< 100 total reads) intervals. INFO [2023-04-29 08:06:17] Removing 8893 low coverage (< 0.0015X) intervals. WARN [2023-04-29 08:06:18] Low number of off-target intervals. You might want to exclude them (217925 on-target, 18160 off-target, ratio 0.08). INFO [2023-04-29 08:06:18] Using 236085 intervals (217925 on-target, 18160 off-target). INFO [2023-04-29 08:06:18] Ratio of mean on-target vs. off-target read counts: 0.23 INFO [2023-04-29 08:06:18] Mean off-target bin size: 107712 INFO [2023-04-29 08:06:18] AT/GC dropout: 1.03 (tumor), 1.04 (normal), 0.98 (coverage log-ratio). INFO [2023-04-29 08:06:18] Loading VCF... INFO [2023-04-29 08:06:26] Found 151623 variants in VCF file. INFO [2023-04-29 08:06:27] Removing 1640 triallelic sites. INFO [2023-04-29 08:06:27] 142385 (94.9%) variants annotated as likely germline (DB INFO flag). WARN [2023-04-29 08:06:31] Found 6 variants with missing allelic fraction starting with chr1:31579040_G/GTTTAATTTCAGTGAGAGCTGGTGTGTCTATGTGTGTGTGTGTGTGATGTCTTCAGTTATAA. Removing them. INFO [2023-04-29 08:06:31] SL250767_st_t is tumor in VCF file. INFO [2023-04-29 08:06:31] 1853 homozygous and 528 heterozygous variants on chrX. INFO [2023-04-29 08:06:31] Sex from VCF: M (Fisher's p-value: < 0.0001, odds-ratio: 7.52). INFO [2023-04-29 08:06:31] Detected MuTect2 VCF. INFO [2023-04-29 08:06:32] Removing 0 Mutect2 calls due to blacklisted failure reasons. INFO [2023-04-29 08:06:33] Removing 0 low quality variants with non-offset BQ < 25. INFO [2023-04-29 08:06:33] Base quality scores range from 29 to 49 (offset by 1) INFO [2023-04-29 08:06:33] High median base quality score (49). UMI-barcoded data? INFO [2023-04-29 08:06:33] Minimum number of supporting reads ranges from 2 to 17, depending on coverage and BQS. INFO [2023-04-29 08:06:53] Initial testing for significant sample cross-contamination: maybe INFO [2023-04-29 08:06:54] Removing 45242 variants with AF < 0.030 or AF >= 0.970 or insufficient supporting reads or depth < 15. INFO [2023-04-29 08:06:54] Total size of targeted genomic region: 66.31Mb (81.86Mb with 50bp padding). INFO [2023-04-29 08:06:55] 25.3% of targets contain variants. INFO [2023-04-29 08:06:55] Removing 33104 variants outside intervals. INFO [2023-04-29 08:06:55] Setting somatic prior probabilities for dbSNP hits to 0.000500 or to 0.500000 otherwise. INFO [2023-04-29 08:06:55] Loading mapping bias file mapping_bias_hg38.NIM.rds... INFO [2023-04-29 08:06:57] Found 1095970 variants in mapping bias file. INFO [2023-04-29 08:07:19] Imputing mapping bias for 7262 variants... INFO [2023-04-29 08:09:32] Excluding 10559 novel or poor quality variants from segmentation. INFO [2023-04-29 08:09:33] Excluding 7262 variants not in pool of normals from segmentation. INFO [2023-04-29 08:09:33] Sample sex: M INFO [2023-04-29 08:09:33] Segmenting data... INFO [2023-04-29 08:09:33] Interval weights found, will use weighted PSCBS. INFO [2023-04-29 08:09:33] MAPD of 44346 allelic fractions: 0.06 (0.07 adjusted). INFO [2023-04-29 08:09:34] Setting undo.SD parameter to 0.750000. INFO [2023-04-29 08:09:34] On-target much cleaner than off-target, finding on-target breakpoints first... INFO [2023-04-29 08:09:34] Using 154725 high quality (out of 226471) on-target intervals for initial breakpoint calculation. INFO [2023-04-29 08:10:31] Setting prune.hclust.h parameter to 0.200000. INFO [2023-04-29 08:12:09] Found 1152 segments, exceeding max.segments threshold of 300. INFO [2023-04-29 08:12:09] Setting undo.SD parameter to 1.125000. INFO [2023-04-29 08:12:09] On-target much cleaner than off-target, finding on-target breakpoints first... INFO [2023-04-29 08:12:09] Using 154725 high quality (out of 226471) on-target intervals for initial breakpoint calculation. INFO [2023-04-29 08:12:59] Setting prune.hclust.h parameter to 0.200000. Error: ‘nbrOfTCNsAfter >= nbrOfTCNsBefore’ is not TRUE but Execution halted

Any suggestions what I should do? Thanks so much for your help.

lima1 commented 1 year ago

Can you try with --min-fraction-offtarget 1? Your off target information is noisy and makes it worse. The error happens in PSCBS, seen it before but can’t remember causes. Probably excessive noise.

What is the range of log ratio standard deviations you are getting? FFPE? Old samples?

bosmont commented 1 year ago

Thanks a lot for your help. Below is the distribution of the standard deviations of gene.mean across samples: Min. 1st Qu. Median Mean 3rd Qu. Max. 0.1147 0.2279 0.2939 0.3021 0.3798 0.5470

Does it look bad?

I tried to add the option "--min-fraction-offtarget 1" in PureCN.R command, and it works! But do I need to re-run the all the samples with this option in order to make the results consistent across samples?

Thanks.

lima1 commented 1 year ago

I meant the standard deviations reported in the log file, gene.mean might be a bit better than that?

INFO [2022-08-04 15:00:21] Mean standard deviation of log-ratios: 0.25 (MAPD: 0.19)

A mean log-ratio SD of 0.3 would be ok for WES FFPE. For higher quality WES (>300X and fresh frozen), it should be 0.2 or lower. Depends also on the number of normal samples and batches.

Yes, I would toss the off-target reads. Either re-run IntervalFIle.R without --offtarget, or specify that argument to ignore them from the coverage files.