lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0

Invalid Format Crash in runAbsoluteCN #201

Closed drmrgd closed 3 years ago

drmrgd commented 3 years ago

Hi Markus, Over the last two days I've been getting a strange error on our cluster:

INFO [2021-09-21 20:27:15] MAPD of 11095 allelic fractions: 0.04 (0.03 adjusted).
Error in (function (fmt, ...)  :
  invalid format '%f'; use format %s for character objects
Calls: runAbsoluteCN ... flog.info -> .log_level -> layout -> do.call -> <Anonymous>
In addition: Warning message:
In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
Execution halted

At first I thought it was related to the latest dev version, 1.99.31, to which I upgraded yesterday while trying to optimize some oversegmentation issues I'm having. However, I get the same error with v1.23.27, the previous version I was using, which worked OK; with v1.22.2, which is the default version our cluster maintainer has installed; and with v1.20.0, which is the default version that our cluster maintainer has installed for R version 4.0.5. I was running PureCN v1.23.27 and v1.99.31 under R v4.1.0.

Judging from the last message in the log file, I think this is failing right before the call to PSCBS in runAbsoluteCN, but I can't quite figure out the call and the offending line of code. I did attempt to run with CBS segmentation instead, in case PSCBS was the one sending the message to the logger and causing the crash, but that didn't solve it.

My guess is that there was some collateral package that was upgraded at some point, which has changed the way string formatting is working or something, but it's really hard for me to figure out. In case this serves as some kind of breadcrumb, here is the sessionInfo for the 1.99.31 attempt:

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/local/intel/compilers_and_libraries_2020.2.254/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] PureCN_1.99.31              VariantAnnotation_1.38.0
 [3] Rsamtools_2.8.0             Biostrings_2.60.2
 [5] XVector_0.32.0              SummarizedExperiment_1.22.0
 [7] Biobase_2.52.0              GenomicRanges_1.44.0
 [9] GenomeInfoDb_1.28.4         IRanges_2.26.0
[11] S4Vectors_0.30.0            MatrixGenerics_1.4.3
[13] matrixStats_0.61.0          BiocGenerics_0.38.0
[15] DNAcopy_1.66.0

loaded via a namespace (and not attached):
 [1] bitops_1.0-7             bit64_4.0.5              filelock_1.0.2
 [4] RColorBrewer_1.1-2       progress_1.2.2           httr_1.4.2
 [7] tools_4.1.0              utf8_1.2.2               R6_2.5.1
[10] DBI_1.1.1                colorspace_2.0-2         rhdf5filters_1.4.0
[13] tidyselect_1.1.1         gridExtra_2.3            prettyunits_1.1.1
[16] bit_4.0.4                curl_4.3.2               compiler_4.1.0
[19] formatR_1.11             xml2_1.3.2               DelayedArray_0.18.0
[22] rtracklayer_1.52.1       scales_1.1.1             rappdirs_0.3.3
[25] stringr_1.4.0            digest_0.6.27            pkgconfig_2.0.3
[28] dbplyr_2.1.1             fastmap_1.1.0            BSgenome_1.60.0
[31] rlang_0.4.11             rstudioapi_0.13          RSQLite_2.2.8
[34] VGAM_1.1-5               BiocIO_1.2.0             generics_0.1.0
[37] mclust_5.4.7             BiocParallel_1.26.2      dplyr_1.0.7
[40] RCurl_1.98-1.5           magrittr_2.0.1           GenomeInfoDbData_1.2.6
[43] futile.logger_1.4.3      Matrix_1.3-4             Rcpp_1.0.7
[46] munsell_0.5.0            Rhdf5lib_1.14.2          fansi_0.5.0
[49] lifecycle_1.0.0          stringi_1.7.4            yaml_2.2.1
[52] zlibbioc_1.38.0          rhdf5_2.36.0             BiocFileCache_2.0.0
[55] grid_4.1.0               blob_1.2.2               crayon_1.4.1
[58] lattice_0.20-45          splines_4.1.0            GenomicFeatures_1.44.2
[61] hms_1.1.0                KEGGREST_1.32.0          pillar_1.6.2
[64] rjson_0.2.20             biomaRt_2.48.3           futile.options_1.0.1
[67] XML_3.99-0.8             glue_1.4.2               lambda.r_1.2.4
[70] data.table_1.14.0        png_0.1-7                vctrs_0.3.8
[73] gtable_0.3.0             purrr_0.3.4              assertthat_0.2.1
[76] cachem_1.0.6             ggplot2_3.3.5            restfulr_0.0.13
[79] tibble_3.1.4             GenomicAlignments_1.28.0 AnnotationDbi_1.54.1
[82] memoise_2.0.0            ellipsis_0.3.2

Do you have a suggestion for what's throwing this error and how I might fix it? Thanks in advance!

lima1 commented 3 years ago

Hmm, so this is failing now with all versions, all segmentation functions and all samples that previously worked? That is strange... Running it without parallelization works?

drmrgd commented 3 years ago

Yeah, you got it right. I haven't tried without parallelization (I'll queue that up right now), but I sort of doubt that's it. I'm betting some other function that PureCN calls has been updated and is not happy with some string that's being passed to it as part of the logging process or something similar (it looks like the standard printf kind of complaint when a string is passed to a float format directive). I showed the PSCBS example above, but with CBS I get the same error, just with a different preceding log line:

INFO [2021-09-22 15:02:01] Interval weights found, will use weighted CBS.
INFO [2021-09-22 15:02:04] Loading pre-computed boundaries for DNAcopy...
Error in (function (fmt, ...)  :
  invalid format '%f'; use format %s for character objects
Calls: runAbsoluteCN ... flog.info -> .log_level -> layout -> do.call -> <Anonymous>
In addition: Warning message:
In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
Execution halted

lima1 commented 3 years ago

Looks like undo.SD is parsed as character. That’s the next step with logged output. You see anything wrong from your side?
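
For readers following along, the error itself can be reproduced with base R alone. This is a minimal sketch, not PureCN's actual logging call: the message text and the variable names `bad_value` and `good_value` are made up for illustration. It only shows that `sprintf()` rejects `%f` when the argument is a character string, which is the complaint that surfaces through `flog.info` in the traceback above.

```r
# Hypothetical stand-in for a parameter that arrived as text instead of a number.
bad_value <- "1.5"

# sprintf() refuses to format a character object with %f; capture the message.
msg <- tryCatch(
    sprintf("undo.SD is %f", bad_value),
    error = function(e) conditionMessage(e)
)
print(msg)

# After an explicit conversion the same format string works.
good_value <- as.numeric(bad_value)
print(sprintf("undo.SD is %f", good_value))
```

The first `print()` shows the same "invalid format '%f'" text as the log; the second prints the formatted number.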

drmrgd commented 3 years ago

Yikes! That seems to be the problem. My snakefile somehow had some non-printing chars or something in it that was inputting a bad param to undo.SD. Looks like it's past that part now and chugging along nicely! Thanks for the help! It was hard to figure out from my end what the next call was and what was choking the process up.
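
One way to spot this kind of problem is to list any characters in a parameter value that fall outside printable ASCII. The sketch below is an assumption-laden illustration: `find_odd_chars` is a hypothetical helper, and the non-breaking space in `raw_value` is an invented example, since the thread does not say which characters were actually in the snakefile.

```r
# Hypothetical helper: return the code points of characters outside
# the printable ASCII range (32 to 126) in a single string.
find_odd_chars <- function(x) {
    codes <- utf8ToInt(x)
    codes[codes < 32L | codes > 126L]
}

# Invented example value: "1.5" followed by a non-breaking space (U+00A0),
# which looks like an ordinary space in most editors.
raw_value <- "1.5\u00a0"

odd <- find_odd_chars(raw_value)
print(sprintf("U+%04X", odd))
```

An empty result means the value contains only printable ASCII characters.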

lima1 commented 3 years ago

Yeah, I will add a check for it. I thought optparse would do that for me, but it looks like it only throws a warning.
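
A check of this kind could look roughly like the following. This is only a sketch using base R: `as_strict_number` is a hypothetical helper, not part of PureCN or optparse, and it is not the check that was actually added to the package. It turns a failed numeric conversion into a hard error with a readable message instead of a warning.

```r
# Hypothetical helper: convert a command-line value to a single number
# and stop with a clear message if that is not possible.
as_strict_number <- function(value, name) {
    number <- suppressWarnings(as.numeric(value))
    if (length(number) != 1L || is.na(number)) {
        stop(sprintf("Parameter %s must be a single number, got '%s'.",
                     name, paste(value, collapse = ", ")),
             call. = FALSE)
    }
    number
}

# A well-formed value converts cleanly.
undo_sd <- as_strict_number("0.8", "undo.SD")

# A malformed value fails early, before it can reach a %f format string.
result <- tryCatch(
    as_strict_number("abc", "undo.SD"),
    error = function(e) conditionMessage(e)
)
print(result)
```

Failing at argument parsing time points directly at the bad parameter, rather than at a logging call several steps later.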

lima1 commented 3 years ago

Regarding oversegmentation, the GATK4 segmentation could be worth a try in your case: higher purity, lots of SNPs, not a lot off-target - pretty much the opposite of what I tuned PSCBS for with our panels and cfDNA.

drmrgd commented 3 years ago

Thanks Markus! Always happy when the fix is simple. And thanks for the suggestion to check out GATK4 segmentation. I'll have a look to see if it'll improve the output a bit.