girke-lab / ChemmineR

Cheminformatics Toolkit for R
13 stars 7 forks source link

`validSDF()` doesn't check header block #16

Closed Aariq closed 1 year ago

Aariq commented 1 year ago

I noticed that validSDF() doesn't check the header block (which is maybe why the error I described in #14 is getting through past read.SDFset()). Here's a reprex and suggestion.

library(ChemmineR)

mol_good <- tempfile(fileext = ".mol")
mol_bad <- tempfile(fileext = ".mol")
writeLines(c(
  "CO2",
  "",
  "",
  "3  2  0  0  0  0  0  0  0  0999 V2000",
  "21.8400  -11.9918    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
  "20.6288  -12.6940    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
  "23.0512  -12.6940    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
  "1  2  2  0     0  0",
  "1  3  2  0     0  0",
  "M  END"
), mol_good)

writeLines(c(
  "CO2",
  "",
  #missing header line
  "3  2  0  0  0  0  0  0  0  0999 V2000",
  "21.8400  -11.9918    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
  "20.6288  -12.6940    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
  "23.0512  -12.6940    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
  "1  2  2  0     0  0",
  "1  3  2  0     0  0",
  "M  END"
), mol_bad)

(sdf <- ChemmineR::read.SDFset(mol_good))
#> An instance of "SDFset" with 1 molecules
(sdf_bad <- ChemmineR::read.SDFset(mol_bad))
#> An instance of "SDFset" with 1 molecules
validSDF(sdf)
#> CMP1 
#> TRUE
validSDF(sdf_bad)
#> CMP1 
#> TRUE

propOB(sdf)
#>      cansmi cansmiNS formula title               InChI HBA1 HBA2 HBD    logP
#> CMP1  O=C=O    O=C=O     CO2   CO2 InChI=1S/CO2/c2-1-3    2    2   0 -0.5835
#>         MR      MW nF  TPSA
#> CMP1 5.007 44.0095  0 34.14
propOB(sdf_bad)
#>      cansmi cansmiNS formula title InChI HBA1 HBA2 HBD logP MR MW nF TPSA
#> CMP1                           CO2          0    0   0    0  0  0  0    0

(header_good <- ChemmineR::header(sdf))
#> $CMP1
#>                           Molecule_Name                                  Source 
#>                                   "CO2"                                      "" 
#>                                 Comment                             Counts_Line 
#>                                      "" "3  2  0  0  0  0  0  0  0  0999 V2000"
(header_bad <- ChemmineR::header(sdf_bad))
#> $CMP1
#> [1] "CO2"                                  
#> [2] ""                                     
#> [3] "3  2  0  0  0  0  0  0  0  0999 V2000"

# possible ways to validate header

all(sapply(header_good, function(x) length(x) == 4))
#> [1] TRUE
all(sapply(header_bad, function(x) length(x) == 4))
#> [1] FALSE

Created on 2023-06-30 with reprex v2.0.2

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.2.2 (2022-10-31) #> os macOS Big Sur ... 10.16 #> system x86_64, darwin17.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz America/Los_Angeles #> date 2023-06-30 #> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.2.0) #> bitops 1.0-7 2021-04-24 [1] CRAN (R 4.2.0) #> ChemmineOB * 1.36.0 2022-11-01 [1] Bioconductor #> ChemmineR * 3.49.2 2022-12-25 [1] https://girke-lab.r-universe.dev (R 4.2.2) #> cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.0) #> colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.2.0) #> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0) #> digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.0) #> dplyr 1.1.0 2023-01-29 [1] CRAN (R 4.2.0) #> DT 0.27 2023-01-17 [1] CRAN (R 4.2.0) #> evaluate 0.20 2023-01-17 [1] CRAN (R 4.2.0) #> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.0) #> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.2.0) #> fs 1.6.1 2023-02-06 [1] CRAN (R 4.2.0) #> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0) #> ggplot2 3.4.2 2023-04-03 [1] CRAN (R 4.2.0) #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0) #> gridExtra 2.3 2017-09-09 [1] CRAN (R 4.2.0) #> gtable 0.3.3 2023-03-21 [1] CRAN (R 4.2.0) #> htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.2.0) #> htmlwidgets 1.6.1 2023-01-07 [1] CRAN (R 4.2.0) #> knitr 1.42 2023-01-25 [1] CRAN (R 4.2.0) #> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0) #> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0) #> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.0) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0) #> png 0.1-8 2022-11-29 [1] CRAN (R 4.2.0) #> purrr 1.0.1 2023-01-10 [1] CRAN (R 4.2.0) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.0) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0) #> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0) #> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.2.0) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0) #> Rcpp 1.0.10 2023-01-22 [1] CRAN (R 4.2.0) #> RCurl 1.98-1.10 2023-01-27 [1] CRAN (R 4.2.0) #> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0) #> rjson 0.2.21 2022-01-09 [1] CRAN (R 4.2.0) #> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.2.0) #> rmarkdown 2.20 2023-01-19 [1] CRAN (R 4.2.0) #> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0) #> rsvg 2.4.0 2022-11-21 [1] CRAN (R 4.2.0) #> scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0) #> styler 1.9.1 2023-03-04 [1] CRAN (R 4.2.0) #> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.2.0) #> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0) #> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.2) #> vctrs 0.6.2 2023-04-19 [1] CRAN (R 4.2.0) #> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0) #> xfun 0.37 2023-01-31 [1] CRAN (R 4.2.0) #> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.0) #> zlibbioc 1.44.0 2022-11-01 [1] Bioconductor #> #> [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library #> #> ────────────────────────────────────────────────────────────────────────────── ```
tgirke commented 1 year ago

That's indeed not what validSDF checks in its current form. It only checks the atom and bond blocks (see help doc for ?validSDF) mainly to determine whether compounds are drug-like enough for structural similarity comparisons. For instance, compounds without carbons or single atom compounds are flagged because most fragment based similarity metrics are not meaningful in those cases. Your use-case is different, and could be added as a different checking mode to the same function. Please also keep in mind that SDFs are quite complex and there are a vast number of validity checks one could come up with. Happy to add this to the to-do list or accept code contributions. A more generic solution might require additional discussion.

For debugging please also note that there is a read.SDFstr function that creates an intermediate of SDFset called SDFstr. Imports into SDFstr are much more forgiving and very helpful for debugging. After creating a SDFstr you can coerce to SDFset with as(sdfstr, "SDFset"). In fact, internally read.SDFset does it this way.

khoran commented 1 year ago

This issue will be addressed in a future release. Closing for now.

Aariq commented 1 year ago

Got it! I haven't fully dug into the internals of ChemmineR, but it makes sense that there are too many ways SDFs could go wrong to catch them all. Just for further context, this was motivated by my surprise that I could read in an invalid .mol file (missing a line in the header block) and get no error until later in the workflow when I tried to do something with it (like propOB). Even then, the error isn't an R error (#14) and doesn't stop ChemmineR from returning non-sensical values.

tgirke commented 1 year ago

You mean non-sensical values for OpenBabel functions. The ChemmineR functions are not affected by this inconsistency in the header block. Only (some of) the OpenBabel functions that are provided via the ChemmineOB interface. This is certainly something we want to address too. To come up with a more generic checker utility that considers the OpenBabel restrictions, we might want to first generate a list of requirements and then draft an implementation that doesn't degrade performance (e.g. too many checks by default). Let us know if you have additional suggestions for this checker list so that we can incorporate them. To indicate problems with OpenBabel operations, one could also trigger a warning when OpenBabel functions return only zero values. In addition, one could return NA instead of zero values in cases where it is 100% clear that zero is a failed prediction rather than something meaningful.

Again thanks for reporting this issue.

Thomas

On Wed, Jul 5, 2023 at 8:03 AM Eric R. Scott @.***> wrote:

Got it! I haven't fully dug into the internals of ChemmineR, but it makes sense that there are too many ways SDFs could go wrong to catch them all. Just for further context, this was motivated by my surprise that I could read in an invalid .mol file (missing a line in the header block) and get no error until later in the workflow when I tried to do something with it (like propOB). Even then, the error isn't an R error (#14 https://github.com/girke-lab/ChemmineR/issues/14) and doesn't stop ChemmineR from returning non-sensical values.

— Reply to this email directly, view it on GitHub https://github.com/girke-lab/ChemmineR/issues/16#issuecomment-1621946875, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKGMVH3RXE5L73WEB3QYSLXOV62RANCNFSM6AAAAAAZ2LPA6Y . You are receiving this because you commented.Message ID: @.***>

-- Thomas Girke, Ph.D. Professor of Bioinformatics Director of High-Performance Computing Center (HPCC) Director of Graduate Program in Genetics, Genomics and Bioinformatics (GGB) 1207F Genomics Building University of California Riverside, CA 92521

E-mail: @.*** URL: https://girke.bioinformatics.ucr.edu Phone/Cell/Text: 951-732-7072 Fax: 951-827-4437