read_pdb fails in Linux when using a gzip compressed file in R 4.1.0

jmcastagnetto commented 3 years ago

An example of this error:

> p = read_pdb("1kys.pdb.gz")
Error in readChar(con, nchars = 6) : invalid UTF-8 input in readChar()
In addition: Warning message:
In readChar(con, nchars = 6) : truncating string with embedded nuls

Seems like the issue is in https://github.com/tylermorganwall/raymolecule/blob/159959d59c9e2ce9d479397a7c51c929b3854252/R/read_pdb.R#L25, where the PDB file is read using:

con = file(filename,"rb")

Subsequently, it fails when trying to execute:

type = trimws(readChar(con,nchars=6))

Not sure if it is an issue with the new version of R (which I've updated today), or "raymolecule" (which I've installed today).

FWIW, when trying to read the file w/o using "rb", it works:

> con = file("1kys.pdb.gz")
> readChar(con, n = 6)
[1] "HEADER"

As a reference. here is my sessionInfo() output

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.1     stringr_1.4.0     dplyr_1.0.6       purrr_0.3.4      
 [5] readr_1.4.0       tidyr_1.1.3       tibble_3.1.2      ggplot2_3.3.3    
 [9] tidyverse_1.3.1   raymolecule_0.3.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6          lubridate_1.7.10    lattice_0.20-44    
 [4] prettyunits_1.1.1   ps_1.6.0            assertthat_0.2.1   
 [7] rprojroot_2.0.2     utf8_1.2.1          cellranger_1.1.0   
[10] R6_2.5.0            backports_1.2.1     reprex_2.0.0       
[13] httr_1.4.2          pillar_1.6.1        rlang_0.4.11.9000  
[16] readxl_1.3.1        curl_4.3.1          rstudioapi_0.13    
[19] callr_3.7.0         raster_3.4-10       desc_1.3.0         
[22] devtools_2.4.1      munsell_0.5.0       broom_0.7.6        
[25] modelr_0.1.8        compiler_4.1.0      pkgconfig_2.0.3    
[28] pkgbuild_1.2.0      tidyselect_1.1.1    rayrender_0.21.2   
[31] codetools_0.2-18    fansi_0.4.2         crayon_1.4.1       
[34] dbplyr_2.1.1        withr_2.4.2         grid_4.1.0         
[37] jsonlite_1.7.2      gtable_0.3.0        lifecycle_1.0.0    
[40] PeriodicTable_0.1.2 DBI_1.1.1           magrittr_2.0.1     
[43] scales_1.1.1        stringi_1.6.2       cli_2.5.0          
[46] cachem_1.0.5        fs_1.5.0            remotes_2.3.0      
[49] sp_1.4-5            testthat_3.0.2      xml2_1.3.2         
[52] ellipsis_0.3.2      generics_0.1.0      vctrs_0.3.8        
[55] tools_4.1.0         glue_1.4.2          rayimage_0.5.1     
[58] hms_1.1.0           processx_3.5.2      pkgload_1.2.1      
[61] parallel_4.1.0      fastmap_1.1.0       colorspace_2.0-1   
[64] sessioninfo_1.1.1   rvest_1.0.0         memoise_2.0.0      
[67] haven_2.4.1         usethis_2.0.1

tylermorganwall commented 3 years ago

What happens when you read in the non-gzipped file?

jmcastagnetto commented 3 years ago

Sorry, forgot to include that case (which I've already tried).

The plain PDB file works, it is the gzipped version that causes trouble:

> library(raymolecule)
> p = read_pdb("1kys.pdb")
> str(p)
List of 2
 $ atoms:'data.frame':  2044 obs. of  5 variables:
  ..$ x    : num [1:2044] 50.4 50.1 51.2 50.9 52.5 ...
  ..$ y    : num [1:2044] 10.6 11.2 11.1 11.1 11 ...
  ..$ z    : num [1:2044] 14.2 12.9 11.9 10.7 12.2 ...
  ..$ type : chr [1:2044] "N" "C" "C" "O" ...
  ..$ index: int [1:2044] 1 2 3 4 5 6 7 8 9 10 ...
 $ bonds:'data.frame':  72 obs. of  3 variables:
  ..$ from  : num [1:72] 163 456 462 462 463 463 463 464 464 464 ...
  ..$ to    : num [1:72] 1796 462 456 463 462 ...
  ..$ number: num [1:72] 1 1 1 1 1 1 1 1 1 1 ...

And it renders (just tried with 100 samples):

jmcastagnetto commented 3 years ago

Just to check that is not a problem of the size of the file, I downloaded the water SDF file from https://pubchem.ncbi.nlm.nih.gov/compound/962, and converted that to PDB using OpenBabel:

$ obabel -i sdf Structure2D_CID_962.sdf -o pdb -O water.pdb
1 molecule converted
$ cat water.pdb 
COMPND    962 
AUTHOR    GENERATED BY OPEN BABEL 3.0.0
HETATM    1  O   HOH     1       2.537  -0.155   0.000  1.00  0.00           O  
HETATM    2  H   HOH     0       3.074   0.155   0.000  1.00  0.00           H  
HETATM    3  H   HOH     0       2.000   0.155   0.000  1.00  0.00           H  
CONECT    1    2    3                                                 
CONECT    2    1                                                      
CONECT    3    1                                                      
MASTER        0    0    0    0    0    0    0    0    3    0    3    0
END

And tried a series of tests. The plain PDB file works but the gzipped one doesn't:

> p = read_pdb("water.pdb")
> str(p)
List of 2
 $ atoms:'data.frame':  3 obs. of  5 variables:
  ..$ x    : num [1:3] 2.54 3.07 2
  ..$ y    : num [1:3] -0.155 0.155 0.155
  ..$ z    : num [1:3] 0 0 0
  ..$ type : chr [1:3] "O" "H" "H"
  ..$ index: int [1:3] 1 2 3
 $ bonds:'data.frame':  4 obs. of  3 variables:
  ..$ from  : num [1:4] 1 1 2 3
  ..$ to    : num [1:4] 2 3 1 1
  ..$ number: num [1:4] 1 1 1 1

# using the gzip'd file
> p = read_pdb("water.pdb.gz") 
Error in readChar(con, nchars = 6) : invalid UTF-8 input in readChar()

I believe it's something to do with file() in R 4.1.0

# It can be read w/o "rb"
> con = file("water.pdb.gz")
> readChar(con, n = 6)
[1] "COMPND"

# fails when using "rb"
> con = file("water.pdb.gz", open = "rb")
> readChar(con, n = 6)
Error in readChar(con, n = 6) : invalid UTF-8 input in readChar()

# works w/ a warning when using "r"
> con = file("water.pdb.gz", open = "r")
> readChar(con, n = 6)
[1] "COMPND"
Warning message:
In readChar(con, n = 6) :
  text connection used with readChar(), results may be incorrect

# using gzfile() and "rb" works too
> con = gzfile("water.pdb.gz", open = "rb")
> readChar(con, n = 6)
[1] "COMPND"

tylermorganwall commented 3 years ago

I recommend uncompressing the file with the utils::untar() function before reading in the file. Is there a reason you're keeping the file compressed?

jmcastagnetto commented 3 years ago

Yep, that is what I am doing with the molecules I am playing with -- reliving the 90s-00s when I was dealing more with that research.

In the end, it seems to be an issue with the open = "rb" and gzipped files, and not something in read_pdb()

> t1 = file("test.txt", "rb")
> readChar(t1, 6)
[1] "a test"
> t2 = file("test.txt.gz", "rb") 
> readChar(t2, 6)
[1] "\037\x8b\b\bݚ\xaa"
> t3 = gzfile("test.txt.gz", "rb")
> readChar(t3, 6)
[1] "a test"
> t4 = file("test.txt.gz")
> readChar(t4, 6)
[1] "a test"

tylermorganwall / raymolecule

read_pdb fails in Linux when using a gzip compressed file in R 4.1.0 #2