Closed jmcastagnetto closed 3 years ago
What happens when you read in the non-gzipped file?
Sorry, forgot to include that case (which I've already tried).
The plain PDB file works, it is the gzipped version that causes trouble:
> library(raymolecule)
> p = read_pdb("1kys.pdb")
> str(p)
List of 2
$ atoms:'data.frame': 2044 obs. of 5 variables:
..$ x : num [1:2044] 50.4 50.1 51.2 50.9 52.5 ...
..$ y : num [1:2044] 10.6 11.2 11.1 11.1 11 ...
..$ z : num [1:2044] 14.2 12.9 11.9 10.7 12.2 ...
..$ type : chr [1:2044] "N" "C" "C" "O" ...
..$ index: int [1:2044] 1 2 3 4 5 6 7 8 9 10 ...
$ bonds:'data.frame': 72 obs. of 3 variables:
..$ from : num [1:72] 163 456 462 462 463 463 463 464 464 464 ...
..$ to : num [1:72] 1796 462 456 463 462 ...
..$ number: num [1:72] 1 1 1 1 1 1 1 1 1 1 ...
And it renders (just tried with 100 samples):
Just to check that it is not a problem with the size of the file, I downloaded the water SDF file from https://pubchem.ncbi.nlm.nih.gov/compound/962 and converted it to PDB using OpenBabel:
$ obabel -i sdf Structure2D_CID_962.sdf -o pdb -O water.pdb
1 molecule converted
$ cat water.pdb
COMPND 962
AUTHOR GENERATED BY OPEN BABEL 3.0.0
HETATM 1 O HOH 1 2.537 -0.155 0.000 1.00 0.00 O
HETATM 2 H HOH 0 3.074 0.155 0.000 1.00 0.00 H
HETATM 3 H HOH 0 2.000 0.155 0.000 1.00 0.00 H
CONECT 1 2 3
CONECT 2 1
CONECT 3 1
MASTER 0 0 0 0 0 0 0 0 3 0 3 0
END
I then tried a series of tests. The plain PDB file works, but the gzipped one doesn't:
> p = read_pdb("water.pdb")
> str(p)
List of 2
$ atoms:'data.frame': 3 obs. of 5 variables:
..$ x : num [1:3] 2.54 3.07 2
..$ y : num [1:3] -0.155 0.155 0.155
..$ z : num [1:3] 0 0 0
..$ type : chr [1:3] "O" "H" "H"
..$ index: int [1:3] 1 2 3
$ bonds:'data.frame': 4 obs. of 3 variables:
..$ from : num [1:4] 1 1 2 3
..$ to : num [1:4] 2 3 1 1
..$ number: num [1:4] 1 1 1 1
# using the gzip'd file
> p = read_pdb("water.pdb.gz")
Error in readChar(con, nchars = 6) : invalid UTF-8 input in readChar()
I believe it has something to do with file() in R 4.1.0:
# It can be read w/o "rb"
> con = file("water.pdb.gz")
> readChar(con, n = 6)
[1] "COMPND"
# fails when using "rb"
> con = file("water.pdb.gz", open = "rb")
> readChar(con, n = 6)
Error in readChar(con, n = 6) : invalid UTF-8 input in readChar()
# works w/ a warning when using "r"
> con = file("water.pdb.gz", open = "r")
> readChar(con, n = 6)
[1] "COMPND"
Warning message:
In readChar(con, n = 6) :
text connection used with readChar(), results may be incorrect
# using gzfile() and "rb" works too
> con = gzfile("water.pdb.gz", open = "rb")
> readChar(con, n = 6)
[1] "COMPND"
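The tests above suggest that gzfile() opened with "rb" copes with the compressed file. A minimal sketch of that idea follows, assuming a hypothetical helper named open_maybe_gz() (it is not part of raymolecule, and this is not how read_pdb() is written). It relies on the documented behaviour that gzfile() can also read uncompressed files, so one code path could serve both cases:

```r
# Hypothetical helper (not part of raymolecule): open a file for binary
# reading whether or not it is gzip-compressed.
open_maybe_gz <- function(path) {
  # gzfile() decompresses gzip input and passes plain files through unchanged
  gzfile(path, open = "rb")
}

# Build one plain and one gzipped temporary file with the same content,
# so the example is self-contained.
plain <- tempfile(fileext = ".txt")
writeLines("a test", plain)

gz <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz, open = "w")
writeLines("a test", con)
close(con)

# Read the first six characters through the helper and close the connection.
read_first6 <- function(path) {
  con <- open_maybe_gz(path)
  on.exit(close(con))
  readChar(con, nchars = 6)
}

from_plain <- read_first6(plain)
from_gz <- read_first6(gz)
print(c(plain = from_plain, gz = from_gz))
```

Both reads should give the same six characters, which is the behaviour one would want from a reader that accepts either kind of file.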
I recommend uncompressing the file with the utils::untar() function before reading it in. Is there a reason you're keeping the file compressed?
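One caveat on the suggestion above: utils::untar() is meant for tar archives, so a single gzipped file such as water.pdb.gz (not a .tar.gz) may need a different route. Below is a rough base-R sketch that streams the decompressed bytes into a plain file. The function name gunzip_to() and the 64 KiB chunk size are assumptions made for illustration, and the file contents are made up for the demo:

```r
# Hypothetical helper: decompress a gzip file to a plain file using base R only.
gunzip_to <- function(gz_path, out_path) {
  src <- gzfile(gz_path, open = "rb")   # decompressing reader
  on.exit(close(src), add = TRUE)
  dst <- file(out_path, open = "wb")    # raw byte writer
  on.exit(close(dst), add = TRUE)
  repeat {
    chunk <- readBin(src, what = "raw", n = 65536L)
    if (length(chunk) == 0L) break      # end of the compressed stream
    writeBin(chunk, dst)
  }
  invisible(out_path)
}

# Demo on a temporary gzipped file holding two illustrative lines.
gz <- tempfile(fileext = ".pdb.gz")
con <- gzfile(gz, open = "w")
writeLines(c("COMPND 962", "END"), con)
close(con)

out <- tempfile(fileext = ".pdb")
gunzip_to(gz, out)
unpacked <- readLines(out)
print(unpacked)
```

The resulting plain file can then be passed to read_pdb() in the usual way.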
Yep, that is what I am doing with the molecules I am playing with -- reliving the 90s-00s when I was dealing more with that research.
In the end, it seems to be an issue with open = "rb" and gzipped files, and not something in read_pdb():
> t1 = file("test.txt", "rb")
> readChar(t1, 6)
[1] "a test"
> t2 = file("test.txt.gz", "rb")
> readChar(t2, 6)
[1] "\037\x8b\b\bݚ\xaa"
> t3 = gzfile("test.txt.gz", "rb")
> readChar(t3, 6)
[1] "a test"
> t4 = file("test.txt.gz")
> readChar(t4, 6)
[1] "a test"
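The t2 result above starts with "\037\x8b", which are the two gzip magic bytes (0x1f 0x8b): file() in "rb" mode hands back the compressed bytes as-is. As a sketch only, a caller could use that to check for gzip input before choosing a connection. The name is_gzipped() is made up for this example:

```r
# Hypothetical check: does the file start with the gzip magic bytes 1f 8b?
is_gzipped <- function(path) {
  con <- file(path, open = "rb")        # "rb" returns the raw, undecompressed bytes
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 2L)
  identical(magic, as.raw(c(0x1f, 0x8b)))
}

# Demo with one plain and one gzipped temporary file.
plain <- tempfile(fileext = ".txt")
writeLines("a test", plain)

gz <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz, open = "w")
writeLines("a test", con)
close(con)

plain_is_gz <- is_gzipped(plain)
gz_is_gz <- is_gzipped(gz)
print(c(plain = plain_is_gz, gz = gz_is_gz))
```

This checks the content instead of the file extension, so it would also catch a gzipped file that lacks the .gz suffix.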
An example of this error:
Seems like the issue is in https://github.com/tylermorganwall/raymolecule/blob/159959d59c9e2ce9d479397a7c51c929b3854252/R/read_pdb.R#L25, where the PDB file is read using:
Subsequently, it fails when trying to execute:
Not sure if it is an issue with the new version of R (which I've updated today), or "raymolecule" (which I've installed today).
FWIW, when trying to read the file w/o using "rb", it works:
For reference, here is my sessionInfo() output: