Closed rmcgibbo closed 9 years ago
You can fix with
In [1]: b'bytes'.find(b'string')
Out[1]: -1
"fix"
Does the PDB format specify an encoding?
Not that I know of. Or rather, I don't think PDB files are ever supposed to contain non-ASCII characters.
input_stream
opened to read strings, not bytes.input_stream
created from urllib.request.urlopen
, which yields bytes.One of these two things needs to change. I'm not sure which though.
(for file io, this is the difference between open(fn, 'r')
and open(fn, 'rb')
)
We always use 'r', not 'rb', so that's the problem.
No, that's why it works within OpenMM, because "r"
yields strings, not bytes. It's only urllib that yields bytes, which is why you only see this issue in PDBFixer.
Yes, that's what I meant. The problem is that it expects a string, not bytes. How do we change the urlopen() call to get strings instead?
I don't think there's an easy one liner. You can do urllib.request.urlopen(url).read().decode('utf-8')
, but then you need to stuff it back into a file-like object?
We could change OpenMM to respond appropriately to either strings or bytes too.
What would be the easiest way of doing that? PdbStructure reads lines from the PDB file with
for pdb_line in input_stream:
But if it's a stream of bytes, each iteration will get one byte instead of one line, right? Is there an easy way it can detect the type of stream and convert/wrap/whatever it?
The iterator still splits on bytes-newlines.
In [97]: [line for line in open('README.md', 'rb')]
Out[97]:
[b'## [MSMBuilder: Statistical models for Biomolecular Dynamics](http://msmbuilder.org/)\n',
b'\n',
b'[![Build Status](https://travis-ci.org/msmbuilder/msmbuilder.png?branch=master)](https://travis-ci.org/msmbuilder/msmbuilder) [![PyPi version](https://pypip.in/v/msmbuilder/badge.png)](https://pypi.python.org/pypi/msmbuilder/) [![Supported Python versions](https://pypip.in/py_versions/msmbuilder/badge.svg)](https://pypi.python.org/pypi/msmbuilder/) [![License](https://img.shields.io/badge/license-LGPLv2.1+-red.svg?style=flat)](https://pypi.python.org/pypi/msmbuilder/)\n',
b'[![Documentation Status](https://img.shields.io/badge/docs-latest-blue.svg?style=flat)](http://msmbuilder.org)\n',
b'\n',
b'MSMBuilder is a python package which implements a series of statistical models for high-dimensional time-series -- the particular focus of the library is on the analysis of atomistic simulations of biomolecular dynamics such as protein folding and conformational change. MSMBuilder is available under the LGPL (v2.1 or later).\n',
b'\n',
b'Algorithms available include:\n',
b'\n',
b'- Geometric clustering\n',
b'- Markov state models\n',
b'- Time-structure independent components analysis\n',
b'- L1-regularized reversible hidden Markov models\n',
b'- Transition path theory\n',
b'\n',
b'Resources:\n',
b'- [Documentation](http://msmbuilder.org)\n',
b'- [Mailing List](https://mailman.stanford.edu/mailman/listinfo/msmbuilder-user)\n']
So if we just add "pdb_line = pdb_line.decode('utf-8')", will that work everywhere?
I think so.
So if we just add "pdb_line = pdb_line.decode('utf-8')", will that work everywhere?
No. In Py3 if you open as 'r' instead of 'rb', you read strings. Strings in Py3 don't have a decode
attribute:
>>> 'this is a string'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>> b'these are bytes'.decode('utf-8')
'these are bytes'
This is, by far, the worst part about the Py2->Py3 transition. It ruins duck typing between file-like objects unless you just open everything with 'rb'... but yuck. For instance: gzip.open(file, 'r')
yields bytes... open(file, 'r')
yields strings.
I have a helper class in ParmEd that wraps around an open file-like object. See https://github.com/ParmEd/ParmEd/blob/master/chemistry/formats/io.py#L62-L103 for example.
just check the type before decoding it.
On Thu, Feb 26, 2015 at 4:23 PM, Jason Swails notifications@github.com wrote:
So if we just add "pdb_line = pdb_line.decode('utf-8')", will that work everywhere?
No. In Py3 if you open as 'r' instead of 'rb', you read strings. Strings in Py3 don't have a decode attribute:
'this is a string'.decode('utf-8') Traceback (most recent call last): File "
", line 1, in AttributeError: 'str' object has no attribute 'decode'>>> b'these are bytes'.decode('utf-8')'these are bytes' This is, by far, the worst part about the Py2->Py3 transition. It ruins duck typing between file-like objects unless you just open everything with 'rb'... but yuck. For instance: gzip.open(file, 'r') yields bytes... open(file, 'r') yields strings.
I have a helper class in ParmEd that wraps around an open file-like object. See https://github.com/ParmEd/ParmEd/blob/master/chemistry/formats/io.py#L62-L103 for example.
— Reply to this email directly or view it on GitHub https://github.com/pandegroup/pdbfixer/issues/81#issuecomment-76305778.
Ok, how about this:
if not isinstance(pdb_line, str):
pdb_line = pdb_line.decode('utf-8')
That should work everywhere, right?
How many places do you read a line from the file? You need to do this in each place. I have so many parsers in ParmEd that I needed to write a class to handle it more robustly
PdbStructure is the only one we're talking about here. There are lots of other file readers, but they aren't relevant to PDBFixer.
I meant how many times to you read a line from a file inside PDBFixer? Or is it just one big for line in file
type of thing?
You mean PdbStructure? It's just one big loop.
This is hopefully fixed by https://github.com/pandegroup/openmm/pull/839.
One of the py3 errors is
The reason for this is that
pdbfile
, in this error case, is a urllib request which is a file-like object of bytes. the type ofpdb_line
is bytes.the error is the same error as in
I'm not sure if we should count this as a bug in
simtk/openmm/app/internal/pdbstructure.py
or not. Does the PDB format specify an encoding? Should we consider it as string-based, or bytes based?