String/Bytes with PDB Files

rmcgibbo commented 9 years ago

One of the py3 errors is

    Traceback (most recent call last):
      File "/home/rmcgibbo/miniconda/envs/3.4.2/lib/python3.4/doctest.py", line 1324, in __run
        compileflags, 1), test.globs)
      File "<doctest pdbfixer.pdbfixer.PDBFixer.__init__[3]>", line 1, in <module>
        fixer = PDBFixer(pdbfile=file)
      File "/home/rmcgibbo/projects/pdbfixer/pdbfixer/pdbfixer.py", line 198, in __init__
        structure = PdbStructure(pdbfile)
      File "/home/rmcgibbo/miniconda/envs/3.4.2/lib/python3.4/site-packages/simtk/openmm/app/internal/pdbstructure.py", line 146, in __init__
        self._load(input_stream)
      File "/home/rmcgibbo/miniconda/envs/3.4.2/lib/python3.4/site-packages/simtk/openmm/app/internal/pdbstructure.py", line 154, in _load
        if (pdb_line.find("ATOM  ") == 0) or (pdb_line.find("HETATM") == 0):
    TypeError: Type str doesn't support the buffer API

The reason for this is that pdbfile, in this error case, is a urllib request which is a file-like object of bytes. the type of pdb_line is bytes.

the error is the same error as in

In [87]: b'bytes'.find('string')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-87-4e19ac47a0ed> in <module>()
----> 1 b'bytes'.find('string')

TypeError: Type str doesn't support the buffer API

I'm not sure if we should count this as a bug in simtk/openmm/app/internal/pdbstructure.py or not. Does the PDB format specify an encoding? Should we consider it as string-based, or bytes based?

mpharrigan commented 9 years ago

You can fix with

In [1]: b'bytes'.find(b'string')
Out[1]: -1

mpharrigan commented 9 years ago

"fix"

peastman commented 9 years ago

Does the PDB format specify an encoding?

Not that I know of. Or rather, I don't think PDB files are ever supposed to contain non-ASCII characters.

rmcgibbo commented 9 years ago

Currently on py3, if you want to create a PdbStructure, you need to have input_stream opened to read strings, not bytes.
PDBFixer's doctest use an input_stream created from urllib.request.urlopen, which yields bytes.

One of these two things needs to change. I'm not sure which though.

rmcgibbo commented 9 years ago

(for file io, this is the difference between open(fn, 'r') and open(fn, 'rb'))

peastman commented 9 years ago

We always use 'r', not 'rb', so that's the problem.

rmcgibbo commented 9 years ago

No, that's why it works within OpenMM, because "r" yields strings, not bytes. It's only urllib that yields bytes, which is why you only see this issue in PDBFixer.

peastman commented 9 years ago

Yes, that's what I meant. The problem is that it expects a string, not bytes. How do we change the urlopen() call to get strings instead?

rmcgibbo commented 9 years ago

I don't think there's an easy one liner. You can do urllib.request.urlopen(url).read().decode('utf-8'), but then you need to stuff it back into a file-like object?

rmcgibbo commented 9 years ago

We could change OpenMM to respond appropriately to either strings or bytes too.

peastman commented 9 years ago

What would be the easiest way of doing that? PdbStructure reads lines from the PDB file with

for pdb_line in input_stream:

But if it's a stream of bytes, each iteration will get one byte instead of one line, right? Is there an easy way it can detect the type of stream and convert/wrap/whatever it?

rmcgibbo commented 9 years ago

The iterator still splits on bytes-newlines.

In [97]: [line for line in open('README.md', 'rb')]
Out[97]: 
[b'## [MSMBuilder: Statistical models for Biomolecular Dynamics](http://msmbuilder.org/)\n',
 b'\n',
 b'[![Build Status](https://travis-ci.org/msmbuilder/msmbuilder.png?branch=master)](https://travis-ci.org/msmbuilder/msmbuilder) [![PyPi version](https://pypip.in/v/msmbuilder/badge.png)](https://pypi.python.org/pypi/msmbuilder/) [![Supported Python versions](https://pypip.in/py_versions/msmbuilder/badge.svg)](https://pypi.python.org/pypi/msmbuilder/) [![License](https://img.shields.io/badge/license-LGPLv2.1+-red.svg?style=flat)](https://pypi.python.org/pypi/msmbuilder/)\n',
 b'[![Documentation Status](https://img.shields.io/badge/docs-latest-blue.svg?style=flat)](http://msmbuilder.org)\n',
 b'\n',
 b'MSMBuilder is a python package which implements a series of statistical models for high-dimensional time-series -- the particular focus of the library is on the  analysis of atomistic simulations of biomolecular dynamics such as protein folding and conformational change. MSMBuilder is available under the LGPL (v2.1 or later).\n',
 b'\n',
 b'Algorithms available include:\n',
 b'\n',
 b'- Geometric clustering\n',
 b'- Markov state models\n',
 b'- Time-structure independent components analysis\n',
 b'- L1-regularized reversible hidden Markov models\n',
 b'- Transition path theory\n',
 b'\n',
 b'Resources:\n',
 b'- [Documentation](http://msmbuilder.org)\n',
 b'- [Mailing List](https://mailman.stanford.edu/mailman/listinfo/msmbuilder-user)\n']

peastman commented 9 years ago

So if we just add "pdb_line = pdb_line.decode('utf-8')", will that work everywhere?

rmcgibbo commented 9 years ago

I think so.

swails commented 9 years ago

So if we just add "pdb_line = pdb_line.decode('utf-8')", will that work everywhere?

No. In Py3 if you open as 'r' instead of 'rb', you read strings. Strings in Py3 don't have a decode attribute:

>>> 'this is a string'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>> b'these are bytes'.decode('utf-8')
'these are bytes'

This is, by far, the worst part about the Py2->Py3 transition. It ruins duck typing between file-like objects unless you just open everything with 'rb'... but yuck. For instance: gzip.open(file, 'r') yields bytes... open(file, 'r') yields strings.

I have a helper class in ParmEd that wraps around an open file-like object. See https://github.com/ParmEd/ParmEd/blob/master/chemistry/formats/io.py#L62-L103 for example.

rmcgibbo commented 9 years ago

just check the type before decoding it.

On Thu, Feb 26, 2015 at 4:23 PM, Jason Swails notifications@github.com wrote:

So if we just add "pdb_line = pdb_line.decode('utf-8')", will that work everywhere?

No. In Py3 if you open as 'r' instead of 'rb', you read strings. Strings in Py3 don't have a decode attribute:

'this is a string'.decode('utf-8') Traceback (most recent call last): File "", line 1, in AttributeError: 'str' object has no attribute 'decode'>>> b'these are bytes'.decode('utf-8')'these are bytes'

This is, by far, the worst part about the Py2->Py3 transition. It ruins duck typing between file-like objects unless you just open everything with 'rb'... but yuck. For instance: gzip.open(file, 'r') yields bytes... open(file, 'r') yields strings.

I have a helper class in ParmEd that wraps around an open file-like object. See https://github.com/ParmEd/ParmEd/blob/master/chemistry/formats/io.py#L62-L103 for example.

— Reply to this email directly or view it on GitHub https://github.com/pandegroup/pdbfixer/issues/81#issuecomment-76305778.

peastman commented 9 years ago

Ok, how about this:

if not isinstance(pdb_line, str):
    pdb_line = pdb_line.decode('utf-8')

That should work everywhere, right?

swails commented 9 years ago

How many places do you read a line from the file? You need to do this in each place. I have so many parsers in ParmEd that I needed to write a class to handle it more robustly

peastman commented 9 years ago

PdbStructure is the only one we're talking about here. There are lots of other file readers, but they aren't relevant to PDBFixer.

swails commented 9 years ago

I meant how many times to you read a line from a file inside PDBFixer? Or is it just one big for line in file type of thing?

peastman commented 9 years ago

You mean PdbStructure? It's just one big loop.

peastman commented 9 years ago

This is hopefully fixed by https://github.com/pandegroup/openmm/pull/839.

openmm / pdbfixer

String/Bytes with PDB Files #81