levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0
105 stars 34 forks source link

Loading Byte Offset Indices From File-like Objects #1

Closed mobiusklein closed 4 years ago

mobiusklein commented 4 years ago

When we load a pre-built index, we do a simple file existence test based upon a fixed name lookup on the mounted file system https://github.com/levitsky/pyteomics/blob/a764d47d8929f9fc75bb44fd0bf1ddd8be8b4bae/pyteomics/auxiliary/file_helpers.py#L478-L490

This works well enough, though it does have a bit of trouble with file-like objects that don't actually have names. I recently experimented with a random access file-like interface over HTTP(S), FTP, and AWS S3, where the file-like object has a name, but doesn't reference the file system and won't work with builtin open. To get around this, I want to add a new keyword argument to IndexSavingXML, index_file, that lets the caller pass a different name or a file-like object to use to load the offset index.

This parameter would be None by default and would result in the normal behavior when not present. It would complicate the index loading process a bit. Before diving into implementing it, I wanted to see if there was anything I was overlooking.

levitsky commented 4 years ago

I can't think of any complications, but maybe just a couple of questions I'd throw in: looks like this additional flexibility can be achieved either by adding extra arguments or by a slight refactoring so that the IndexSavingXML can be more easily subclased, e.g. providing a _get_byte_index_file_object() or something like that, which can be overridden. I'm not saying it's necessarily better.

One reason why I'm asking is this: I have no idea how those file-like objects that you have in mind work, but (given that they handle network connections of some sort) are they persistent enough that we can rely on them being operational for the lifetime of the parser object? The way index-saving parsers work, they read their index on construction but write it at an arbitrary later time, on demand. Should we be concerned that by that time the file-like object is no longer operational? And would it make sense to organize things so that a new connection is established (and a new file-like handle created) at the time it is needed?

(Even if you think it makes sense, we still don't have to use inheritance. These are two separate questions, now that I have typed the above. But squeezing a whole custom-handle-instantiation signature into the reader constructor does seem suboptimal, as opposed to providing an extra method. We also need some room for error-handling logic, if it is not entirely encapsulated in those file-like classes.)

Edit: you actually only mentioned reading, so perhaps I shouldn't have assumed that those objects are even writable. Then we'd need to handle that in the saving method, too.

mobiusklein commented 4 years ago

I hadn't thought about writing the index through a file-like object. Each protocol handles writing differently, or not at all. They establish new connections on-demand, and tear them down after each read, which can be expensive compared to a file handle open, but necessary for random access reads on these protocols.

What I had in mind was that the file-like object was passed to __init__, the index is read on instantiation, and then the file-like object is either closed or lingers on, unused. This assumed the index file already existed, and wasn't being used to just re-direct the index to a non-canonical name. The proposed method wouldn't be responsible for opening/closing connections, just reading from an open file-like object. Attempting to open new connections would require a custom file opening protocol, which nobody can do right 100% of the time (opens the authentication can of worms).

You made a good point about how to implement this less painfully with inheritance. I could just override _check_has_byte_offset_file and _read_byte_offsets to check the presence of the index file handle attribute.

mobiusklein commented 4 years ago

This override seems to do the job:

from pyteomics import mzml

def _open_if_not_file(obj, mode='rt'):
    if obj is None:
        return obj
    if hasattr(obj, 'read'):
        return obj
    return open(obj, mode)

class _MzMLParser(mzml.MzML):
    def __init__(self, *args, **kwargs):
        self._index_file_obj = _open_if_not_file(kwargs.pop("index_file", None))
        super(_MzMLParser, self).__init__(*args, **kwargs)

    def _check_has_byte_offset_file(self):
        if self._index_file_obj is not None:
            return True
        return super(_MzMLParser, self)._check_has_byte_offset_file()

    def _read_byte_offsets(self):
        if self._index_file_obj is not None:
            index = self._index_class.load(self._index_file_obj)
            try:
                self._index_file_obj.close()
            except (AttributeError, OSError, IOError):
                pass
            self._offset_index = index
        else:
            super(_MzMLParser, self)._read_byte_offsets()

I already use a sub-classed mzml.MzML object, so that's not an issue for me to add myself. I can close this issue without adding the complexity upstream.