levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0
105 stars 34 forks source link

IndexSavingMixin._build_index() is never called by text parsers? #140

Closed levitsky closed 4 months ago

levitsky commented 5 months ago

Hi @mobiusklein, a quick question. I was looking at IndexSavingMixin and IndexedTextReader for ways to improve #138 and I noticed that the index saving functionality is implemented in _build_index(), which tries to read byte offsets from disk before falling back onto super()._build_index(). However, there is no such method in parent classes, as the index is built in build_byte_index, and so no method ever calls IndexSavingMixin._build_index(), either.

Is this perhaps an unintended result of some renaming in IndexedTextReader or am I misunderstanding something?

mobiusklein commented 5 months ago

It looks like the conceptual role that _build_index plays in the other type hierarchies is being played by IndexedTextReader.build_byte_index. I don't remember when that was done. I also don't see a clean way of fixing this that leaves derived implementations of build_byte_index intact, hooks into IndexSavingMixin, and doesn't involve essentially duplicating IndexSavingMixin or adding an extra layer of indirection. Before I dig any deeper, is that something you want to preserve still?

levitsky commented 5 months ago

Let me summarize the state of things as I see it because I don't exactly understand what you are saying.

We have _build_index which implements index saving but is not called by IndexedTextReader, so auto-loading from external index file does not work for indexed text-based parsers. However, IndexedXML defines and explicitly calls self._build_index (without loading from file), and then IndexSavingXML(IndexSavingMixin, IndexedXML) appears to have the right MRO to try the external index before calling IndexedXML._build_index, so that part of the class tree must work fine with offset index files. Also, PreIndexedMzML overrides it to first try and extract the embedded index before falling back to the superclass implementation. Apart from that, there is a namesake method in MzMLb, but that method has no connection with the others because MzMLb is not an IndexedXML at all.

To unify this behavior the obvious solution would be to rename buid_byte_index to _build_index or vice versa. Now what I think you are saying is that either change would break custom subclasses, because those on the text side would be overriding build_byte_index, while those on the XML side would be overriding _build_index, and we can't keep all of that working at the same time without extra work.

Intuitively it does not appear worth the effort to me, but that is because I have not really used index saving or overridden any of these methods and have not seen it done by others. (Edit: I tried searching on Github for any use of these methods and I couldn't find any). But I imagine you would have done at least some of it, and I imagine it would be XML-related. Do you have stuff dependent on that API?

If we do end up renaming and it comes down to which name to keep, one argument would be that build_byte_index is public and that should make it more reliable, but on the other hand _build_index is the fully functional one. If we change the behavior of build_byte_index by trying to read offset files first, issues with downstream subclasses are still possible. So perhaps I would suggest keeping the XML API intact and renaming build_byte_index.

Please correct me if I am overlooking anything.

mobiusklein commented 5 months ago

Your analysis is correct. I don't have any subclasses that override _build_index (I was surprised too), only the methods that IndexSavingMixin uses, so renaming that method shouldn't break my code. It should be safe to do. We could also try to introduce a __init_subclass__ to check for an implementation of _build_index and warn about it not working anymore. The __init_subclass__ method is less invasive than a metaclass, but is suitable since we're not trying to mutate the type at all:

class IndexedXML(...):
    ...

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if hasattr(cls, '_build_index'):
            warnings.warn("The method `_build_index` has been renamed to `build_byte_index`.")

If Python 2 support is still needed, then we'd have use a metaclass still since it lacks this method.


PreIndexedMzML and MzMLb are odd cases influenced by history.

For PreIndexedMzML, there was a period of time where MSConvert was writing incorrect byte offsets some of the time, so we had to fall back to doing a full file scan, or the file we were given was missing the indexedmzML wrapper. In either of those fallback cases, we might want to cache and reuse the scanning-built index, otherwise it's faster to just read the built-in one.

For MzMLb, there is also a built-in index that is even faster to load than PreIndexedMzML, it was introduced after that historical bug was solved, and so there was no value in making it able to save its index externally.

levitsky commented 5 months ago

Thank you. I think we can then rename the method (adjusting the behavior as one method just returns the index while the other assigns the instance attribute). Adding an __init_subclass__ is a good idea, and I don't think it's essential to ensure the warning on Python 2, as long as the code remains syntactically valid for Py2, it's fine with me if the method does not work properly there.