Closed levitsky closed 4 months ago
It looks like the conceptual role that _build_index
plays in the other type hierarchies is being played by IndexedTextReader.build_byte_index
. I don't remember when that was done. I also don't see a clean way of fixing this that leaves derived implementations of build_byte_index
intact, hooks into IndexSavingMixin
, and doesn't involve essentially duplicating IndexSavingMixin
or adding an extra layer of indirection. Before I dig any deeper, is that something you want to preserve still?
Let me summarize the state of things as I see it because I don't exactly understand what you are saying.
We have _build_index
which implements index saving but is not called by IndexedTextReader
, so auto-loading from external index file does not work for indexed text-based parsers.
However, IndexedXML
defines and explicitly calls self._build_index
(without loading from file), and then IndexSavingXML(IndexSavingMixin, IndexedXML)
appears to have the right MRO to try the external index before calling IndexedXML._build_index
, so that part of the class tree must work fine with offset index files.
Also, PreIndexedMzML
overrides it to first try and extract the embedded index before falling back to the superclass implementation.
Apart from that, there is a namesake method in MzMLb
, but that method has no connection with the others because MzMLb
is not an IndexedXML
at all.
To unify this behavior the obvious solution would be to rename buid_byte_index
to _build_index
or vice versa. Now what I think you are saying is that either change would break custom subclasses, because those on the text side would be overriding build_byte_index
, while those on the XML side would be overriding _build_index
, and we can't keep all of that working at the same time without extra work.
Intuitively it does not appear worth the effort to me, but that is because I have not really used index saving or overridden any of these methods and have not seen it done by others. (Edit: I tried searching on Github for any use of these methods and I couldn't find any). But I imagine you would have done at least some of it, and I imagine it would be XML-related. Do you have stuff dependent on that API?
If we do end up renaming and it comes down to which name to keep, one argument would be that build_byte_index
is public and that should make it more reliable, but on the other hand _build_index
is the fully functional one. If we change the behavior of build_byte_index
by trying to read offset files first, issues with downstream subclasses are still possible. So perhaps I would suggest keeping the XML API intact and renaming build_byte_index
.
Please correct me if I am overlooking anything.
Your analysis is correct. I don't have any subclasses that override _build_index
(I was surprised too), only the methods that IndexSavingMixin
uses, so renaming that method shouldn't break my code. It should be safe to do. We could also try to introduce a __init_subclass__
to check for an implementation of _build_index
and warn about it not working anymore. The __init_subclass__
method is less invasive than a metaclass, but is suitable since we're not trying to mutate the type at all:
class IndexedXML(...):
...
def __init_subclass__(cls, **kwargs):
super().__init_subclass__(**kwargs)
if hasattr(cls, '_build_index'):
warnings.warn("The method `_build_index` has been renamed to `build_byte_index`.")
If Python 2 support is still needed, then we'd have use a metaclass still since it lacks this method.
PreIndexedMzML
and MzMLb
are odd cases influenced by history.
For PreIndexedMzML
, there was a period of time where MSConvert was writing incorrect byte offsets some of the time, so we had to fall back to doing a full file scan, or the file we were given was missing the indexedmzML
wrapper. In either of those fallback cases, we might want to cache and reuse the scanning-built index, otherwise it's faster to just read the built-in one.
For MzMLb
, there is also a built-in index that is even faster to load than PreIndexedMzML
, it was introduced after that historical bug was solved, and so there was no value in making it able to save its index externally.
Thank you. I think we can then rename the method (adjusting the behavior as one method just returns the index while the other assigns the instance attribute). Adding an __init_subclass__
is a good idea, and I don't think it's essential to ensure the warning on Python 2, as long as the code remains syntactically valid for Py2, it's fine with me if the method does not work properly there.
Hi @mobiusklein, a quick question. I was looking at
IndexSavingMixin
andIndexedTextReader
for ways to improve #138 and I noticed that the index saving functionality is implemented in_build_index()
, which tries to read byte offsets from disk before falling back ontosuper()._build_index()
. However, there is no such method in parent classes, as the index is built inbuild_byte_index
, and so no method ever callsIndexSavingMixin._build_index()
, either.Is this perhaps an unintended result of some renaming in
IndexedTextReader
or am I misunderstanding something?