Index additional tags for mzIdentML

levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.

http://pyteomics.readthedocs.io

Apache License 2.0

105 stars 34 forks source link

Index additional tags for mzIdentML #34

Closed mobiusklein closed 3 years ago

mobiusklein commented 3 years ago

Closes #33

The root cause in #33 is that when processing a FragmentationArray, the measure_ref attribute must be resolved. This leads to a get_by_id call, but the requested ID is not in the index, so it leads to a sequential scan along the entire file. Each SpectrumIdentificationItem contains at least three FragmentationArray, each of which references all three Measure elements, leading to the seeming stand-still for a single SpectrumIdentificationResult

To avoid this cropping up in the future, I've added all reference-able element types to the offset index. We'll see if this leads to an undesirable slowdown in initialization speed as it makes the indexer generate a more complex tokenizer pattern.

levitsky commented 3 years ago

Thank you! I have tried this with the file from #33 and the slowdown in indexing is about ~10%, while the speedup in getting the first item is ~25x, which is great.

The parsing is still very slow though, with 6.4s retrieval time on my machine for the first item.

mobiusklein commented 3 years ago

That "slow start" is because Iterfind doesn't use the index when you iterate over it, only during entity lookup for _get_info. Those six seconds are spent parsing through the first half of the mzIdentML file.

We could have xml.Iterfind.__iter__ loop over the index if the pattern is just the tag name like we do for xml.Iterfind.map.

mobiusklein commented 3 years ago

Now Iterfind over an indexed tag will use the index when the index is a HierarchicalOffsetIndex and a bunch of other preconditions are met.

levitsky commented 3 years ago

Holy cow, I didn't realize we weren't doing that! This is much better! Arguably we could split this into IndexedIterfind so that it's easier to track the behavioral difference between XML and IndexedXML

mobiusklein commented 3 years ago

Done. Now trying to reader.iterfind(query).map(fn) when query isn't indexed leads to a more helpful error.

levitsky commented 3 years ago

Thank you, this is great. I'm ready to merge this, was waiting for a reply from @jgriss but I'd rather merge and release shortly. Just a minor question out of curiosity, why did you leave is_indexed in Iterfind rather than IndexedIterfind? It's only called in the latter. Is it for possible calls from user code? If I'm reading it correctly, is_indexed should always be False when called from Iterfind; I would have just moved it to the subclass and optionally have a constant False in Iterfind.

mobiusklein commented 3 years ago

I left it there because it required the fewest changes to the code to achieve the desired result, have an attribute in both classes that reflected whether the query was indexed already or not. I moved the implementation to be static the way you described for simplicity.

Since I sense this coming, I've made Iterfind into almost a Sequence, in that you can use x.iterind("Peptide")[123] or slice it. When it's not indexed, it uses a slow itertools.islice driven approach.

levitsky commented 3 years ago

Thank you again. I hope I don't come across as trying to make you do all the work; I ask these questions because I value your input and I don't want to hijack your PR with changes that might not even be a good idea (now that I think of it, that's what suggestions in review mode are for; I'll remember to use them next time).

Indexing of Iterfind is neat! I don't think I would have thought of it myself, at least not now.

mobiusklein commented 3 years ago

No worries. I implemented these extras for fun.

I think we talked about this a few years ago when implementing the Iterfind and time facades. The idea you proposed was to go even another step forward, to create attributes on readers that looked like a sequence-type but were really smart iterators like this. Something like MzML.spectra and MzML.chromatograms or MzIdentML.peptides and MzIdentML.identifications which would just become reader.iterfind("spectrum") and reader.iterfind("chromatogram"), or reader.iterfind("Peptide") and reader.iterfind("SpectrumIdentificationResult").

I was hesitant to do this at the time because I thought we should be thinking about a wrapper around the raw dictionaries first, but after playing with the idea I couldn't find a satisfactory implementation that wouldn't break backwards compatibility.