Closed mobiusklein closed 3 years ago
Thank you! I have tried this with the file from #33 and the slowdown in indexing is about ~10%, while the speedup in getting the first item is ~25x, which is great.
The parsing is still very slow though, with 6.4s retrieval time on my machine for the first item.
That "slow start" is because Iterfind
doesn't use the index when you iterate over it, only during entity lookup for _get_info
. Those six seconds are spent parsing through the first half of the mzIdentML file.
We could have xml.Iterfind.__iter__
loop over the index if the pattern is just the tag name like we do for xml.Iterfind.map
.
Now Iterfind
over an indexed tag will use the index when the index is a HierarchicalOffsetIndex
and a bunch of other preconditions are met.
Holy cow, I didn't realize we weren't doing that! This is much better!
Arguably we could split this into IndexedIterfind
so that it's easier to track the behavioral difference between XML
and IndexedXML
Done. Now trying to reader.iterfind(query).map(fn)
when query
isn't indexed leads to a more helpful error.
Thank you, this is great. I'm ready to merge this, was waiting for a reply from @jgriss but I'd rather merge and release shortly.
Just a minor question out of curiosity, why did you leave is_indexed
in Iterfind
rather than IndexedIterfind
? It's only called in the latter.
Is it for possible calls from user code? If I'm reading it correctly, is_indexed should always be False
when called from Iterfind
; I would have just moved it to the subclass and optionally have a constant False in Iterfind
.
I left it there because it required the fewest changes to the code to achieve the desired result, have an attribute in both classes that reflected whether the query was indexed already or not. I moved the implementation to be static the way you described for simplicity.
Since I sense this coming, I've made Iterfind
into almost a Sequence
, in that you can use x.iterind("Peptide")[123]
or slice it. When it's not indexed, it uses a slow itertools.islice
driven approach.
Thank you again. I hope I don't come across as trying to make you do all the work; I ask these questions because I value your input and I don't want to hijack your PR with changes that might not even be a good idea (now that I think of it, that's what suggestions in review mode are for; I'll remember to use them next time).
Indexing of Iterfind is neat! I don't think I would have thought of it myself, at least not now.
No worries. I implemented these extras for fun.
I think we talked about this a few years ago when implementing the Iterfind
and time
facades. The idea you proposed was to go even another step forward, to create attributes on readers that looked like a sequence-type but were really smart iterators like this. Something like MzML.spectra
and MzML.chromatograms
or MzIdentML.peptides
and MzIdentML.identifications
which would just become reader.iterfind("spectrum")
and reader.iterfind("chromatogram")
, or reader.iterfind("Peptide")
and reader.iterfind("SpectrumIdentificationResult")
.
I was hesitant to do this at the time because I thought we should be thinking about a wrapper around the raw dictionaries first, but after playing with the idea I couldn't find a satisfactory implementation that wouldn't break backwards compatibility.
Closes #33
The root cause in #33 is that when processing a
FragmentationArray
, themeasure_ref
attribute must be resolved. This leads to aget_by_id
call, but the requested ID is not in the index, so it leads to a sequential scan along the entire file. EachSpectrumIdentificationItem
contains at least threeFragmentationArray
, each of which references all threeMeasure
elements, leading to the seeming stand-still for a singleSpectrumIdentificationResult
To avoid this cropping up in the future, I've added all reference-able element types to the offset index. We'll see if this leads to an undesirable slowdown in initialization speed as it makes the indexer generate a more complex tokenizer pattern.