Closed mobiusklein closed 1 year ago
Draft of an early-halting iterparse (note: no longer needed):
@xml._keepstate
def iterparse_until(source, target_name, quit_name):
    '''Iteratively parse XML stream in ``source``, yielding XML elements
    matching ``target_name``. If at any point a tag matching ``quit_name``
    is encountered, stop parsing.

    Parameters
    ----------
    source: file-like
        A file-like object over an XML document
    target_name: str
        The name of the XML tag to yield
    quit_name: str
        The name to stop parsing at.

    Yields
    ------
    lxml.etree.Element
    '''
    g = etree.iterparse(source, ('start', 'end'))
    for event, tag in g:
        if event == 'start':
            # Halt as soon as the quit tag opens; nothing after it is read.
            if xml._local_name(tag) == quit_name:
                break
        else:
            # On 'end' the element is complete: yield targets, free the rest.
            if xml._local_name(tag) == target_name:
                yield tag
            else:
                tag.clear()
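
For anyone who wants to try the early-halting idea without the pyteomics internals, here is a minimal standalone sketch using only the standard library's xml.etree.ElementTree. It is not the pyteomics code: local_name is a hypothetical stand-in for xml._local_name, and the sample document is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local_name(element):
    """Hypothetical stand-in for xml._local_name: drop any '{namespace}' prefix."""
    return element.tag.rsplit('}', 1)[-1]


def iterparse_until(source, target_name, quit_name):
    """Yield completed ``target_name`` elements; stop when ``quit_name`` opens."""
    for event, element in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            if local_name(element) == quit_name:
                break
        elif local_name(element) == target_name:
            yield element
        else:
            element.clear()


# Made-up sample: two param groups followed by the (potentially huge) run.
SAMPLE = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <referenceableParamGroupList>
    <referenceableParamGroup id="common"/>
    <referenceableParamGroup id="ms1"/>
  </referenceableParamGroupList>
  <run><spectrum id="scan=1"/></run>
</mzML>"""

found_ids = [
    el.get('id')
    for el in iterparse_until(io.BytesIO(SAMPLE), 'referenceableParamGroup', 'run')
]
print(found_ids)
```

In this sketch, child elements of a target are cleared as they close, so only the target's own attributes survive; keeping the children would need extra bookkeeping.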
Rewrote how param groups are read on-demand to avoid doing a full scan of the file unless there is no referenceableParamGroup with the requested id. That simplifies things.
Thank you!
Do you think it's beneficial to index those refs? That way get_by_id is supposed to seek immediately to the right place instead of doing a sequential scan. If those param groups are at the beginning, the savings could be negligible though. Or am I missing something?
The param groups are always near the top of the file, so it's of limited value to index them.
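
To make the trade-off in this exchange concrete, here is a rough sketch of on-demand lookup by sequential scan with a cache, again using only the standard library. It is not the pyteomics implementation: LazyParamGroups, its scans counter, and the choice to stop at the run tag are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


class LazyParamGroups:
    """Hypothetical helper: resolve referenceableParamGroup ids on first use."""

    def __init__(self, source):
        self.source = source
        self.cache = {}   # group id -> list of param attribute dicts
        self.scans = 0    # how many sequential scans were needed

    def get(self, group_id):
        if group_id not in self.cache:
            self._scan_for(group_id)
        return self.cache[group_id]

    def _scan_for(self, group_id):
        self.scans += 1
        self.source.seek(0)  # requires a seekable source
        for event, el in ET.iterparse(self.source, events=('start', 'end')):
            name = el.tag.rsplit('}', 1)[-1]
            if event == 'start' and name == 'run':
                break  # assumption: no definitions appear once the run starts
            if event == 'end' and name == 'referenceableParamGroup':
                self.cache[el.get('id')] = [dict(child.attrib) for child in el]
                if el.get('id') == group_id:
                    return
        raise KeyError(group_id)


# Made-up sample document for illustration.
SAMPLE = b"""<mzML>
  <referenceableParamGroupList>
    <referenceableParamGroup id="common">
      <cvParam accession="MS:1000511" name="ms level" value="1"/>
    </referenceableParamGroup>
    <referenceableParamGroup id="other">
      <cvParam accession="MS:1000130" name="positive scan" value=""/>
    </referenceableParamGroup>
  </referenceableParamGroupList>
  <run/>
</mzML>"""

groups = LazyParamGroups(io.BytesIO(SAMPLE))
common = groups.get('common')   # first scan, stops right after 'common'
common_again = groups.get('common')  # served from the cache, no new scan
```

Because each scan ends at the first match or at the start of the run, the cost is bounded by the size of the header, which is the reason given above for an index adding little.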
Closes #94
This change adds another special case to XML._get_info to treat referenceableParamGroupRef similar to how we treat cvParam and userParam, and then hands over responsibility for how that special handling is defined to the implementing subclass, so long as the result is a List[xml._XMLParam].

The implementation for mzML defers parsing referenceableParamGroup until you encounter a referenceableParamGroupRef and then does a random access seek from the start of the file, parsing using get_by_id.

Reasons why this might not be desirable:
On a source that does not support seek (e.g. stdin or a socket), you can't jump back to the start of the file to read the param group definitions.

As noted in the discussion, this behavior is irrelevant when used with retrieve_refs=True, but that is not on by default.
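
One way a caller could guard against the non-seekable case is to check the source before relying on deferred lookup. The sketch below is an assumption about how such a guard might look, not something this PR adds: can_defer_param_groups and NonSeekable are hypothetical names, and only the standard io module's seekable() method is relied on.

```python
import io


def can_defer_param_groups(source):
    """Return True if ``source`` can be rewound to re-read definitions later."""
    seekable = getattr(source, 'seekable', None)
    return bool(seekable and seekable())


class NonSeekable(io.RawIOBase):
    """Toy stand-in for a pipe or socket: readable but not seekable."""

    def readable(self):
        return True

    def seekable(self):
        return False


in_memory = can_defer_param_groups(io.BytesIO(b'<mzML/>'))
stream_like = can_defer_param_groups(NonSeekable())
```

When the check fails, the param group definitions would have to be read eagerly as they stream past, since there is no way to return to them.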