levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

Resolve `referenceableParamGroupRef` #95

Closed mobiusklein closed 1 year ago

mobiusklein commented 1 year ago

Closes #94

This change adds another special case to XML._get_info that treats referenceableParamGroupRef similarly to cvParam and userParam. Responsibility for defining that special handling is handed over to the implementing subclass, so long as the result is a List[xml._XMLParam].

The mzML implementation defers parsing a referenceableParamGroup until a referenceableParamGroupRef is encountered, then uses get_by_id to do a random-access lookup, parsing from the start of the file.
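The delegation described above can be sketched roughly as follows. This is a toy illustration built on the standard library's ElementTree, not the code in this PR: `BaseReader`, `MzMLLikeReader`, `get_info`, and `resolve_param_group` are hypothetical names, the sample document omits the mzML namespace, and params are reduced to `(name, value)` pairs rather than `xml._XMLParam` objects.

```python
import io
import xml.etree.ElementTree as ET

# Minimal mzML-like document (namespace omitted for brevity).
MZML_SNIPPET = b"""<mzML>
  <referenceableParamGroupList count="1">
    <referenceableParamGroup id="CommonMS1SpectrumParams">
      <cvParam accession="MS:1000579" name="MS1 spectrum" value=""/>
      <cvParam accession="MS:1000130" name="positive scan" value=""/>
    </referenceableParamGroup>
  </referenceableParamGroupList>
  <run>
    <spectrum id="scan=1">
      <referenceableParamGroupRef ref="CommonMS1SpectrumParams"/>
      <cvParam accession="MS:1000511" name="ms level" value="1"/>
    </spectrum>
  </run>
</mzML>"""


class BaseReader:
    """Hypothetical generic reader; stands in for a format-agnostic base class."""

    def __init__(self, source):
        self.root = ET.parse(source).getroot()

    def resolve_param_group(self, ref_element):
        # Hook for subclasses: must return a list of (name, value) pairs.
        raise NotImplementedError

    def get_info(self, element):
        """Collect the params of ``element``, expanding group references in place."""
        params = []
        for child in element:
            if child.tag in ('cvParam', 'userParam'):
                params.append((child.get('name'), child.get('value')))
            elif child.tag == 'referenceableParamGroupRef':
                # The base class only knows that a list of params comes back.
                params.extend(self.resolve_param_group(child))
        return dict(params)


class MzMLLikeReader(BaseReader):
    """Hypothetical subclass that knows where group definitions live."""

    def resolve_param_group(self, ref_element):
        # The group is looked up only when a reference is actually encountered.
        group_id = ref_element.get('ref')
        for group in self.root.iter('referenceableParamGroup'):
            if group.get('id') == group_id:
                return [(p.get('name'), p.get('value')) for p in group]
        raise KeyError(group_id)


reader = MzMLLikeReader(io.BytesIO(MZML_SNIPPET))
info = reader.get_info(reader.root.find('./run/spectrum'))
```

Here `info` contains both the params written directly on the spectrum and those pulled in through the group reference, while the base class stays unaware of how the group was located.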

Reasons why this might not be desirable:

  1. When you are reading a stream that you cannot call seek on (e.g. stdin or a socket), you can't jump back to the start of the file to read the param group definitions.

As noted in the discussion, this behavior is irrelevant when used with retrieve_refs=True, but that is not on by default.
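To make the non-seekable case concrete, here is a small standard-library sketch of how a reader could check whether rewinding is possible before attempting it. `can_rewind` and `ForwardOnly` are hypothetical helpers for illustration only; they are not part of pyteomics.

```python
import io


def can_rewind(stream):
    """Return True if ``stream`` reports that it supports seek()."""
    try:
        return bool(stream.seekable())
    except AttributeError:
        # Objects without a seekable() method are treated as forward-only.
        return False


class ForwardOnly(io.RawIOBase):
    """Hypothetical forward-only source, mimicking a pipe or socket."""

    def __init__(self, data):
        self._buffer = io.BytesIO(data)

    def readable(self):
        return True

    def readinto(self, b):
        chunk = self._buffer.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)

    # seekable() is inherited from io.RawIOBase and returns False.


in_memory = io.BytesIO(b'<mzML/>')
piped = ForwardOnly(b'<mzML/>')
```

With these definitions, `can_rewind(in_memory)` is true and `can_rewind(piped)` is false, so a reader could fall back to collecting param groups as they stream past, or report a clear error, instead of failing inside seek.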

mobiusklein commented 1 year ago

Draft of an early-halting iterparse (note: no longer needed):

from lxml import etree
from pyteomics import xml

@xml._keepstate
def iterparse_until(source, target_name, quit_name):
    '''Iteratively parse XML stream in ``source``, yielding XML elements
    matching ``target_name``. If at any point a tag matching ``quit_name``
    is encountered, stop parsing.

    Parameters
    ----------
    source: file-like
        A file-like object over an XML document
    target_name: str
        The name of the XML tag to parse until
    quit_name: str
        The name to stop parsing at.

    Yields
    ------
    lxml.etree.Element
    '''
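    g = etree.iterparse(source, ('start', 'end'))
    for event, tag in g:
        name = xml._local_name(tag)
        if event == 'start':
            # Stop as soon as the quit tag opens; nothing past it is parsed.
            if name == quit_name:
                break
        elif name == target_name:
            # Yield on 'end' so the element and its children are complete.
            # Other elements are not cleared, since that would also wipe
            # the children of a target element that has not closed yet.
            yield tag

mobiusklein commented 1 year ago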

Rewrote how param groups are read on-demand to avoid doing a full scan of the file unless there is no referenceableParamGroup with the requested id. That simplifies things.
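A rough sketch of that on-demand strategy, under stated assumptions: the class below is a hypothetical stand-in written against the standard library's ElementTree, not the code merged here. It reads only the header section on first use and falls back to a full scan only when the requested id is still unknown.

```python
import io
import xml.etree.ElementTree as ET

# Toy mzML-like document (namespace omitted for brevity).
DOC = b"""<mzML>
  <referenceableParamGroupList count="1">
    <referenceableParamGroup id="common">
      <cvParam name="MS1 spectrum" value=""/>
    </referenceableParamGroup>
  </referenceableParamGroupList>
  <run><spectrum id="scan=1"/></run>
</mzML>"""


class LazyParamGroups:
    """Hypothetical on-demand cache of param group definitions."""

    def __init__(self, source):
        self.source = source
        self.cache = {}
        self.header_read = False
        self.full_scans = 0  # Counter kept only to make the behavior visible.

    def _store(self, elem):
        self.cache[elem.get('id')] = [
            (p.get('name'), p.get('value')) for p in elem]

    def _read_header(self):
        # Cheap path: stop as soon as the group list element is closed.
        self.source.seek(0)
        for _, elem in ET.iterparse(self.source, events=('end',)):
            if elem.tag == 'referenceableParamGroup':
                self._store(elem)
            elif elem.tag == 'referenceableParamGroupList':
                break
        self.header_read = True

    def _full_scan(self):
        # Expensive path: walk the whole document looking for definitions.
        self.full_scans += 1
        self.source.seek(0)
        for _, elem in ET.iterparse(self.source, events=('end',)):
            if elem.tag == 'referenceableParamGroup':
                self._store(elem)

    def get(self, group_id):
        if not self.header_read:
            self._read_header()
        if group_id not in self.cache:
            self._full_scan()
        return self.cache[group_id]  # KeyError if the id does not exist at all.


groups = LazyParamGroups(io.BytesIO(DOC))
common = groups.get('common')
```

Looking up `'common'` is served from the header pass alone, so `groups.full_scans` stays at zero; only a request for an id that was not found near the top triggers the full scan.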

levitsky commented 1 year ago

Thank you! Do you think it's beneficial to index those refs? That way get_by_id is supposed to seek immediately to the right place instead of a sequential scan. If those param groups are in the beginning, the savings could be negligible though.

Or am I missing something?

mobiusklein commented 1 year ago

The param groups are always near the top of the file, so it's of limited value to index them.
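As a rough illustration of that trade-off, the sketch below builds a byte-offset index of group definitions in a synthetic document and compares the offsets with the position of the first spectrum. The regular expression and `index_param_groups` are hypothetical helpers for this example and do not reflect how pyteomics builds its indexes.

```python
import re

# Matches the opening tag of a group definition, but not
# <referenceableParamGroupList> or <referenceableParamGroupRef>.
GROUP_START = re.compile(rb'<referenceableParamGroup\s+id="([^"]+)"')


def index_param_groups(data):
    """Map each param group id to the byte offset of its opening tag."""
    return {m.group(1).decode('utf-8'): m.start()
            for m in GROUP_START.finditer(data)}


# Synthetic document: two group definitions followed by 1000 spectra.
header = (
    b'<mzML><referenceableParamGroupList count="2">'
    b'<referenceableParamGroup id="ms1">'
    b'<cvParam name="ms level" value="1"/></referenceableParamGroup>'
    b'<referenceableParamGroup id="ms2">'
    b'<cvParam name="ms level" value="2"/></referenceableParamGroup>'
    b'</referenceableParamGroupList><run>')
spectra = b''.join(
    b'<spectrum id="scan=%d"><referenceableParamGroupRef ref="ms1"/></spectrum>' % i
    for i in range(1, 1001))
document = header + spectra + b'</run></mzML>'

offsets = index_param_groups(document)
first_spectrum = document.index(b'<spectrum')
```

Every offset in `offsets` falls before `first_spectrum`, so a sequential read from the start of the file reaches the definitions after only a small fraction of the document, which is why an index saves little in this layout.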