question about memory use with mzIdentML reader

levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.

http://pyteomics.readthedocs.io

Apache License 2.0

105 stars 34 forks source link

question about memory use with mzIdentML reader #145

Open colin-combe opened 2 months ago

colin-combe commented 2 months ago

Hi, I'm using pyteomics' great mzIdentML parser to read crosslink data out of mzIdentML. There's one particular point in the code where memory use suddenly climbs:- https://github.com/PRIDE-Archive/xi-mzidentml-converter/blob/python3/parser/MzIdParser.py#L584 its when it executes the following (actually the second line):

        for sil_id in self.mzid_reader._offset_index["SpectrumIdentificationList"].keys():
            sil = self.mzid_reader.get_by_id(sil_id, tag_id='SpectrumIdentificationList')

why is it here the memory use climbs and is there a better way to iterate over SpectrumIdentificationLists?

Just wondering, thanks, Colin

colin-combe commented 2 months ago

its a stupid question i think, it's just coz the SpectrumIdentificationList is the biggest thing

colin-combe commented 2 months ago

i'm looking for a way to iterate of over SpectrumIdentificationResults but not lose the information about which SpectrumIdentificationList they come from. I can't find a way to do this so I'm getting each SpectrumIdentificationList and iterating through the SpectrumIdentifcationResults within it. Then i know the SIL they come from. But this means loading the whole SIL into memory and it becomes problematic for large files. Is there a better way? A way to iterate the SIR for one SIL without loading the whole SIL?

mobiusklein commented 2 months ago

This can be done using XPath:

from pyteomics import mzid

reader = mzid.MzIdentML("path/to/file.mzid", use_index=True)

sil_ids = list(reader.index['SpectrumIdentificationList'].keys())

for sil_id in sil_ids:
    reader.reset()
    for  sir in reader.iterfind(f"//SpectrumIdentificationList[@id=\"{sil_id}\"]/*):
        # do something

If you need to also access the other attributes on the SpectrumIdentificationList:

reader.reset()
sil_attributes = list(reader.iterfind("SpectrumIdentificationList", recursive=False))

This gives you a dictionary of the id attribute as well as whatever else the writer included.

colin-combe commented 2 months ago

Hi, thanks for the reply Joshua, this would be great!

I can't yet get it to work...

my code looks like this now:

        sil_ids = list(self.mzid_reader.index['SpectrumIdentificationList'].keys())
        # iterate over all the spectrum identification lists
        for sil_id in sil_ids:
            self.mzid_reader.reset()
            for sid_result in self.mzid_reader.iterfind(f"//SpectrumIdentificationList[@id=\"{sil_id}\"]/SpectrumIdentificationResult"):

First, I need to change the XPath selector because not all subelements of SpectrumIdentificationList are SpectrumIdentificationResults. However, when i replace the wildcard asterisk with SpectrumIdentificationResult it doesn't return anything. Which seems odd because i tested it with an online Xpath testing tool and i think it worked. (I'm far from being an expert on XPath.)

But the other problem is this - if I keep the wild card so stuff is returned by iterfind(), the memory use is the same as before. Like the iterfind() also loads the whole SIL into memory?

The exact changes to my code are here: https://github.com/Rappsilber-Laboratory/xi-mzidentml-converter/commit/1fc8613c55f6fed0f4ebf81d0f51bd8f3289522d

I changed the initialisation of the mzId reader to mzid.MzIdentML(self.mzid_path, use_index=True) so it was just the same as yours and commented out some function calls that only work with retreiveRefs = False (which is a bit mysterious to me).

Does anyone have any further advice on either how to fix my Xpath issue or on whether .iterfind() will actually also load whole SIL into memory?

Apologies in advance for silly mistakes, best wishes, Colin

@mobiusklein p.s. sorry i missed your talk at PSI meeting.

colin-combe commented 2 months ago

i made a separate GH issue for my xpath question - https://github.com/levitsky/pyteomics/issues/146

i'll look for a better way to examine whats going on with memory and iterfind()

colin-combe commented 2 months ago

@mobiusklein - i confirm it's like you say. iterfind() uses less memory.

Test 1 - iterfind()

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9     78.7 MiB     78.7 MiB           1   @profile
    10                                         def test1():
    11   1732.7 MiB      0.0 MiB           2       for sil_id in sil_ids:
    12     78.7 MiB      0.0 MiB           1           print("Processing SpectrumIdentificationList", sil_id)
    13     78.7 MiB      0.0 MiB           1           reader.reset()
    14     78.7 MiB      0.0 MiB           1           xpath = f"//SpectrumIdentificationList[@id=\"{sil_id}\"]/*"
    15     78.7 MiB      0.0 MiB           1           print("XPATH: ", xpath)
    16   1732.7 MiB   1654.0 MiB        3032           for sir in reader.iterfind(xpath):
    17                                                     # do something
    18                                                     # print(sir)
    19   1731.3 MiB      0.0 MiB        3031               pass
    20   1732.7 MiB      0.0 MiB           1       print("Test 1 done!")

Time taken: 1613.2808635234833

Test 2 - get_by_id()
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9     80.1 MiB     80.1 MiB           1   @profile
    10                                         def test2():
    11   2098.2 MiB      0.0 MiB           2       for sil_id in sil_ids:
    12     80.1 MiB      0.0 MiB           1           print("Processing SpectrumIdentificationList", sil_id)
    13   2098.2 MiB   2018.0 MiB           1           sil = reader.get_by_id(sil_id, tag_id='SpectrumIdentificationList')
    14   2098.2 MiB      0.0 MiB        3031           for sid_result in sil['SpectrumIdentificationResult']:
    15                                                     # do something
    16                                                     # print(sir)
    17   2098.2 MiB      0.0 MiB        3030               pass
    18   2098.2 MiB      0.0 MiB           1       print("Test 2 done!")

Time taken: 210.73944902420044

but not actually that much less memory, and i think the increase in processing time will introduce more problems than the reduction in memory use will solve.

Dunno if this makes sense as a third comparison, but in reader performs best:

Test 3 - sid in reader

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9     79.0 MiB     79.0 MiB           1   @profile
    10                                         def test3():
    11                                             # for sil_id in sil_ids:
    12                                             #     print("Processing SpectrumIdentificationList", sil_id)
    13                                             #     sil = reader.get_by_id(sil_id, tag_id='SpectrumIdentificationList')
    14     90.9 MiB     11.9 MiB        3031       for sid_result in reader:
    15                                                 # do something
    16                                                 # print(sir)
    17     90.9 MiB      0.0 MiB        3030           pass
    18     90.9 MiB      0.0 MiB           1       print("Test 3 done!")

Time taken: 166.0890302658081

what would be needed to get performance like in reader but without losing the information about which SIL the SIR came from?

mobiusklein commented 2 months ago

Yes, there's something wrong with the xpath matching, and because of how lxml implements xpath, it can't be used in a memory-efficient manner (https://lxml.de/xpathxslt.html).

Here's a helper function that will make things easier:

from pyteomics.xml import _local_name
from lxml import etree

def iterfind_when(source, target_name, condition_name, stack_predicate, **kwargs):
    """
    Iteratively parse XML stream in ``source``, yielding XML elements
    matching ``target_name`` as long as earlier in the tree a ``condition_name`` element
    satisfies ``stack_predicate``, a callable that takes a single :class:`etree.Element` and returns
    a :class:`bool`.

    Parameters
    ----------
    source: file-like
        A file-like object over an XML document
    target_name: str
        The name of the XML tag to parse until
    condition_name: str
        The name to start parsing at when `stack_predicate` evaluates to true on this element.
    stack_predicate: callable
        A function called with a single `etree.Element` that determines if the sub-tree should be parsed
    **kwargs:
        Additional arguments passed to :meth:`source._get_info_smart`

    Yields
    ------
    lxml.etree.Element
    """
    g = etree.iterparse(source, ("start", "end"))
    state = False
    for event, tag in g:
        lc_name = _local_name(tag)
        if event == "start":
            if lc_name == condition_name:
                state = stack_predicate(tag)
            if lc_name == target_name and state:
                yield source._get_info_smart(tag, **kwargs)
            else:
                tag.clear()
        else:
            tag.clear()

from pyteomics import mzid

reader = mzid.MzIdentML(r"./tests/test.mzid", use_index=True)
for e in iterfind_when(
    reader,
    "SpectrumIdentificationResult",
    "SpectrumIdentificationList",
    lambda x: x.attrib["id"] == "SEQUEST_results",
    retrieve_refs=False
):
    print(e)

It has the memory characteristic of iterfind, but lets you selectively filter on a single parent element similar to what you're looking for. More complex filtering could be done, but it's not clear if we need to go that far.

RE PS - Thank you, but it was just re-hashing the need to factor the modification database out of the mzIdentML document and that I need to spend more time finding implementers. We did want to find out if someone in your group would be willing to clean up a few things in XLMOD though.

colin-combe commented 2 months ago

@mobiusklein - i haven't integrated your function into my code yet, but from looking at it in test cases it appears to perform amazingly well. I'll let you know when i have it working in my larger piece of code.

I'll mention following up those two things from PSI meetinging to Juan (XLMOD maintenance has been asked for before, I guess "refactor modification database out of mzIdentML document" means getting rid of/reducing need for the ModificationParam's).

colin-combe commented 2 months ago

@mobiusklein - thanks btw, i didn't say that before.

I think i see some odd behaviour when i'm using the iterfind_when function. Occasionally the SpectrumIdentificationResults are missing their SpectrumIdentificationItems?

I think this example code: https://github.com/colin-combe/pyteomics-test/blob/master/test_iterfind_when.py will illustrate it if pointed at this file: https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/10/PXD021417/XLpeplib_Beveridge_QEx-HFX_DSS_R3.mzid i can provide a shorter example file if you want (but its not short enough to upload to GH or put in GH project).

The example code is sometimes printing out "Has no SpectrumIdentificationItem?" but when i look up the SIR id in the xml file, it seems it does?

mobiusklein commented 2 months ago

I see the issue. My tests were small, and contained in whatever buffering lxml was doing. The following version of iterfind_when covers 100% of cases in the example file you shared:

def iterfind_when(source, target_name, condition_name, stack_predicate, **kwargs):
    """
    Iteratively parse XML stream in ``source``, yielding XML elements
    matching ``target_name`` as long as earlier in the tree a ``condition_name`` element
    satisfies ``stack_predicate``, a callable that takes a single :class:`etree.Element` and returns
    a :class:`bool`.

    Parameters
    ----------
    source: file-like
        A file-like object over an XML document
    target_name: str
        The name of the XML tag to parse until
    condition_name: str
        The name to start parsing at when `stack_predicate` evaluates to true on this element.
    stack_predicate: callable
        A function called with a single `etree.Element` that determines if the sub-tree should be parsed
    **kwargs:
        Additional arguments passed to :meth:`source._get_info_smart`

    Yields
    ------
    lxml.etree.Element
    """
    g = etree.iterparse(source, ("start", "end"))
    state = False
    for event, tag in g:
        lc_name = _local_name(tag)
        if event == "start":
            if lc_name == condition_name:
                state = stack_predicate(tag)
            if not (lc_name == target_name and state):
                tag.clear()
        else:
            if lc_name == target_name and state:
                yield source._get_info_smart(tag, **kwargs)
            tag.clear()

colin-combe commented 2 months ago

thanks for looking at it again, the SpectrumIdentificationItems are all empty objects now?

for me, with the new function, code like:

        for e in iterfind_when(
            reader,
            "SpectrumIdentificationResult",
            "SpectrumIdentificationList",
            lambda x: x.attrib["id"] == sil_id,
            retrieve_refs=False
        ):
            try:
               e['SpectrumIdentificationItem']
               print(f"Has SpectrumIdentificationItem?: {e['SpectrumIdentificationItem']}")
            except KeyError:
                print (f"Has no SpectrumIdentificationItem?: {e}")

is printing out:

Has SpectrumIdentificationItem?: [{}, {}, {}, {}, {}, {}, {}, {}]
Has SpectrumIdentificationItem?: [{}, {}, {}, {}, {}, {}, {}, {}, {}, {}]
Has SpectrumIdentificationItem?: [{}, {}]
Has SpectrumIdentificationItem?: [{}, {}, {}, {}, {}, {}, {}, {}]
Has SpectrumIdentificationItem?: [{}, {}, {}, {}]
etc.

mobiusklein commented 2 months ago

After actually interacting with the test script in the debugger, I found the issue. I can't say I understand why it's behaving this way. Here is a functional version:

def iterfind_when(source, target_name, condition_name, stack_predicate, **kwargs):
    """
    Iteratively parse XML stream in ``source``, yielding XML elements
    matching ``target_name`` as long as earlier in the tree a ``condition_name`` element
    satisfies ``stack_predicate``, a callable that takes a single :class:`etree.Element` and returns
    a :class:`bool`.

    Parameters
    ----------
    source: file-like
        A file-like object over an XML document
    target_name: str
        The name of the XML tag to parse until
    condition_name: str
        The name to start parsing at when `stack_predicate` evaluates to true on this element.
    stack_predicate: callable
        A function called with a single `etree.Element` that determines if the sub-tree should be parsed
    **kwargs:
        Additional arguments passed to :meth:`source._get_info_smart`

    Yields
    ------
    lxml.etree.Element
    """
    g = etree.iterparse(source, ("start", "end"))
    state = False
    history = []
    for event, tag in g:
        lc_name = _local_name(tag)
        if event == "start":
            if lc_name == condition_name:
                state = stack_predicate(tag)
        else:
            if lc_name == target_name and state:
                value = source._get_info_smart(tag, **kwargs)
                for t in history:
                    t.clear()
                history.clear()
                yield value
            elif state:
                history.append(tag)
            elif not state:
                tag.clear()

I'll spend more time working out why the state dependency is behaving this way.

colin-combe commented 2 months ago

@mobiusklein - yep, the above works. Awesome, thanks.

There was a delay getting back to you because the tests on the project were broken and I wanted to see the tests passing before declaring it worked. They passed - it works!

Can close this now, unless you're keeping it open whilst you try to work out "why the state dependency is behaving this way"

colin-combe commented 1 month ago

Hi Joshua (@mobiusklein) - i think i've found a problem with the above code.

there's a test case here: https://github.com/colin-combe/pyteomics-test/blob/master/test_iterfind_when.py

you'll see it's reading this file: https://github.com/colin-combe/pyteomics-test/blob/master/multiple_spectra_per_id_1_3_0_draft.mzid

which you'll recognise as an example file from mzId 1.3, so the obvious thing to think it that it doesn't work because it's MzIdentML 1.3.0. But I've changed it (schema reference and delete the cv param for extension document) so I think it should be valid 1.2.0. I don't think this is the problem, but could be mistaken.

It crashes with error:

Processing SpectrumIdentificationList sil_HCD
Traceback (most recent call last):
  File "/home/cc/work/pyteomics-test/test_iterfind_when.py", line 80, in <module>
    test3()
  File "/home/cc/work/pyteomics-test/test_iterfind_when.py", line 61, in test3
    for e in iterfind_when(
  File "/home/cc/work/pyteomics-test/test_iterfind_when.py", line 41, in iterfind_when
    value = source._get_info_smart(tag, **kwargs)
  File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/mzid.py", line 158, in _get_info_smart
    return self._get_info(element,
  File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/xml.py", line 435, in _get_info
    self._get_info_smart(child, ename=cname, **kwargs))
  File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/mzid.py", line 158, in _get_info_smart
    return self._get_info(element,
  File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/xml.py", line 424, in _get_info
    cname = _local_name(child)
  File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/xml.py", line 53, in _local_name
    if tag and tag[0] == '{':
TypeError: '_cython_3_0_10.cython_function_or_method' object is not subscriptable

any ideas? cheers, Colin

mobiusklein commented 1 month ago

This is because lxml treats comments as first class elements of the document, but they don't have the same interface as etree.Element. The fix is easy, fortunately. In iterfind_when replace

g = etree.iterparse(source, ("start", "end"))

with

g = etree.iterparse(source, ("start", "end"), remove_comments=True)

and it should work.

colin-combe commented 1 month ago

That does indeed fix it. Thanks, Colin