Open colin-combe opened 2 months ago
its a stupid question i think, it's just coz the SpectrumIdentificationList is the biggest thing
i'm looking for a way to iterate of over SpectrumIdentificationResults but not lose the information about which SpectrumIdentificationList they come from. I can't find a way to do this so I'm getting each SpectrumIdentificationList and iterating through the SpectrumIdentifcationResults within it. Then i know the SIL they come from. But this means loading the whole SIL into memory and it becomes problematic for large files. Is there a better way? A way to iterate the SIR for one SIL without loading the whole SIL?
This can be done using XPath:
from pyteomics import mzid
reader = mzid.MzIdentML("path/to/file.mzid", use_index=True)
sil_ids = list(reader.index['SpectrumIdentificationList'].keys())
for sil_id in sil_ids:
reader.reset()
for sir in reader.iterfind(f"//SpectrumIdentificationList[@id=\"{sil_id}\"]/*):
# do something
If you need to also access the other attributes on the SpectrumIdentificationList
:
reader.reset()
sil_attributes = list(reader.iterfind("SpectrumIdentificationList", recursive=False))
This gives you a dictionary of the id
attribute as well as whatever else the writer included.
Hi, thanks for the reply Joshua, this would be great!
I can't yet get it to work...
my code looks like this now:
sil_ids = list(self.mzid_reader.index['SpectrumIdentificationList'].keys())
# iterate over all the spectrum identification lists
for sil_id in sil_ids:
self.mzid_reader.reset()
for sid_result in self.mzid_reader.iterfind(f"//SpectrumIdentificationList[@id=\"{sil_id}\"]/SpectrumIdentificationResult"):
First, I need to change the XPath selector because not all subelements of SpectrumIdentificationList are SpectrumIdentificationResults. However, when i replace the wildcard asterisk with SpectrumIdentificationResult it doesn't return anything. Which seems odd because i tested it with an online Xpath testing tool and i think it worked. (I'm far from being an expert on XPath.)
But the other problem is this - if I keep the wild card so stuff is returned by iterfind(), the memory use is the same as before. Like the iterfind() also loads the whole SIL into memory?
The exact changes to my code are here: https://github.com/Rappsilber-Laboratory/xi-mzidentml-converter/commit/1fc8613c55f6fed0f4ebf81d0f51bd8f3289522d
I changed the initialisation of the mzId reader to mzid.MzIdentML(self.mzid_path, use_index=True)
so it was just the same as yours and commented out some function calls that only work with retreiveRefs = False
(which is a bit mysterious to me).
Does anyone have any further advice on either how to fix my Xpath issue or on whether .iterfind() will actually also load whole SIL into memory?
Apologies in advance for silly mistakes, best wishes, Colin
@mobiusklein p.s. sorry i missed your talk at PSI meeting.
i made a separate GH issue for my xpath question - https://github.com/levitsky/pyteomics/issues/146
i'll look for a better way to examine whats going on with memory and iterfind()
@mobiusklein - i confirm it's like you say. iterfind() uses less memory.
Test 1 - iterfind()
Line # Mem usage Increment Occurrences Line Contents
=============================================================
9 78.7 MiB 78.7 MiB 1 @profile
10 def test1():
11 1732.7 MiB 0.0 MiB 2 for sil_id in sil_ids:
12 78.7 MiB 0.0 MiB 1 print("Processing SpectrumIdentificationList", sil_id)
13 78.7 MiB 0.0 MiB 1 reader.reset()
14 78.7 MiB 0.0 MiB 1 xpath = f"//SpectrumIdentificationList[@id=\"{sil_id}\"]/*"
15 78.7 MiB 0.0 MiB 1 print("XPATH: ", xpath)
16 1732.7 MiB 1654.0 MiB 3032 for sir in reader.iterfind(xpath):
17 # do something
18 # print(sir)
19 1731.3 MiB 0.0 MiB 3031 pass
20 1732.7 MiB 0.0 MiB 1 print("Test 1 done!")
Time taken: 1613.2808635234833
Test 2 - get_by_id()
Line # Mem usage Increment Occurrences Line Contents
=============================================================
9 80.1 MiB 80.1 MiB 1 @profile
10 def test2():
11 2098.2 MiB 0.0 MiB 2 for sil_id in sil_ids:
12 80.1 MiB 0.0 MiB 1 print("Processing SpectrumIdentificationList", sil_id)
13 2098.2 MiB 2018.0 MiB 1 sil = reader.get_by_id(sil_id, tag_id='SpectrumIdentificationList')
14 2098.2 MiB 0.0 MiB 3031 for sid_result in sil['SpectrumIdentificationResult']:
15 # do something
16 # print(sir)
17 2098.2 MiB 0.0 MiB 3030 pass
18 2098.2 MiB 0.0 MiB 1 print("Test 2 done!")
Time taken: 210.73944902420044
but not actually that much less memory, and i think the increase in processing time will introduce more problems than the reduction in memory use will solve.
Dunno if this makes sense as a third comparison, but in reader
performs best:
Test 3 - sid in reader
Line # Mem usage Increment Occurrences Line Contents
=============================================================
9 79.0 MiB 79.0 MiB 1 @profile
10 def test3():
11 # for sil_id in sil_ids:
12 # print("Processing SpectrumIdentificationList", sil_id)
13 # sil = reader.get_by_id(sil_id, tag_id='SpectrumIdentificationList')
14 90.9 MiB 11.9 MiB 3031 for sid_result in reader:
15 # do something
16 # print(sir)
17 90.9 MiB 0.0 MiB 3030 pass
18 90.9 MiB 0.0 MiB 1 print("Test 3 done!")
Time taken: 166.0890302658081
what would be needed to get performance like in reader
but without losing the information about which SIL the SIR came from?
Yes, there's something wrong with the xpath
matching, and because of how lxml
implements xpath
, it can't be used in a memory-efficient manner (https://lxml.de/xpathxslt.html).
Here's a helper function that will make things easier:
from pyteomics.xml import _local_name
from lxml import etree
def iterfind_when(source, target_name, condition_name, stack_predicate, **kwargs):
"""
Iteratively parse XML stream in ``source``, yielding XML elements
matching ``target_name`` as long as earlier in the tree a ``condition_name`` element
satisfies ``stack_predicate``, a callable that takes a single :class:`etree.Element` and returns
a :class:`bool`.
Parameters
----------
source: file-like
A file-like object over an XML document
target_name: str
The name of the XML tag to parse until
condition_name: str
The name to start parsing at when `stack_predicate` evaluates to true on this element.
stack_predicate: callable
A function called with a single `etree.Element` that determines if the sub-tree should be parsed
**kwargs:
Additional arguments passed to :meth:`source._get_info_smart`
Yields
------
lxml.etree.Element
"""
g = etree.iterparse(source, ("start", "end"))
state = False
for event, tag in g:
lc_name = _local_name(tag)
if event == "start":
if lc_name == condition_name:
state = stack_predicate(tag)
if lc_name == target_name and state:
yield source._get_info_smart(tag, **kwargs)
else:
tag.clear()
else:
tag.clear()
from pyteomics import mzid
reader = mzid.MzIdentML(r"./tests/test.mzid", use_index=True)
for e in iterfind_when(
reader,
"SpectrumIdentificationResult",
"SpectrumIdentificationList",
lambda x: x.attrib["id"] == "SEQUEST_results",
retrieve_refs=False
):
print(e)
It has the memory characteristic of iterfind
, but lets you selectively filter on a single parent element similar to what you're looking for. More complex filtering could be done, but it's not clear if we need to go that far.
RE PS - Thank you, but it was just re-hashing the need to factor the modification database out of the mzIdentML document and that I need to spend more time finding implementers. We did want to find out if someone in your group would be willing to clean up a few things in XLMOD though.
@mobiusklein - i haven't integrated your function into my code yet, but from looking at it in test cases it appears to perform amazingly well. I'll let you know when i have it working in my larger piece of code.
I'll mention following up those two things from PSI meetinging to Juan (XLMOD maintenance has been asked for before, I guess "refactor modification database out of mzIdentML document" means getting rid of/reducing need for the ModificationParam's).
@mobiusklein - thanks btw, i didn't say that before.
I think i see some odd behaviour when i'm using the iterfind_when
function.
Occasionally the SpectrumIdentificationResults are missing their SpectrumIdentificationItems?
I think this example code: https://github.com/colin-combe/pyteomics-test/blob/master/test_iterfind_when.py will illustrate it if pointed at this file: https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/10/PXD021417/XLpeplib_Beveridge_QEx-HFX_DSS_R3.mzid i can provide a shorter example file if you want (but its not short enough to upload to GH or put in GH project).
The example code is sometimes printing out "Has no SpectrumIdentificationItem?" but when i look up the SIR id in the xml file, it seems it does?
I see the issue. My tests were small, and contained in whatever buffering lxml
was doing. The following version of iterfind_when
covers 100% of cases in the example file you shared:
def iterfind_when(source, target_name, condition_name, stack_predicate, **kwargs):
"""
Iteratively parse XML stream in ``source``, yielding XML elements
matching ``target_name`` as long as earlier in the tree a ``condition_name`` element
satisfies ``stack_predicate``, a callable that takes a single :class:`etree.Element` and returns
a :class:`bool`.
Parameters
----------
source: file-like
A file-like object over an XML document
target_name: str
The name of the XML tag to parse until
condition_name: str
The name to start parsing at when `stack_predicate` evaluates to true on this element.
stack_predicate: callable
A function called with a single `etree.Element` that determines if the sub-tree should be parsed
**kwargs:
Additional arguments passed to :meth:`source._get_info_smart`
Yields
------
lxml.etree.Element
"""
g = etree.iterparse(source, ("start", "end"))
state = False
for event, tag in g:
lc_name = _local_name(tag)
if event == "start":
if lc_name == condition_name:
state = stack_predicate(tag)
if not (lc_name == target_name and state):
tag.clear()
else:
if lc_name == target_name and state:
yield source._get_info_smart(tag, **kwargs)
tag.clear()
thanks for looking at it again, the SpectrumIdentificationItems are all empty objects now?
for me, with the new function, code like:
for e in iterfind_when(
reader,
"SpectrumIdentificationResult",
"SpectrumIdentificationList",
lambda x: x.attrib["id"] == sil_id,
retrieve_refs=False
):
try:
e['SpectrumIdentificationItem']
print(f"Has SpectrumIdentificationItem?: {e['SpectrumIdentificationItem']}")
except KeyError:
print (f"Has no SpectrumIdentificationItem?: {e}")
is printing out:
Has SpectrumIdentificationItem?: [{}, {}, {}, {}, {}, {}, {}, {}]
Has SpectrumIdentificationItem?: [{}, {}, {}, {}, {}, {}, {}, {}, {}, {}]
Has SpectrumIdentificationItem?: [{}, {}]
Has SpectrumIdentificationItem?: [{}, {}, {}, {}, {}, {}, {}, {}]
Has SpectrumIdentificationItem?: [{}, {}, {}, {}]
etc.
After actually interacting with the test script in the debugger, I found the issue. I can't say I understand why it's behaving this way. Here is a functional version:
def iterfind_when(source, target_name, condition_name, stack_predicate, **kwargs):
"""
Iteratively parse XML stream in ``source``, yielding XML elements
matching ``target_name`` as long as earlier in the tree a ``condition_name`` element
satisfies ``stack_predicate``, a callable that takes a single :class:`etree.Element` and returns
a :class:`bool`.
Parameters
----------
source: file-like
A file-like object over an XML document
target_name: str
The name of the XML tag to parse until
condition_name: str
The name to start parsing at when `stack_predicate` evaluates to true on this element.
stack_predicate: callable
A function called with a single `etree.Element` that determines if the sub-tree should be parsed
**kwargs:
Additional arguments passed to :meth:`source._get_info_smart`
Yields
------
lxml.etree.Element
"""
g = etree.iterparse(source, ("start", "end"))
state = False
history = []
for event, tag in g:
lc_name = _local_name(tag)
if event == "start":
if lc_name == condition_name:
state = stack_predicate(tag)
else:
if lc_name == target_name and state:
value = source._get_info_smart(tag, **kwargs)
for t in history:
t.clear()
history.clear()
yield value
elif state:
history.append(tag)
elif not state:
tag.clear()
I'll spend more time working out why the state dependency is behaving this way.
@mobiusklein - yep, the above works. Awesome, thanks.
There was a delay getting back to you because the tests on the project were broken and I wanted to see the tests passing before declaring it worked. They passed - it works!
Can close this now, unless you're keeping it open whilst you try to work out "why the state dependency is behaving this way"
Hi Joshua (@mobiusklein) - i think i've found a problem with the above code.
there's a test case here: https://github.com/colin-combe/pyteomics-test/blob/master/test_iterfind_when.py
you'll see it's reading this file: https://github.com/colin-combe/pyteomics-test/blob/master/multiple_spectra_per_id_1_3_0_draft.mzid
which you'll recognise as an example file from mzId 1.3, so the obvious thing to think it that it doesn't work because it's MzIdentML 1.3.0. But I've changed it (schema reference and delete the cv param for extension document) so I think it should be valid 1.2.0. I don't think this is the problem, but could be mistaken.
It crashes with error:
Processing SpectrumIdentificationList sil_HCD
Traceback (most recent call last):
File "/home/cc/work/pyteomics-test/test_iterfind_when.py", line 80, in <module>
test3()
File "/home/cc/work/pyteomics-test/test_iterfind_when.py", line 61, in test3
for e in iterfind_when(
File "/home/cc/work/pyteomics-test/test_iterfind_when.py", line 41, in iterfind_when
value = source._get_info_smart(tag, **kwargs)
File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/mzid.py", line 158, in _get_info_smart
return self._get_info(element,
File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/xml.py", line 435, in _get_info
self._get_info_smart(child, ename=cname, **kwargs))
File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/mzid.py", line 158, in _get_info_smart
return self._get_info(element,
File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/xml.py", line 424, in _get_info
cname = _local_name(child)
File "/home/cc/.local/share/virtualenvs/pyteomics-test-9DSZxrcY/lib/python3.10/site-packages/pyteomics/xml.py", line 53, in _local_name
if tag and tag[0] == '{':
TypeError: '_cython_3_0_10.cython_function_or_method' object is not subscriptable
any ideas? cheers, Colin
This is because lxml
treats comments as first class elements of the document, but they don't have the same interface as etree.Element
. The fix is easy, fortunately. In iterfind_when
replace
g = etree.iterparse(source, ("start", "end"))
with
g = etree.iterparse(source, ("start", "end"), remove_comments=True)
and it should work.
That does indeed fix it. Thanks, Colin
Hi, I'm using pyteomics' great mzIdentML parser to read crosslink data out of mzIdentML. There's one particular point in the code where memory use suddenly climbs:- https://github.com/PRIDE-Archive/xi-mzidentml-converter/blob/python3/parser/MzIdParser.py#L584 its when it executes the following (actually the second line):
why is it here the memory use climbs and is there a better way to iterate over SpectrumIdentificationLists?
Just wondering, thanks, Colin