levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

xpath in mzIdentML reader behaves differently from other xpath implementations? #146

Closed colin-combe closed 6 months ago

colin-combe commented 7 months ago

Hi,

XPath in the pyteomics mzIdentML reader seems to behave differently from other XPath implementations.

I made a small example project comparing XPath behaviour in the pyteomics mzIdentML reader with a Perl xpath implementation: https://github.com/colin-combe/pyteomics-test.

There are text files in the project showing the results I'm getting, but basically, for Perl xpath:

xpath -e '//SpectrumIdentificationList[@id="SIL_1572215611447534775"]/*' test.mzid
Found 2 nodes in test.mzid:

xpath -e '//SpectrumIdentificationList[@id="SIL_1572215611447534775"]/SpectrumIdentificationResult' test.mzid
Found 1 nodes in test.mzid:

When using XPath in the pyteomics mzIdentML reader with the same file (see https://github.com/colin-combe/pyteomics-test/blob/master/test_xpath.py), the first selector above returns 2 nodes, but the second returns zero.

best wishes, Colin

levitsky commented 7 months ago

Hi, thanks for isolating this issue. The iterfind method performs a somewhat naive preprocessing of the XPath which aims to help with element namespaces:

path : str
    Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or "free". Please don't specify namespaces.

The problem arises when there is an element after a predicate, like in the second test.
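
To make the failure mode concrete, here is a minimal sketch of the kind of rewriting the docstring describes. It is not the pyteomics source: the function names `localize_until_predicate` and `localize_all_steps` are hypothetical, and the second function only illustrates one possible way to keep rewriting element names after a predicate. It does not handle axes (`child::`) or function calls such as `text()`.

```python
import re

# Hypothetical helpers for illustration only; not taken from pyteomics.
NAME = re.compile(r'[A-Za-z_][\w.-]*')
CHECK = '*[local-name()="{}"]'


def localize_until_predicate(path):
    """Rewrite element names into local-name() checks, but only in the
    part of the path before the first '[' (the naive behaviour)."""
    head, sep, tail = path.partition('[')
    head = re.sub(r'(?:(?<=/)|^)[A-Za-z_][\w.-]*',
                  lambda m: CHECK.format(m.group()), head)
    return head + sep + tail


def localize_all_steps(path):
    """Rewrite every element step, leaving predicate contents and
    quoted strings untouched."""
    out, depth, quote, i = [], 0, None, 0
    while i < len(path):
        ch = path[i]
        if quote:                       # inside a quoted literal
            if ch == quote:
                quote = None
        elif ch in '"\'':               # a quoted literal starts
            quote = ch
        elif ch == '[':                 # entering a predicate
            depth += 1
        elif ch == ']':                 # leaving a predicate
            depth -= 1
        elif (depth == 0 and (ch.isalpha() or ch == '_')
              and (i == 0 or path[i - 1] == '/')):
            # An element name at the start of a step, outside predicates.
            m = NAME.match(path, i)
            out.append(CHECK.format(m.group()))
            i = m.end()
            continue
        out.append(ch)
        i += 1
    return ''.join(out)


query = ('//SpectrumIdentificationList[@id="SIL_1572215611447534775"]'
         '/SpectrumIdentificationResult')
naive = localize_until_predicate(query)
full = localize_all_steps(query)
print(naive)
print(full)
```

In the first result the trailing `SpectrumIdentificationResult` step is left as a bare name, so it cannot match a namespaced element; in the second it becomes a `local-name()` check as well.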

We should definitely look again at the path preprocessing implementation here. I'm pretty sure the "first predicate" limitation can be avoided, but perhaps a more sound approach is needed, like auto-detection of namespaces. I stopped using namespaces and switched to local-name() checks in 2012, and I don't remember why. We also have an xpath() function that inserts namespaces into a query and we don't use it here.
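
As a rough picture of what "auto-detection of namespaces" could look like, the sketch below uses only the standard library's `xml.etree.ElementTree`, not pyteomics or lxml. The sample document, the `m` prefix and the `detect_namespace` helper are made up for illustration, and the approach assumes the whole file uses the root element's namespace.

```python
import xml.etree.ElementTree as ET

# A tiny made-up document with a default namespace, as mzIdentML files have.
SAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection><AnalysisData>
    <SpectrumIdentificationList id="SIL_1">
      <FragmentationTable/>
      <SpectrumIdentificationResult id="SIR_1"/>
    </SpectrumIdentificationList>
  </AnalysisData></DataCollection>
</MzIdentML>"""


def detect_namespace(root):
    """Return the namespace URI of the root element, or '' if it has none.
    ElementTree stores namespaced tags as '{uri}localname'."""
    tag = root.tag
    return tag[1:].partition('}')[0] if tag.startswith('{') else ''


root = ET.fromstring(SAMPLE)
ns = {'m': detect_namespace(root)}  # 'm' is an arbitrary prefix

# Bare names do not match because every element is in the default namespace.
plain = root.findall(
    './/SpectrumIdentificationList[@id="SIL_1"]/SpectrumIdentificationResult')

# Namespace-qualified names match, including the step after the predicate.
qualified = root.findall(
    './/m:SpectrumIdentificationList[@id="SIL_1"]/m:SpectrumIdentificationResult',
    ns)

# The wildcard step matches both children regardless of namespace.
children = root.findall('.//m:SpectrumIdentificationList[@id="SIL_1"]/*', ns)

print(len(plain), len(qualified), len(children))
```

The trade-off against `local-name()` checks is that the namespace has to be known before the query is built, which means looking at the root element first.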

As a workaround, the following should work with the current implementation:

'//SpectrumIdentificationList[@id="SIL_1572215611447534775"]/*[local-name()="SpectrumIdentificationResult"]'

colin-combe commented 7 months ago

thanks

colin-combe commented 6 months ago

see also https://github.com/levitsky/pyteomics/issues/145