levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

`spectrum_reference` should not be an int in `featureXML` #50

Closed hguturu closed 2 years ago

hguturu commented 2 years ago

https://github.com/levitsky/pyteomics/blob/34c87ac7198b7cff45cb46a4001345e87c6bb5a4/pyteomics/_schema_defaults.py#L272-L274

Based on the test cases at https://github.com/OpenMS/OpenMS/search?q=spectrum_reference, it looks like all instances of ('PeptideIdentification', 'spectrum_reference') and ('UnassignedPeptideIdentification', 'spectrum_reference') are strings, not ints. Oddly, when I comment these two lines out, the issue is not resolved. Perhaps due to a cache, or it's being caught elsewhere?

levitsky commented 2 years ago

Hi!

Two points:

  1. Search results you link to show multiple files, but they are mostly not featureXML, rather consensusXML. Those appear to have a different schema. I'm not sure the featurexml parser will parse them adequately. I can take this as a feature request to add a consensusXML parser, perhaps? (I'm not sure what the difference really is in terms of content as I haven't used these in practice.)
  2. The schema information is retrieved from the actual XML schema by default. The schema is accessed via the schema URL specified at the top of the actual XML file you are reading. What you edited are defaults, only used as a fallback if the schema specified in the file is not accessible. Since I generated those defaults from the most recent schema at the time, I suspect that the issue is still there in the schema.
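To check which schema a given file points to, one can read the schema-location attribute from its root element. A minimal sketch using only the standard library: it assumes the file declares its schema through `xsi:noNamespaceSchemaLocation` (or `xsi:schemaLocation`), and the helper name `schema_location` is hypothetical, not part of Pyteomics.

```python
import io
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def schema_location(source):
    """Return the schema location declared on the root element, or None."""
    # With the "start" event, iterparse yields the root element first,
    # so the rest of a large file does not have to be processed.
    for _, root in ET.iterparse(source, events=("start",)):
        for attr in ("noNamespaceSchemaLocation", "schemaLocation"):
            value = root.get("{%s}%s" % (XSI, attr))
            if value:
                return value
        return None
    return None


# Made-up minimal document for illustration; a real file would be
# passed as a path or an open binary file object instead.
sample = io.BytesIO(
    b'<featureMap version="1.6" '
    b'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    b'xsi:noNamespaceSchemaLocation="FeatureXML_1_6.xsd"/>'
)
print(schema_location(sample))
```

If the printed location is not reachable, the fallback defaults described in point 2 are what the parser would end up using.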

So, what file are you having a problem with? Is it a consensusXML or a featureXML file, and what is the schema URL?

hguturu commented 2 years ago

Pardon the sloppy query. I think this is more appropriate - https://github.com/OpenMS/OpenMS/search?q=extension%3AfeatureXML+spectrum_reference. This shows two formats for the spectrum_reference even for featureXML. The one I have is of the form spectrum_reference="controllerType=0 controllerNumber=1 scan=5057".

  1. I am having trouble using featureXML, not consensusXML.
  2. Got it, that makes sense. Looking at the schema at https://github.com/OpenMS/OpenMS/blob/develop/share/OpenMS/SCHEMAS/FeatureXML_1_6.xsd all versions of featureXML seem to be typed as xs:unsignedInt.
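The failure can be reproduced without any file. A small sketch, assuming the conversion boils down to calling `int` on the attribute string, as the traceback below suggests; `str_to_num` here is a simplified stand-in, not the Pyteomics method.

```python
def str_to_num(s, numtype):
    """Simplified stand-in: convert a non-empty string, else return None."""
    return numtype(s) if s else None


# An integer-style reference converts cleanly.
print(str_to_num("5057", int))

# The native-ID style value from my file does not.
value = "controllerType=0 controllerNumber=1 scan=5057"
try:
    str_to_num(value, int)
except ValueError as exc:
    print("conversion failed:", exc)

# Treated as a string, the same value passes through unchanged.
print(str_to_num(value, str))
```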

Here is my full error originating at https://github.com/levitsky/pyteomics/blob/master/pyteomics/xml.py#L456:

Traceback (most recent call last):
  File "/Users/hguturu/miniconda3/lib/python3.9/site-packages/pyteomics/xml.py", line 456, in _get_info
    info[k] = a(v)
  File "/Users/hguturu/miniconda3/lib/python3.9/site-packages/pyteomics/xml.py", line 158, in convert_from
    return cls.str_to_num(s, t)
  File "/Users/hguturu/miniconda3/lib/python3.9/site-packages/pyteomics/xml.py", line 153, in str_to_num
    return numtype(s) if s else None
  File "/Users/hguturu/miniconda3/lib/python3.9/site-packages/pyteomics/auxiliary/structures.py", line 269, in __new__
    inst = int.__new__(cls, value)
ValueError: invalid literal for int() with base 10: 'controllerType=0 controllerNumber=1 scan=12235'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/hguturu/export_openms_apexrt.py", line 58, in <module>
    main()
  File "/Users/hguturu/export_openms_apexrt.py", line 54, in main
    export_openms_apexrt(args.inputs, args.output)
  File "/Users/hguturu/export_openms_apexrt.py", line 35, in export_openms_apexrt
    for feature in pyteomics.openms.featurexml.read(open(input_fn, "rb")):
  File "/Users/hguturu/miniconda3/lib/python3.9/site-packages/pyteomics/auxiliary/file_helpers.py", line 176, in __next__
    return next(self._reader)
  File "/Users/hguturu/miniconda3/lib/python3.9/site-packages/pyteomics/xml.py", line 1261, in __next__
    return next(self._iterator)
  File "/Users/hguturu/miniconda3/lib/python3.9/site-packages/pyteomics/xml.py", line 586, in _iterfind_impl
    info = self._get_info_smart(child, **kwargs)
  File "/Users/hguturu/miniconda3/lib/python3.9/site-packages/pyteomics/openms/featurexml.py", line 55, in _get_info_smart
    info = self._get_info(element, **kw)
  File "/Users/hguturu/miniconda3/lib/python3.9/site-packages/pyteomics/xml.py", line 428, in _get_info
    self._get_info_smart(child, ename=cname, **kwargs))
  File "/Users/hguturu/miniconda3/lib/python3.9/site-packages/pyteomics/openms/featurexml.py", line 55, in _get_info_smart
    info = self._get_info(element, **kw)
  File "/Users/hguturu/miniconda3/lib/python3.9/site-packages/pyteomics/xml.py", line 461, in _get_info
    raise PyteomicsError(message)
pyteomics.auxiliary.structures.PyteomicsError: Pyteomics error, message: 'Error when converting types: ("invalid literal for int() with base 10: \'controllerType=0 controllerNumber=1 scan=12235\'",)'

levitsky commented 2 years ago

Thanks for the clarification. I can reproduce the error with the example files from your query. Looks like a workaround is needed for the incorrect type in the schema.

hguturu commented 2 years ago

Excellent. I also opened an issue upstream with OpenMS to see if they can update the schemas - https://github.com/OpenMS/OpenMS/issues/5478.

levitsky commented 2 years ago

By the way, as a temporary workaround, you should be able to get the parser to work by commenting out the lines in _schema_defaults and instantiating the parser with read_schema=False.

mobiusklein commented 2 years ago

You should be able to fix it at run time too: remove the offending keys from _featurexml_schema_defaults and pass read_schema=False when creating the FeatureXML object or calling featurexml.read.

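from pyteomics.openms import featurexml

# Drop the two keys so spectrum_reference is no longer converted to int.
featurexml.FeatureXML._default_schema['ints'].remove(
    ('PeptideIdentification', 'spectrum_reference'))
featurexml.FeatureXML._default_schema['ints'].remove(
    ('UnassignedPeptideIdentification', 'spectrum_reference'))

# path is your featureXML file; read_schema=False forces the edited defaults.
featurexml.read(path, read_schema=False)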

hguturu commented 2 years ago

The run time fix is great since that way I don't have to edit the source. I did find that I also needed to add the following since the default schema had the type wrong.

import pyteomics.openms.featurexml

# Move quality from the int keys to the float keys.
pyteomics.openms.featurexml.FeatureXML._default_schema["ints"].remove(
    ("quality", "quality")
)
pyteomics.openms.featurexml.FeatureXML._default_schema["floats"].add(
    ("quality", "quality")
)

Looks like quality is a double in https://github.com/OpenMS/OpenMS/blob/d9692da0d410c06b6cdc960f608a5c962360d09c/share/OpenMS/SCHEMAS/FeatureXML_1_6.xsd. My XML reading isn't great, but the min/maxOccurs attributes make me think it might even need to go under floatlists. That said, float worked for my test case.
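The key moves in this thread can be wrapped in a small helper. This is a sketch under stated assumptions: the schema is taken to be a dict mapping type names such as "ints" and "floats" to sets of (element, attribute) tuples, as the snippets above imply. Both `retype` and the stand-in `schema` dict are hypothetical, and nothing here touches Pyteomics itself.

```python
def retype(schema, key, src, dst=None):
    """Remove key from schema[src]; optionally add it to schema[dst]."""
    # discard() is a no-op when the key is absent, so the call is safe
    # to repeat or to run against a schema that is already fixed.
    schema.get(src, set()).discard(key)
    if dst is not None:
        schema.setdefault(dst, set()).add(key)


# Stand-in for the defaults discussed above (illustrative subset only).
schema = {
    "ints": {
        ("PeptideIdentification", "spectrum_reference"),
        ("UnassignedPeptideIdentification", "spectrum_reference"),
        ("quality", "quality"),
    },
    "floats": set(),
}

retype(schema, ("PeptideIdentification", "spectrum_reference"), "ints")
retype(schema, ("UnassignedPeptideIdentification", "spectrum_reference"), "ints")
retype(schema, ("quality", "quality"), "ints", "floats")
print(schema)
```

With Pyteomics, the same calls would presumably target `featurexml.FeatureXML._default_schema`, followed by `featurexml.read(path, read_schema=False)` as suggested above. Using `discard` instead of `remove` means the script should keep working once the defaults no longer contain these keys.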

levitsky commented 2 years ago

This was fixed in the upstream schema and, before that, worked around in #53. Closing now, feel free to follow up.