levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0
105 stars 34 forks source link

[proforma] Unresolvable modification results in non-descriptive error when accessing `.mass` property #68

Closed RalfG closed 2 years ago

RalfG commented 2 years ago

Hi @levitsky and @mobiusklein,

As the title says, in the proforma module, when trying to access the .mass property on an unresolvable modification, the following error is thrown:

parsed_sequence, modifiers = proforma.parse("D[TMT6plex]PEPTIDK/2")
modification = parsed_sequence[0][1][0]
modification.mass

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_767612/1789684910.py in <module>
      1 parsed_sequence, modifiers = proforma.parse("D[TMT6plex]PEPTIDK/2")
      2 modification = parsed_sequence[0][1][0]
----> 3 modification.mass

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyteomics/proforma.py in mass(self)
    687         -------float
    688         '''
--> 689         return self.definition['mass']
    690 
    691     @property

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyteomics/proforma.py in definition(self)
    677         '''
    678         if self._definition is None:
--> 679             self._definition = self.resolve()
    680         return self._definition
    681 

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyteomics/proforma.py in resolve(self)
    932         '''
    933         keys = self._parse_identifier()
--> 934         defn = self.resolver(*keys)
    935         if defn is not None:
    936             return defn

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyteomics/proforma.py in __call__(self, name, id, **kwargs)
    380 
    381     def __call__(self, name=None, id=None, **kwargs):
--> 382         return self.resolve(name, id, **kwargs)
    383 
    384     def __eq__(self, other):

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyteomics/proforma.py in resolve(self, name, id, **kwargs)
    617         for resolver in self.resolvers:
    618             try:
--> 619                 defn = resolver(name=name, id=id, **kwargs)
    620             except (KeyError):
    621                 continue

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyteomics/proforma.py in __call__(self, name, id, **kwargs)
    380 
    381     def __call__(self, name=None, id=None, **kwargs):
--> 382         return self.resolve(name, id, **kwargs)
    383 
    384     def __eq__(self, other):

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyteomics/proforma.py in resolve(self, name, id, **kwargs)
    449         else:
    450             raise ValueError("Must provide one of `name` or `id`")
--> 451         mass = float(defn.DiffMono)
    452         diff_formula = str(defn.DiffFormula.strip().replace(" ", ''))
    453         composition = Composition(diff_formula)

TypeError: float() argument must be a string or a number, not 'NoneType'

Adding the Unimod prefix U: fixes the issue, in this case:

parsed_sequence, modifiers = proforma.parse("D[U:TMT6plex]PEPTIDK/2")
modification = parsed_sequence[0][1][0]
modification.mass

>>> 229.162932

It is of course expected that the mass of an unresolvable modification cannot be retrieved, although I would propose to raise a more informative exception.

As a side question: How come that Oxidation is parsed without prefix, while TMT6plex is not?

Thanks! Ralf

mobiusklein commented 2 years ago

Two issues are appearing here. The first is that there's a missing break in the GenericResolver.resolve method. That's an easy fix.

I'll also add an explicit exception stating what went wrong

The second is that TMT6plex is found in both UNIMOD and PSI-MOD, and the entry in PSI-MOD is connected by synonym to many different entries, but the one that sticks is the reporter ion, which isn't really a modification but an ion series, unlike most other entries in the CV, so it didn't have the shape I was expecting.

Since GenericResolver.resolve didn't short-circuit, it kept trying to find a modification in other CVs. Since PSIMOD is checked after UNIMOD, but also contains a match for that name, but that match was problematic (see paragraph above).

Oxidation isn't found in any other CV, so it was safe from the lack-of-short-circuit problem before.

Fix is #69