Incorrect resolving of modification names with semicolon by UnimodModification

caetera commented 1 year ago

Some of the modifications in Unimod contain a colon (':') in their name. When UnimodModification is created with such a modification name, the part before the colon is ignored during resolution, so the name sometimes resolves to the wrong modification. The issue is likely similar to #98, i.e. the first part is treated as a prefix and therefore not used for matching.

Please see some examples below (Python 3.8.10 and pyteomics 4.5.6):

In [1]: from pyteomics.proforma import UnimodModification

In [2]: UnimodModification('UNIMOD:950').composition
Out[2]: Composition({'H': -1, 'Li': 1})

In [3]: UnimodModification('Cation:Li').composition
/home/vgor/.local/lib/python3.8/site-packages/pyteomics/proforma.py:342: UserWarning: Multiple matches found for 'Li' in Unimod, taking the first, 42.
  warnings.warn(
Out[3]: Composition({'H': 12, 'C': 8, 'O': 1, 'S': 2})

In [4]: UnimodModification('UNIMOD:530').composition
Out[4]: Composition({'H': -1, 'K': 1})

In [5]: UnimodModification('Cation:K').composition
/home/vgor/.local/lib/python3.8/site-packages/pyteomics/proforma.py:342: UserWarning: Multiple matches found for 'K' in Unimod, taking the first, 351.
  warnings.warn(
Out[5]: Composition({'C': -1, 'O': 1})

In [6]: UnimodModification('UNIMOD:530').name
Out[6]: 'Cation:K'

In [7]: UnimodModification('UNIMOD:950').name
Out[7]: 'Cation:Li'

mobiusklein commented 1 year ago

So Cation:Na works (what I tested against), but Cation:Li and Cation:K don't because of ambiguity in Unimod. ModificationBase._parse_identifier() is stripping Cation: off the names, so only Li or K is looked up. I see a path to fixing this, but I won't implement it at 10 PM, because that didn't work so well for testing yesterday.
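To make the failure mode concrete, here is a minimal toy sketch. It is not the pyteomics code: the lookup table, the substring matching rule, and the function names are all assumptions made for illustration, and the names given for records 42 and 351 are placeholders. Only the IDs and the two Cation names come from the report above.

```python
import warnings

# Toy index of (id, name) pairs. 530 and 950 are taken from the report above;
# the names attached to 42 and 351 are illustrative placeholders.
TOY_UNIMOD = [
    (42, "Lipoyl"),
    (351, "Trp->Kynurenin"),
    (530, "Cation:K"),
    (950, "Cation:Li"),
]


def fuzzy_lookup(query):
    """Return the IDs whose name contains `query` (an assumed matching rule)."""
    return [mod_id for mod_id, name in TOY_UNIMOD if query in name]


def resolve_stripping_prefix(name):
    """Mimic the reported behaviour: drop everything up to the first ':'."""
    query = name.split(":", 1)[1] if ":" in name else name
    matches = fuzzy_lookup(query)
    if not matches:
        raise KeyError(name)
    if len(matches) > 1:
        warnings.warn(
            f"Multiple matches found for {query!r}, taking the first, {matches[0]}."
        )
    return matches[0]


def resolve_full_name_first(name):
    """One possible fix: try an exact match on the whole name before stripping."""
    for mod_id, mod_name in TOY_UNIMOD:
        if mod_name == name:
            return mod_id
    return resolve_stripping_prefix(name)


print(resolve_stripping_prefix("Cation:Li"))  # ambiguous 'Li', picks the lowest ID
print(resolve_full_name_first("Cation:Li"))   # exact match on the full name
```

In this toy model, matching the full name first removes the ambiguity without changing how genuinely prefixed identifiers would be handled; whether the real fix takes this shape is up to the implementation.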

caetera commented 1 year ago

Hi @mobiusklein, thank you for taking care of it. I just want to mention that quite a few other modifications in Unimod (i.e. not only Cation:XXX) have a colon in the name and are resolved incorrectly. I have looked through the complete Unimod (as provided by pyteomics); please see below.

Building a DataFrame of all modifications, with the composition resolved both by modification ID and by modification name:

import pandas as pd
from pyteomics.proforma import Unimod, UnimodModification

unimod = Unimod()


def composition_to_string(composition):
    # Render a composition as a compact string with elements in sorted order.
    return ''.join(f'{element}{composition[element]}' for element in sorted(composition.keys()))


records = []

for modification in unimod.mods:
    record_id = modification["record_id"]
    name = None  # stays None if the lookup by ID itself fails
    try:
        # Resolve by accession first, then resolve again by the reported name.
        mod = UnimodModification(f'UNIMOD:{record_id}')
        name = mod.name
        composition_by_id = mod.composition
        composition_by_name = UnimodModification(name).composition
        records.append({'id': record_id,
                        'name': name,
                        'composition_by_id': composition_to_string(composition_by_id),
                        'composition_by_name': composition_to_string(composition_by_name),
                        'parsed': True,
                        'match': composition_by_id == composition_by_name})
    except Exception:
        # Resolution failed somewhere; keep the record without compositions.
        records.append({'id': record_id,
                        'name': name,
                        'parsed': False})

unimod_modifications = pd.DataFrame(records)
unimod_modifications['has_semicolon'] = unimod_modifications['name'].str.find(':') != -1

Some modifications threw an exception during processing, but since all of them are Unknown:NNN, I don't think they are crucial. I am actually surprised these are in Unimod; IMO, they should be represented as mass modifications in the wild.

unimod_modifications.groupby(['has_semicolon', 'parsed'])['id'].count()

has_semicolon  parsed
False          True      1298
True           False        7
               True       208
Name: id, dtype: int64

unimod_modifications.query('parsed == False')['name'].head(10)

1464    Unknown:177
1465    Unknown:210
1466    Unknown:216
1467    Unknown:234
1468    Unknown:248
1469    Unknown:250
1471    Unknown:306
Name: name, dtype: object

There are, however, a number of other modifications that are resolved incorrectly:

unimod_modifications.groupby(['has_semicolon', 'match'])['id'].count()

has_semicolon  match
False          True     1298
True           False      52
               True      156
Name: id, dtype: int64

unimod_modifications.query('match == False').head(10)

id	name	composition_by_id	composition_by_name	parsed	match	has_semicolon
12	ICAT-D:2H(8)	C20H26H[2]8N4O5S1	C22H30H[2]8N4O6S1	True	False	True
61	GIST-Quat:2H(3)	C7H10H[2]3N1O1	C2H-1H[2]3O1	True	False	True
95	IMID:2H(4)	C3H[2]4N2	C4H[2]4O3	True	False	True
97	Propionamide:2H(3)	C3H2H[2]3N1O1	C2H-1H[2]3O1	True	False	True
530	Cation:K	H-1K1	C-1O1	True	False	True
171	NBS:13C(6)	C[13]6H3N1O2S1	C9C[13]6Cl1H20N1O6	True	False	True
184	Label:13C(9)	C-9C[13]9	C1C[13]9H17N3O3	True	False	True
188	Label:13C(6)	C-6C[13]6	C9C[13]6Cl1H20N1O6	True	False	True
196	QAT:2H(3)	C9H16H[2]3N2O1	C2H-1H[2]3O1	True	False	True
199	Dimethyl:2H(4)	C2H[2]4	C4H[2]4O3	True	False	True
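The ten rows above look consistent with the prefix being dropped: names that share the part after the colon resolve by name to the same composition. A small sketch that checks this on the rows shown (the data are copied from the table above; the grouping helper is only an illustration, and the remaining mismatches in the full table are not checked here):

```python
from collections import defaultdict

# (name, composition_by_name) pairs copied from the ten rows shown above.
ROWS = [
    ("ICAT-D:2H(8)", "C22H30H[2]8N4O6S1"),
    ("GIST-Quat:2H(3)", "C2H-1H[2]3O1"),
    ("IMID:2H(4)", "C4H[2]4O3"),
    ("Propionamide:2H(3)", "C2H-1H[2]3O1"),
    ("Cation:K", "C-1O1"),
    ("NBS:13C(6)", "C9C[13]6Cl1H20N1O6"),
    ("Label:13C(9)", "C1C[13]9H17N3O3"),
    ("Label:13C(6)", "C9C[13]6Cl1H20N1O6"),
    ("QAT:2H(3)", "C2H-1H[2]3O1"),
    ("Dimethyl:2H(4)", "C4H[2]4O3"),
]

# Group the name-resolved compositions by the part of the name after the colon.
by_suffix = defaultdict(set)
for name, composition_by_name in ROWS:
    suffix = name.split(":", 1)[1]
    by_suffix[suffix].add(composition_by_name)

for suffix, compositions in sorted(by_suffix.items()):
    print(f"{suffix:8s} -> {sorted(compositions)}")

# True if every suffix maps to exactly one name-resolved composition.
print(all(len(compositions) == 1 for compositions in by_suffix.values()))
```

For example, GIST-Quat:2H(3), Propionamide:2H(3) and QAT:2H(3) all end up with the same by-name composition, which is what one would expect if only 2H(3) were used for the lookup.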

The complete table can be downloaded from https://syddanskuni-my.sharepoint.com/:x:/g/personal/vgor_bmb_sdu_dk/ERnNZxpJPYRFgwN76NWnxeEBUE0nGmUNshSQg7XP5dlYqA?e=HPrii8

levitsky / pyteomics

Incorrect resolving of modification names with semicolon by UnimodModification #100