levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

Incorrect resolving of modification names with semicolon by UnimodModification #100

Closed caetera closed 1 year ago

caetera commented 1 year ago

Some modification names in Unimod contain a colon (:). When UnimodModification is created from such a name, the part before the colon is ignored during resolution, so the name is sometimes resolved to the wrong modification. The issue is likely similar to #98: the first part is treated as a prefix and therefore not used for matching.

Please see some examples below (Python 3.8.10 and pyteomics 4.5.6):

In [1]: from pyteomics.proforma import UnimodModification

In [2]: UnimodModification('UNIMOD:950').composition
Out[2]: Composition({'H': -1, 'Li': 1})

In [3]: UnimodModification('Cation:Li').composition
/home/vgor/.local/lib/python3.8/site-packages/pyteomics/proforma.py:342: UserWarning: Multiple matches found for 'Li' in Unimod, taking the first, 42.
  warnings.warn(
Out[3]: Composition({'H': 12, 'C': 8, 'O': 1, 'S': 2})

In [4]: UnimodModification('UNIMOD:530').composition
Out[4]: Composition({'H': -1, 'K': 1})

In [5]: UnimodModification('Cation:K').composition
/home/vgor/.local/lib/python3.8/site-packages/pyteomics/proforma.py:342: UserWarning: Multiple matches found for 'K' in Unimod, taking the first, 351.
  warnings.warn(
Out[5]: Composition({'C': -1, 'O': 1})

In [6]: UnimodModification('UNIMOD:530').name
Out[6]: 'Cation:K'

In [7]: UnimodModification('UNIMOD:950').name
Out[7]: 'Cation:Li'

mobiusklein commented 1 year ago

So Cation:Na works (what I tested against), but Cation:Li and Cation:K don't because of ambiguity in Unimod. ModificationBase._parse_identifier() is stripping Cation: off the names. I see a path to fixing this, but I'll not implement it at 10 PM because that didn't work so well for testing yesterday.
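
To illustrate the suspected mechanism, here is a minimal, self-contained sketch. It is not the pyteomics implementation: `naive_split`, `prefix_aware_split`, and `KNOWN_PREFIXES` are hypothetical names, and the prefix set is only an assumption based on the usual ProForma controlled-vocabulary prefixes. It shows how splitting on the first colon turns Cation:Li into the ambiguous lookup key Li, and how splitting only on recognised prefixes would keep the full name.

```python
# Hypothetical sketch; none of these names come from pyteomics.

# Assumed set of controlled-vocabulary prefixes (illustrative only).
KNOWN_PREFIXES = {"U", "UNIMOD", "M", "MOD", "R", "RESID", "X", "XLMOD", "G", "GNO"}


def naive_split(identifier):
    """Treat everything before the first colon as a prefix (the suspected behaviour)."""
    if ":" in identifier:
        prefix, rest = identifier.split(":", 1)
        return prefix, rest
    return None, identifier


def prefix_aware_split(identifier):
    """Split only when the text before the first colon is a recognised prefix."""
    if ":" in identifier:
        prefix, rest = identifier.split(":", 1)
        if prefix in KNOWN_PREFIXES:
            return prefix, rest
    return None, identifier


print(naive_split("Cation:Li"))          # ('Cation', 'Li')  -> lookup key is just 'Li'
print(prefix_aware_split("Cation:Li"))   # (None, 'Cation:Li') -> full name is kept
print(prefix_aware_split("UNIMOD:950"))  # ('UNIMOD', '950')
print(prefix_aware_split("U:Cation:K"))  # ('U', 'Cation:K')
```

Whether the actual fix takes this shape is not stated in this thread; the sketch only shows why the text before the colon must not be discarded unconditionally.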

caetera commented 1 year ago

Hi @mobiusklein, thank you for taking care of it. I just want to mention that quite a few other modifications in Unimod (i.e. not only Cation:XXX) have a colon in their name and are parsed incorrectly. I have looked through the complete Unimod (as provided by pyteomics); please see below.

Building a DataFrame with all modifications and their compositions, resolved both by modification ID and by modification name:

import pandas as pd
from pyteomics.proforma import Unimod, UnimodModification

unimod = Unimod()

def composition_to_string(composition):
    return ''.join(['{}{}'.format(element, composition[element]) for element in sorted(composition.keys())])

records = []

for modification in unimod.mods:
    try:
        mod = UnimodModification(f'UNIMOD:{modification["record_id"]}')
        composition_by_id = mod.composition
        composition_by_name = UnimodModification(mod.name).composition
        record = {'id': modification["record_id"],
                  'name': mod.name,
                  'composition_by_id': composition_to_string(composition_by_id),
                  'composition_by_name': composition_to_string(composition_by_name),
                  'parsed': True}
        # True when lookup by ID and lookup by name give the same composition
        record['match'] = composition_by_id == composition_by_name

        records.append(record)

    except Exception:
        records.append({'id': modification["record_id"],
                       'name': mod.name,
                       'parsed': False})

unimod_modifications = pd.DataFrame(records)
unimod_modifications['has_semicolon'] = unimod_modifications['name'].str.find(':') != -1
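
To make the string format used in the comparison concrete, here is the same helper applied to plain dicts. Using dicts as stand-ins for pyteomics Composition objects is an assumption made only so that the sketch runs without pyteomics; the element counts are the ones shown in this thread.

```python
def composition_to_string(composition):
    # Concatenate element symbols and counts in alphabetical order of the symbol
    return ''.join(f'{element}{composition[element]}'
                   for element in sorted(composition.keys()))


# Plain dicts standing in for Composition objects (assumption for illustration)
print(composition_to_string({'H': -1, 'K': 1}))     # H-1K1
print(composition_to_string({'O': 1, 'C': -1}))     # C-1O1
print(composition_to_string({'C': 2, 'H[2]': 4}))   # C2H[2]4
```

Because the keys are sorted, two compositions with the same elements and counts always produce the same string, which makes the two columns easy to compare by eye.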

Some modifications threw an exception during processing, but since all of them are Unknown:NNN, I don't think these are crucial. I am actually surprised these are in Unimod. IMO, in the wild these should be represented as mass modifications.

unimod_modifications.groupby(['has_semicolon', 'parsed'])['id'].count()

has_semicolon  parsed
False          True      1298
True           False        7
               True       208
Name: id, dtype: int64

unimod_modifications.query('parsed == False')['name'].head(10)

1464    Unknown:177
1465    Unknown:210
1466    Unknown:216
1467    Unknown:234
1468    Unknown:248
1469    Unknown:250
1471    Unknown:306
Name: name, dtype: object

There are, however, a number of other modifications that are resolved incorrectly:

unimod_modifications.groupby(['has_semicolon', 'match'])['id'].count()

has_semicolon  match
False          True     1298
True           False      52
               True      156
Name: id, dtype: int64

unimod_modifications.query('match == False').head(10)

id   name                composition_by_id  composition_by_name  parsed  match  has_semicolon
12   ICAT-D:2H(8)        C20H26H[2]8N4O5S1  C22H30H[2]8N4O6S1    True    False  True
61   GIST-Quat:2H(3)     C7H10H[2]3N1O1     C2H-1H[2]3O1         True    False  True
95   IMID:2H(4)          C3H[2]4N2          C4H[2]4O3            True    False  True
97   Propionamide:2H(3)  C3H2H[2]3N1O1      C2H-1H[2]3O1         True    False  True
530  Cation:K            H-1K1              C-1O1                True    False  True
171  NBS:13C(6)          C[13]6H3N1O2S1     C9C[13]6Cl1H20N1O6   True    False  True
184  Label:13C(9)        C-9C[13]9          C1C[13]9H17N3O3      True    False  True
188  Label:13C(6)        C-6C[13]6          C9C[13]6Cl1H20N1O6   True    False  True
196  QAT:2H(3)           C9H16H[2]3N2O1     C2H-1H[2]3O1         True    False  True
199  Dimethyl:2H(4)      C2H[2]4            C4H[2]4O3            True    False  True
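
As a quick sanity check on the ten rows above, the sketch below groups them by the part of the name after the colon and collects the compositions obtained by name. It uses only the values from the table. If every group contains a single composition, that is consistent with (though not proof of) the lookup using only the part after the colon.

```python
from collections import defaultdict

# (name, composition_by_name) pairs copied from the ten rows of the table
rows = [
    ("ICAT-D:2H(8)", "C22H30H[2]8N4O6S1"),
    ("GIST-Quat:2H(3)", "C2H-1H[2]3O1"),
    ("IMID:2H(4)", "C4H[2]4O3"),
    ("Propionamide:2H(3)", "C2H-1H[2]3O1"),
    ("Cation:K", "C-1O1"),
    ("NBS:13C(6)", "C9C[13]6Cl1H20N1O6"),
    ("Label:13C(9)", "C1C[13]9H17N3O3"),
    ("Label:13C(6)", "C9C[13]6Cl1H20N1O6"),
    ("QAT:2H(3)", "C2H-1H[2]3O1"),
    ("Dimethyl:2H(4)", "C4H[2]4O3"),
]

# Group the name-resolved compositions by the text after the first colon
by_suffix = defaultdict(set)
for name, composition_by_name in rows:
    suffix = name.split(":", 1)[1]
    by_suffix[suffix].add(composition_by_name)

for suffix, compositions in sorted(by_suffix.items()):
    print(suffix, sorted(compositions))

# True if names sharing a suffix always resolved to the same composition
print(all(len(compositions) == 1 for compositions in by_suffix.values()))
```

For these rows, names that share a suffix (for example GIST-Quat:2H(3), Propionamide:2H(3) and QAT:2H(3)) all resolve by name to one and the same composition.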

The complete table can be downloaded from https://syddanskuni-my.sharepoint.com/:x:/g/personal/vgor_bmb_sdu_dk/ERnNZxpJPYRFgwN76NWnxeEBUE0nGmUNshSQg7XP5dlYqA?e=HPrii8