Closed: caetera closed this issue 1 year ago
So `Cation:Na` works (what I tested against), but `Cation:Li` and `Cation:K` don't because of ambiguity in Unimod. `ModificationBase._parse_identifier()` is stripping `Cation:` off the names. I see a path to fixing this, but I'll not implement it at 10 PM because that didn't work so well for testing yesterday.
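One possible direction, sketched here only as an illustration and not as the actual pyteomics fix: strip a leading `something:` only when it is a recognised controlled-vocabulary prefix, and otherwise treat the colon as part of the name. The function name `split_known_prefix` and the prefix table are hypothetical; the prefixes are the ones I understand the ProForma notation to use, which is an assumption to verify against the specification.

```python
# Hypothetical sketch; NOT the pyteomics implementation.
# Assumed controlled-vocabulary prefixes (check against the ProForma spec).
KNOWN_PREFIXES = {
    "U": "unimod", "UNIMOD": "unimod",
    "M": "psi-mod", "MOD": "psi-mod",
    "R": "resid", "RESID": "resid",
    "X": "xlmod", "XLMOD": "xlmod",
    "G": "gno", "GNO": "gno",
}


def split_known_prefix(identifier):
    """Return (vocabulary, rest), stripping only a recognised prefix."""
    head, sep, tail = identifier.partition(":")
    if sep and head in KNOWN_PREFIXES:
        return KNOWN_PREFIXES[head], tail
    # No recognised prefix: the colon belongs to the name itself.
    return None, identifier


print(split_known_prefix("Cation:K"))    # name kept whole
print(split_known_prefix("U:Cation:K"))  # only the "U:" prefix is removed
print(split_known_prefix("UNIMOD:530"))  # accession form
```

With this kind of check, `Cation:K` would reach the name lookup intact instead of being reduced to `K`.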
Hi @mobiusklein,
Thank you for taking care of it. I just want to mention that quite a few other modifications in Unimod (i.e. not only `Cation:XXX`) have a colon in the name and are parsed incorrectly. I have looked through the complete Unimod (as provided by `pyteomics`); please see below.
Building a DataFrame with all modifications and their compositions, resolved by modification `id` or by modification `name`:
```python
import pandas as pd
from pyteomics.proforma import Unimod, UnimodModification

unimod = Unimod()


def composition_to_string(composition):
    # Serialize a composition as element/count pairs in sorted element order
    return ''.join(['{}{}'.format(element, composition[element]) for element in sorted(composition.keys())])


records = []
for modification in unimod.mods:
    try:
        # Resolve the same modification first by id, then by its name
        mod = UnimodModification(f'UNIMOD:{modification["record_id"]}')
        composition_by_id = mod.composition
        composition_by_name = UnimodModification(mod.name).composition
        record = {'id': modification["record_id"],
                  'name': mod.name,
                  'composition_by_id': composition_to_string(composition_by_id),
                  'composition_by_name': composition_to_string(composition_by_name),
                  'parsed': True}
        if composition_by_id != composition_by_name:
            record['match'] = False
        else:
            record['match'] = True
        records.append(record)
    except Exception:
        # Resolution failed; keep the record but flag it as not parsed
        records.append({'id': modification["record_id"],
                        'name': mod.name,
                        'parsed': False})

unimod_modifications = pd.DataFrame(records)
# Flag names that contain a colon
unimod_modifications['has_semicolon'] = unimod_modifications['name'].str.find(':') != -1
```
Some modifications threw an exception during processing, but since all of them are `Unknown:NNN`, I don't think they are crucial. I am actually surprised these are in Unimod. IMO, they should be represented as mass modifications in the wild.
```python
unimod_modifications.groupby(['has_semicolon', 'parsed'])['id'].count()
```

```
has_semicolon  parsed
False          True      1298
True           False        7
               True       208
Name: id, dtype: int64
```
```python
unimod_modifications.query('parsed == False')['name'].head(10)
```

```
1464    Unknown:177
1465    Unknown:210
1466    Unknown:216
1467    Unknown:234
1468    Unknown:248
1469    Unknown:250
1471    Unknown:306
Name: name, dtype: object
```
There are, however, a number of other modifications that are parsed without error but resolved incorrectly:
```python
unimod_modifications.groupby(['has_semicolon', 'match'])['id'].count()
```

```
has_semicolon  match
False          True     1298
True           False      52
               True      156
Name: id, dtype: int64
```
```python
unimod_modifications.query('match == False').head(10)
```

id | name | composition_by_id | composition_by_name | parsed | match | has_semicolon |
---|---|---|---|---|---|---|
12 | ICAT-D:2H(8) | C20H26H[2]8N4O5S1 | C22H30H[2]8N4O6S1 | True | False | True |
61 | GIST-Quat:2H(3) | C7H10H[2]3N1O1 | C2H-1H[2]3O1 | True | False | True |
95 | IMID:2H(4) | C3H[2]4N2 | C4H[2]4O3 | True | False | True |
97 | Propionamide:2H(3) | C3H2H[2]3N1O1 | C2H-1H[2]3O1 | True | False | True |
530 | Cation:K | H-1K1 | C-1O1 | True | False | True |
171 | NBS:13C(6) | C[13]6H3N1O2S1 | C9C[13]6Cl1H20N1O6 | True | False | True |
184 | Label:13C(9) | C-9C[13]9 | C1C[13]9H17N3O3 | True | False | True |
188 | Label:13C(6) | C-6C[13]6 | C9C[13]6Cl1H20N1O6 | True | False | True |
196 | QAT:2H(3) | C9H16H[2]3N2O1 | C2H-1H[2]3O1 | True | False | True |
199 | Dimethyl:2H(4) | C2H[2]4 | C4H[2]4O3 | True | False | True |
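
The table shows several different names ending up with the same `composition_by_name`. A small standard-library sketch makes the likely reason visible, under the assumption that everything up to the first colon is dropped before the name lookup (this mimics the reported behaviour; it does not call pyteomics). The names are copied from the rows above.

```python
from collections import defaultdict

# Names copied from the mismatch table above.
names = [
    "ICAT-D:2H(8)", "GIST-Quat:2H(3)", "IMID:2H(4)", "Propionamide:2H(3)",
    "Cation:K", "NBS:13C(6)", "Label:13C(9)", "Label:13C(6)",
    "QAT:2H(3)", "Dimethyl:2H(4)",
]

by_suffix = defaultdict(list)
for name in names:
    # Assumed behaviour: the part before the first colon is discarded.
    suffix = name.partition(":")[2]
    by_suffix[suffix].append(name)

# Suffixes shared by more than one name cannot be told apart after stripping.
collisions = {s: group for s, group in by_suffix.items() if len(group) > 1}
for suffix, group in collisions.items():
    print(suffix, "<-", group)
```

The colliding groups line up with the rows that share an identical `composition_by_name` in the table, e.g. `NBS:13C(6)` and `Label:13C(6)`.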
The complete table can be downloaded from https://syddanskuni-my.sharepoint.com/:x:/g/personal/vgor_bmb_sdu_dk/ERnNZxpJPYRFgwN76NWnxeEBUE0nGmUNshSQg7XP5dlYqA?e=HPrii8
Some of the modifications in Unimod contain a colon in their name. When `UnimodModification` is created with the modification name, the part before the colon is ignored during modification resolution, so the name is sometimes resolved to the wrong modification. The issue is likely similar to #98, i.e. the first part is treated as a prefix and thus not used for matching. Please see some examples below (Python 3.8.10 & pyteomics 4.5.6).