rformassspectrometry / unimod

Amino acid modifications for mass spectrometry

How to store chemical elements and aminoacids? #1

Closed sgibb closed 6 years ago

sgibb commented 7 years ago

The unimod.xml file provides mass information about most chemical elements and amino acids. The unimod package is already able to parse this information into data.frames:

    name           avgMass   monoMass
H   Hydrogen      1.007940   1.007825
2H  Deuterium     2.014102   2.014102
Li  Lithium       6.941000   7.016003
C   Carbon       12.010700  12.000000
13C Carbon13     13.003355  13.003355
N   Nitrogen     14.006700  14.003074
15N Nitrogen15   15.000109  15.000109
O   Oxygen       15.999400  15.994915
18O Oxygen18     17.999160  17.999160
F   Fluorine     18.998403  18.998403
Na  Sodium       22.989770  22.989768
P   Phosphorous  30.973761  30.973762
S   Sulfur       32.065000  31.972071
Cl  Chlorine     35.453000  34.968853
K   Potassium    39.098300  38.963707
Ca  Calcium      40.078000  39.962591
Fe  Iron         55.845000  55.934939
Ni  Nickel       58.693400  57.935346
Zn  Zinc         65.409000  63.929145
Se  Selenium     78.960000  79.916520
Br  Bromine      79.904000  78.918336
Ag  Silver      107.868200 106.905092
Hg  Mercury     200.590000 201.970617
Au  Gold        196.966550 196.966543
I   Iodine      126.904470 126.904473
Mo  Molybdenum   95.940000  97.905407
Cu  Copper       63.546000  62.929599
e   electron      0.000549   0.000549
B   Boron        10.811000  11.009305
As  Arsenic      74.921594  74.921594
Cd  Cadmium     112.411000 113.903357
Cr  Chromium     51.996100  51.940510
Co  Cobalt       58.933195  58.933198
Mn  Manganese    54.938045  54.938047
Mg  Magnesium    24.305000  23.985042
Pd  Palladium   106.420000 105.903478

       threeLetter name            avgMass   monoMass  H  C  N  O  S Se
-                                   0.0000   0.000000  0  0  0  0  0  0
A      Ala         Alanine         71.0779  71.037114  5  3  1  1  0  0
R      Arg         Arginine       156.1857 156.101111 12  6  4  1  0  0
N      Asn         Asparagine     114.1026 114.042927  6  4  2  2  0  0
D      Asp         Aspartic acid  115.0874 115.026943  5  4  1  3  0  0
C      Cys         Cysteine       103.1429 103.009185  5  3  1  1  1  0
E      Glu         Glutamic acid  129.1140 129.042593  7  5  1  3  0  0
Q      Gln         Glutamine      128.1292 128.058578  8  5  2  2  0  0
G      Gly         Glycine         57.0513  57.021464  3  2  1  1  0  0
H      His         Histidine      137.1393 137.058912  7  6  3  1  0  0
I      Ile         Isoleucine     113.1576 113.084064 11  6  1  1  0  0
L      Leu         Leucine        113.1576 113.084064 11  6  1  1  0  0
K      Lys         Lysine         128.1723 128.094963 12  6  2  1  0  0
M      Met         Methionine     131.1961 131.040485  9  5  1  1  1  0
F      Phe         Phenylalanine  147.1739 147.068414  9  9  1  1  0  0
P      Pro         Proline         97.1152  97.052764  7  5  1  1  0  0
S      Ser         Serine          87.0773  87.032028  5  3  1  2  0  0
T      Thr         Threonine      101.1039 101.047679  7  4  1  2  0  0
W      Trp         Tryptophan     186.2099 186.079313 10 11  2  1  0  0
Y      Tyr         Tyrosine       163.1733 163.063329  9  9  1  2  0  0
V      Val         Valine          99.1311  99.068414  9  5  1  1  0  0
N-term N-term      N-term           1.0079   1.007825  1  0  0  0  0  0
C-term C-term      C-term          17.0073  17.002740  1  1  0  0  0  0
U      Sec         Selenocysteine 150.0379 150.953633  5  3  1  1  1  0

How could we store these data.frames in the package so that other packages, namely MSnbase and Pbase, can easily use them?

I could store them as .RData files so that they could be loaded with data(aminoacids). Any other suggestions?
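A minimal sketch of that idea, using only base R. The two-row `aminoacids` data.frame and the temporary directory are illustrative stand-ins: a real package would keep the file in its `data/` directory, which is where `data()` looks. The values are copied from the table above.

```r
# Illustrative subset of the amino acid table (values copied from above).
aminoacids <- data.frame(
  threeLetter = c("Ala", "Gly"),
  name = c("Alanine", "Glycine"),
  avgMass = c(71.0779, 57.0513),
  monoMass = c(71.037114, 57.021464),
  row.names = c("A", "G"),
  stringsAsFactors = FALSE
)

# Stand-in for the package's data/ directory (assumption for this sketch).
dataDir <- file.path(tempdir(), "data")
dir.create(dataDir, showWarnings = FALSE)
rdaFile <- file.path(dataDir, "aminoacids.RData")

# Serialise the object under its own name ...
save(aminoacids, file = rdaFile)

# ... and restore it into a fresh environment, which is roughly what
# data(aminoacids) would do for an installed package.
e <- new.env()
load(rdaFile, envir = e)
restored <- get("aminoacids", envir = e)
identical(restored, aminoacids)
```

Once such a file ships with an installed package, the object should be reachable with something like `data(aminoacids, package = "unimod")`, assuming the dataset is distributed under that name.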

lgatto commented 7 years ago

What will be the intended use of these data? If one use case is for users to manipulate them, I would suggest converting them to a tibble. I think that data(aminoacids) is a good way to distribute the data.

I think it would be good for MSnbase and Pbase to use the data from unimod. That will require unimod to be in Bioconductor first.
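To illustrate what such downstream use might look like, here is a hedged sketch that computes a peptide's monoisotopic mass from an `aminoacids`-style table. The function `peptideMonoMass` is hypothetical and not part of unimod, MSnbase or Pbase; the small table is rebuilt inline from the values listed above so that the example runs without the package.

```r
# Residue and terminus masses copied from the table above.
aminoacids <- data.frame(
  monoMass = c(71.037114, 103.009185, 128.094963, 1.007825, 17.002740),
  row.names = c("A", "C", "K", "N-term", "C-term")
)

# Hypothetical helper: sum the residue masses and add both termini.
peptideMonoMass <- function(sequence, aa = aminoacids) {
  residues <- strsplit(sequence, "")[[1]]
  # Fail early on letters that have no row in the table.
  unknown <- setdiff(residues, rownames(aa))
  if (length(unknown)) {
    stop("unknown residue(s): ", paste(unknown, collapse = ", "))
  }
  sum(aa[residues, "monoMass"]) + sum(aa[c("N-term", "C-term"), "monoMass"])
}

peptideMonoMass("ACK")
```

The N-term and C-term rows together contribute the mass of one water molecule, which is why they are added once per peptide rather than once per residue.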

lgatto commented 7 years ago

> That will require unimod to be in Bioconductor first.

Just saw the Bioconductor milestone now.

sgibb commented 7 years ago

IMHO there is no need to allow modification by the user. Information about chemical elements and amino acids is constant and rarely changes. Nevertheless, a data.frame that is accessible via data(elements) and data(aminoacids) would be OK. Is there any specific reason to use a tibble? Isn't it just a data.frame with a different constructor, different defaults, fancy printing and modified subset methods? Shouldn't we stay as base as possible?

lgatto commented 7 years ago

> Shouldn't we stay as base as possible?

If a lot of manipulation is expected and described in the vignette/manuals, then I think it's possibly worth using a tibble and the tidyverse. Otherwise yes, let's stick to data.frames; users can always convert them if they see fit.
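As a rough illustration of that last point, a user who prefers the tidyverse could do the conversion themselves. The sketch below assumes the tibble package is installed (the tibble part is skipped otherwise) and uses a two-row stand-in for the `aminoacids` table; the column name `oneLetter` is chosen here purely for illustration.

```r
# Two-row stand-in for the amino acid table (values copied from above).
aminoacids <- data.frame(
  threeLetter = c("Ala", "Gly"),
  monoMass = c(71.037114, 57.021464),
  row.names = c("A", "G"),
  stringsAsFactors = FALSE
)

# Base subsetting drops a single column to a plain numeric vector.
baseColumn <- aminoacids[, "monoMass"]
is.numeric(baseColumn)

if (requireNamespace("tibble", quietly = TRUE)) {
  # Tibbles have no row names, so keep them as a regular column.
  aaTibble <- tibble::as_tibble(aminoacids, rownames = "oneLetter")
  # The same subsetting on a tibble keeps the tibble structure.
  tibbleColumn <- aaTibble[, "monoMass"]
  is.data.frame(tibbleColumn)
}
```

The different result of `[, "monoMass"]` is one of the modified subset methods mentioned earlier in the thread, and it is the kind of difference a package would have to document if it shipped tibbles instead of data.frames.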