How to store chemical elements and amino acids?

sgibb commented 7 years ago

The unimod.xml file provides mass information about most chemical elements and amino acids. The unimod package is already able to parse this information into data.frames:

	name	avgMass	monoMass
H	Hydrogen	1.007940	1.007825
2H	Deuterium	2.014102	2.014102
Li	Lithium	6.941000	7.016003
C	Carbon	12.010700	12.000000
13C	Carbon13	13.003355	13.003355
N	Nitrogen	14.006700	14.003074
15N	Nitrogen15	15.000109	15.000109
O	Oxygen	15.999400	15.994915
18O	Oxygen18	17.999160	17.999160
F	Fluorine	18.998403	18.998403
Na	Sodium	22.989770	22.989768
P	Phosphorous	30.973761	30.973762
S	Sulfur	32.065000	31.972071
Cl	Chlorine	35.453000	34.968853
K	Potassium	39.098300	38.963707
Ca	Calcium	40.078000	39.962591
Fe	Iron	55.845000	55.934939
Ni	Nickel	58.693400	57.935346
Zn	Zinc	65.409000	63.929145
Se	Selenium	78.960000	79.916520
Br	Bromine	79.904000	78.918336
Ag	Silver	107.868200	106.905092
Hg	Mercury	200.590000	201.970617
Au	Gold	196.966550	196.966543
I	Iodine	126.904470	126.904473
Mo	Molybdenum	95.940000	97.905407
Cu	Copper	63.546000	62.929599
e	electron	0.000549	0.000549
B	Boron	10.811000	11.009305
As	Arsenic	74.921594	74.921594
Cd	Cadmium	112.411000	113.903357
Cr	Chromium	51.996100	51.940510
Co	Cobalt	58.933195	58.933198
Mn	Manganese	54.938045	54.938047
Mg	Magnesium	24.305000	23.985042
Pd	Palladium	106.420000	105.903478

	threeLetter	name	avgMass	monoMass	H	C	N	O	S
-			0.0000	0.000000	0	0	0	0	0
A	Ala	Alanine	71.0779	71.037114	5	3	1	1	0
R	Arg	Arginine	156.1857	156.101111	12	6	4	1	0
N	Asn	Asparagine	114.1026	114.042927	6	4	2	2	0
D	Asp	Aspartic acid	115.0874	115.026943	5	4	1	3	0
C	Cys	Cysteine	103.1429	103.009185	5	3	1	1	1
E	Glu	Glutamic acid	129.1140	129.042593	7	5	1	3	0
Q	Gln	Glutamine	128.1292	128.058578	8	5	2	2	0
G	Gly	Glycine	57.0513	57.021464	3	2	1	1	0
H	His	Histidine	137.1393	137.058912	7	6	3	1	0
I	Ile	Isoleucine	113.1576	113.084064	11	6	1	1	0
L	Leu	Leucine	113.1576	113.084064	11	6	1	1	0
K	Lys	Lysine	128.1723	128.094963	12	6	2	1	0
M	Met	Methionine	131.1961	131.040485	9	5	1	1	1
F	Phe	Phenylalanine	147.1739	147.068414	9	9	1	1	0
P	Pro	Proline	97.1152	97.052764	7	5	1	1	0
S	Ser	Serine	87.0773	87.032028	5	3	1	2	0
T	Thr	Threonine	101.1039	101.047679	7	4	1	2	0
W	Trp	Tryptophan	186.2099	186.079313	10	11	2	1	0
Y	Tyr	Tyrosine	163.1733	163.063329	9	9	1	2	0
V	Val	Valine	99.1311	99.068414	9	5	1	1	0
N-term	N-term	N-term	1.0079	1.007825	1	0	0	0	0
C-term	C-term	C-term	17.0073	17.002740	1	1	0	0	0
U	Sec	Selenocysteine	150.0379	150.953633	5	3	1	1	1

How could we store these data.frames in the package so that they can easily be used by other packages, namely MSnbase and Pbase?

MSnbase: the environments in https://github.com/lgatto/MSnbase/blob/master/R/environment.R could be replaced.
Pbase: uses the environments provided by MSnbase. Pbase could become completely independent of MSnbase (and depend on unimod instead).

I could store them as .RData files so that they can be loaded via data(aminoacids). Any other suggestions?
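
For illustration, a minimal sketch of the .RData approach. It assumes a three-row subset of the amino acid table above, and a temporary directory (`pkgDir`, a hypothetical name) standing in for the package source tree; it is not the actual layout of the unimod package.

```r
# Illustrative subset of the amino acid table above; one-letter codes are row names.
aminoacids <- data.frame(
    threeLetter = c("Ala", "Cys", "Glu"),
    name = c("Alanine", "Cysteine", "Glutamic acid"),
    avgMass = c(71.0779, 103.1429, 129.1140),
    monoMass = c(71.037114, 103.009185, 129.042593),
    row.names = c("A", "C", "E"),
    stringsAsFactors = FALSE
)

# Hypothetical stand-in for the package source directory.
pkgDir <- file.path(tempdir(), "unimod-sketch")
dir.create(file.path(pkgDir, "data"), recursive = TRUE, showWarnings = FALSE)
rdataFile <- file.path(pkgDir, "data", "aminoacids.RData")

# A file like this in data/ is what data(aminoacids) would read
# once the package is installed.
save(aminoacids, file = rdataFile)

# Simulate loading the object into a fresh environment.
env <- new.env()
load(rdataFile, envir = env)
all.equal(env$aminoacids, aminoacids)
```

The object name inside the .RData file is what users end up with after data(aminoacids), so the file name and the object name should match.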

lgatto commented 7 years ago

What will be the intended use of these data? If one use case is for users to manipulate them, I would suggest converting them to a tibble. I think that data(aminoacids) is a good way to distribute the data.

I think it would be good for MSnbase and Pbase to use the data from unimod. That will require unimod to be in Bioconductor first.

lgatto commented 7 years ago

> That will require unimod to be in Bioconductor first.

Just saw the Bioconductor milestone now.

sgibb commented 7 years ago

IMHO there is no need to allow modification by the user. The masses of chemical elements and amino acids are constants that rarely change. Nevertheless, a data.frame that is accessible via data(elements) and data(aminoacids) would be ok. Is there any specific reason to use a tibble? Isn't it just a data.frame with a different constructor, different defaults, fancy printing and modified subset methods? Shouldn't we stay as base as possible?
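
A rough sketch of the read-only use I have in mind: a downstream package only looks masses up by row name. The table below is a small subset of the values above, and `peptideMonoMass` is a hypothetical helper written for this example, not an existing MSnbase or Pbase function.

```r
# Subset of the amino acid table above (monoisotopic masses only).
aminoacids <- data.frame(
    monoMass = c(71.037114, 103.009185, 129.042593, 1.007825, 17.002740),
    row.names = c("A", "C", "E", "N-term", "C-term")
)

# Hypothetical helper: monoisotopic mass of an unmodified peptide,
# i.e. the sum of its residue masses plus the N-term and C-term groups.
peptideMonoMass <- function(sequence, table = aminoacids) {
    residues <- strsplit(sequence, "", fixed = TRUE)[[1]]
    unknown <- setdiff(residues, rownames(table))
    if (length(unknown) > 0L) {
        stop("Unknown residue(s): ", paste(unknown, collapse = ", "))
    }
    # match() does exact matching; `[` on a data.frame would partially
    # match row names (e.g. "N" against "N-term").
    idx <- match(c("N-term", residues, "C-term"), rownames(table))
    sum(table$monoMass[idx])
}

peptideMonoMass("ACE")
```

Nothing here needs to modify the table, which is why a plain data.frame seems sufficient to me.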

lgatto commented 7 years ago

> Shouldn't we stay as base as possible?

If a lot of manipulation is to be expected and described in the vignette/manuals, then I think it's possibly worth using a tibble and the tidyverse. Otherwise yes, let's stick to data.frames - users can always convert them if they see fit.
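
For reference, a small sketch of that conversion on the user side, assuming a three-row subset of the elements table above and that the tibble package may or may not be installed. The column name `symbol` is an arbitrary choice for this example.

```r
# Illustrative subset of the elements table above; symbols are row names.
elements <- data.frame(
    name = c("Hydrogen", "Carbon", "Oxygen"),
    avgMass = c(1.007940, 12.010700, 15.999400),
    monoMass = c(1.007825, 12.000000, 15.994915),
    row.names = c("H", "C", "O"),
    stringsAsFactors = FALSE
)

if (requireNamespace("tibble", quietly = TRUE)) {
    # tibbles do not keep row names, so move the symbols into a column.
    elementsTbl <- tibble::as_tibble(elements, rownames = "symbol")
    print(elementsTbl)
} else {
    message("tibble is not installed; keeping the base data.frame")
}
```

The one thing to watch is the row names: the element symbols and one-letter codes live there, so they need an explicit column after conversion.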
