rformassspectrometry / CompoundDb

Creating and using (chemical) compound databases
https://rformassspectrometry.github.io/CompoundDb/index.html

Re-define compound table #35

Open jorainer opened 5 years ago

jorainer commented 5 years ago

The compound table has two purposes: 1) contain exactly one unique entry per compound, and 2) allow grouping of e.g. multiple MS2 spectra into a single entity.

The question, however, is how to define a compound. What is a compound? An entity with its own unique InChI? Structure == compound?

For the HMDB database it was pretty straightforward, as HMDB provides compound identifiers. MoNA (issue #23) and MassBank (issue #34), however, are more complicated, as they don't make it possible to unify the data in the same way.

What we should do:

jorainer commented 5 years ago

For ChEBI (2018-12-03):

So we don't have an InChI for all of them, and some compounds share the same InChI! Apart from the name and the ID, however, these compounds are identical:

      compound_id                            compound_name
8564  CHEBI:17775   7,9-dihydro-1H-purine-2,6,8(3H)-trione
18506 CHEBI:46811 2,6-dihydroxy-7,9-dihydro-8H-purin-8-one
18507 CHEBI:46814                    9H-purine-2,6,8-triol
18509 CHEBI:46817                    7H-purine-2,6,8-triol
18513 CHEBI:46823                    1H-purine-2,6,8-triol
27249 CHEBI:62589     6-hydroxy-1H-purine-2,8(7H,9H)-dione
                                                                         inchi
8564  InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3/h(H4,6,7,8,9,10,11,12)
18506 InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3/h(H4,6,7,8,9,10,11,12)
18507 InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3/h(H4,6,7,8,9,10,11,12)
18509 InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3/h(H4,6,7,8,9,10,11,12)
18513 InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3/h(H4,6,7,8,9,10,11,12)
27249 InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3/h(H4,6,7,8,9,10,11,12)
                        inchi_key  formula    mass
8564  LEHOTFFKMJEONL-UHFFFAOYSA-N C5H4N4O3 168.028
18506 LEHOTFFKMJEONL-UHFFFAOYSA-N C5H4N4O3 168.028
18507 LEHOTFFKMJEONL-UHFFFAOYSA-N C5H4N4O3 168.028
18509 LEHOTFFKMJEONL-UHFFFAOYSA-N C5H4N4O3 168.028
18513 LEHOTFFKMJEONL-UHFFFAOYSA-N C5H4N4O3 168.028
27249 LEHOTFFKMJEONL-UHFFFAOYSA-N C5H4N4O3 168.028

The question is whether these compounds would have different MS2 spectra. If so, it would not make sense to combine them!

Some of the compounds without an InChI are listed below:

     compound_id            compound_name inchi inchi_key
3    CHEBI:10003     ribostamycin sulfate  <NA>      <NA>
15   CHEBI:10036                wax ester  <NA>      <NA>
91   CHEBI:10283     2-hydroxy fatty acid  <NA>      <NA>
140  CHEBI:10545                 electron  <NA>      <NA>
148  CHEBI:10583        kappa-carrageenan  <NA>      <NA>
154 CHEBI:106304 sphingomyelin d18:1/16:0  <NA>      <NA>
                     formula    mass
3       C17H34N4O10.(H2O4S)n      NA
15                     CO2R2  43.990
91  C2H3O3R __ C2H3O3R(CH2)n  75.008
140                     <NA>   0.000
148            (C12H17O12S)n      NA
154              C39H79N2O6P 702.568
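
The two checks above (entries sharing one InChI, and entries with no InChI at all) could be sketched roughly as follows. This is only an illustration in plain Python, not how CompoundDb does it: the `records` list is a hand-copied subset of the ChEBI rows shown above, and `group_by_inchi_key` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-copied subset of the ChEBI rows shown above (IDs and InChIKeys only).
# None stands for the <NA> entries that have no InChI/InChIKey.
records = [
    {"compound_id": "CHEBI:17775", "inchi_key": "LEHOTFFKMJEONL-UHFFFAOYSA-N"},
    {"compound_id": "CHEBI:46811", "inchi_key": "LEHOTFFKMJEONL-UHFFFAOYSA-N"},
    {"compound_id": "CHEBI:46814", "inchi_key": "LEHOTFFKMJEONL-UHFFFAOYSA-N"},
    {"compound_id": "CHEBI:46817", "inchi_key": "LEHOTFFKMJEONL-UHFFFAOYSA-N"},
    {"compound_id": "CHEBI:46823", "inchi_key": "LEHOTFFKMJEONL-UHFFFAOYSA-N"},
    {"compound_id": "CHEBI:62589", "inchi_key": "LEHOTFFKMJEONL-UHFFFAOYSA-N"},
    {"compound_id": "CHEBI:10003", "inchi_key": None},
    {"compound_id": "CHEBI:10545", "inchi_key": None},
]


def group_by_inchi_key(rows):
    """Hypothetical helper: return (groups, missing).

    groups maps each InChIKey to the list of IDs that share it;
    missing lists the IDs that have no InChIKey at all.
    """
    groups = defaultdict(list)
    missing = []
    for row in rows:
        key = row["inchi_key"]
        if key is None:
            missing.append(row["compound_id"])
        else:
            groups[key].append(row["compound_id"])
    return dict(groups), missing


groups, missing = group_by_inchi_key(records)

# InChIKeys that are shared by more than one source entry.
duplicated = {key: ids for key, ids in groups.items() if len(ids) > 1}

for key, ids in duplicated.items():
    print(key, "->", ", ".join(ids))
print("no InChI:", ", ".join(missing))
```

Whether the entries grouped this way should actually be merged is exactly the open question; the sketch only finds the candidates.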

SiggiSmara commented 5 years ago

CHEBI:46814 and CHEBI:46817, for instance (and I suspect the rest of them), are not the same chemical at first glance (see below, different locations of a hydrogen), but they are in fact tautomers of each other. This is also indicated in the ChEBI entries of some of them if you look them up. That means they readily convert from one to the other without any external input (energy or otherwise) and thus should really be thought of as a mixture of all of them. The MS2 spectra "should" be similar if not identical, but the actual ionization conditions (pH, buffer ions etc.) might also have a big effect, leading to different MS2 spectra.

I would suggest getting input from people who actually work with tautomers to hear what they have to say about it.

[Images: chemical structures of CHEBI:46814 and CHEBI:46817]

jorainer commented 5 years ago

Thanks for your input, @SiggiSmara! I'll try to get some input from people who actually work with MS2 spectra and identification.

stanstrup commented 5 years ago

I have no experience with tautomers, but one option could be to use the SMILES, where the hydrogen positions are explicit. You can also generate a non-standard InChI with the fixed-H layer from the SMILES.

jorainer commented 5 years ago

I also got feedback from Steffen. They use the same approach as PubChem: a compound table with unique InChIs and a substance table with additional annotations (possibly multiple entries per compound).
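
A minimal sketch of such a compound/substance split, using an in-memory SQLite database from Python. The table and column names here are assumptions made up for illustration, not the actual CompoundDb (or PubChem) schema; the rows are the six ChEBI entries from the earlier comment that share one InChI.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE compound (
        compound_id INTEGER PRIMARY KEY,
        inchi       TEXT NOT NULL UNIQUE,   -- one row per unique InChI
        inchi_key   TEXT NOT NULL,
        formula     TEXT,
        mass        REAL
    );
    CREATE TABLE substance (
        substance_id TEXT PRIMARY KEY,      -- source identifier, e.g. a ChEBI ID
        name         TEXT,
        compound_id  INTEGER NOT NULL REFERENCES compound (compound_id)
    );
""")

# The single structure shared by the six ChEBI entries.
compound_row = (
    "InChI=1S/C5H4N4O3/c10-3-1-2(7-4(11)6-1)8-5(12)9-3/h(H4,6,7,8,9,10,11,12)",
    "LEHOTFFKMJEONL-UHFFFAOYSA-N",
    "C5H4N4O3",
    168.028,
)
cur = con.execute(
    "INSERT INTO compound (inchi, inchi_key, formula, mass) VALUES (?, ?, ?, ?)",
    compound_row,
)
compound_id = cur.lastrowid

# Source-specific annotations: several substances point to the same compound.
substances = [
    ("CHEBI:17775", "7,9-dihydro-1H-purine-2,6,8(3H)-trione"),
    ("CHEBI:46811", "2,6-dihydroxy-7,9-dihydro-8H-purin-8-one"),
    ("CHEBI:46814", "9H-purine-2,6,8-triol"),
    ("CHEBI:46817", "7H-purine-2,6,8-triol"),
    ("CHEBI:46823", "1H-purine-2,6,8-triol"),
    ("CHEBI:62589", "6-hydroxy-1H-purine-2,8(7H,9H)-dione"),
]
con.executemany(
    "INSERT INTO substance (substance_id, name, compound_id) VALUES (?, ?, ?)",
    [(sid, name, compound_id) for sid, name in substances],
)

n_compounds = con.execute("SELECT COUNT(*) FROM compound").fetchone()[0]
n_substances = con.execute(
    "SELECT COUNT(*) FROM substance WHERE compound_id = ?", (compound_id,)
).fetchone()[0]
print(n_compounds, "compound,", n_substances, "substances")
```

With this layout, MS2 spectra could reference either level: the compound row to pool everything with the same InChI, or the substance row if tautomer-specific spectra need to stay separate. Entries without an InChI would need a different key, which this sketch does not address.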