Import data from Massbank

rformassspectrometry / CompoundDb

Creating and using (chemical) compound databases

https://rformassspectrometry.github.io/CompoundDb/index.html

Import data from Massbank #34

Open jorainer opened 5 years ago

jorainer commented 5 years ago

Import open data from Massbank (https://github.com/MassBank/MassBank-data).

The data seems to be in nicely structured text files, so the import should be straightforward.

jorainer commented 5 years ago

Here too (as with MoNa) we will run into the problem of reducing redundancies in the compound table. At first glance, however, it seems we can use the "CH$LINK" entries, e.g. those providing PubChem identifiers, to define unique compounds and link the MS2 spectra to them.

jorainer commented 5 years ago

Compound information we can extract, with the corresponding MassBank field names:

compound_id: no explicit compound identifier here, but we could use one of the external database links.
compound_name: there can be multiple "CH$NAME: " entries - use one here and the others as synonyms.
inchi: "CH$IUPAC: ".
formula: "CH$FORMULA: ".
mass: "CH$EXACT_MASS: ".
synonyms: "CH$NAME: "

Additional fields we might want to get:

inchi_key: "CH$LINK: INCHIKEY ".
additional identifiers: "CH$LINK: CHEBI ", "CH$LINK: KEGG ", "CH$LINK: PUBCHEM ", "CH$LINK: CHEMSPIDER ".
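
To make the mapping above concrete, here is a minimal sketch in Python (used only for illustration; the actual import would be done in R). The function name `parse_compound` and the sample record are hypothetical: the record is typed in by hand and abbreviated, not copied from a MassBank file, and the sketch assumes each field sits on one line in the form `TAG: value`.

```python
# Hand-written, abbreviated record for illustration only.
SAMPLE_RECORD = """\
ACCESSION: EXAMPLE0001
CH$NAME: Caffeine
CH$NAME: 1,3,7-Trimethylxanthine
CH$FORMULA: C8H10N4O2
CH$EXACT_MASS: 194.0804
CH$IUPAC: InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
CH$LINK: INCHIKEY RYYVLZVUVIJVGH-UHFFFAOYSA-N
CH$LINK: PUBCHEM CID:2519
"""


def parse_compound(record_text):
    """Extract the compound-level fields listed above from one record."""
    names = []   # all CH$NAME values, in file order
    links = {}   # CH$LINK entries: database name -> identifier
    fields = {}  # single-valued CH$ fields

    for line in record_text.splitlines():
        # Split on the first ": " only, so values may contain colons.
        tag, sep, value = line.partition(": ")
        if not sep:
            continue
        value = value.strip()
        if tag == "CH$NAME":
            names.append(value)
        elif tag == "CH$LINK":
            database, _, identifier = value.partition(" ")
            links[database] = identifier.strip()
        elif tag in ("CH$FORMULA", "CH$EXACT_MASS", "CH$IUPAC"):
            fields[tag] = value

    mass = fields.get("CH$EXACT_MASS")
    return {
        "compound_name": names[0] if names else None,
        "synonyms": names[1:],
        "inchi": fields.get("CH$IUPAC"),
        "formula": fields.get("CH$FORMULA"),
        "mass": float(mass) if mass is not None else None,
        "inchi_key": links.get("INCHIKEY"),
        "links": links,
    }


compound = parse_compound(SAMPLE_RECORD)
```

The first "CH$NAME" becomes compound_name and the remaining ones become synonyms; which name to prefer is a design choice that is still open here.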

We could use a self-generated identifier and collapse entries that share any one of the identifiers above.
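
One possible way to do this collapsing, sketched in Python for illustration only: treat two entries as the same compound if they share any identifier, and group them transitively with a small union-find. The function `assign_compound_ids`, the `CMP00001` identifier format and the input layout (one dict of "CH$LINK" values per entry) are all assumptions made for this sketch.

```python
def assign_compound_ids(entries,
                        keys=("INCHIKEY", "PUBCHEM", "CHEBI",
                              "KEGG", "CHEMSPIDER")):
    """Give entries that share any identifier the same generated ID.

    `entries` is a list of dicts mapping database name -> identifier.
    Returns one generated compound ID per entry, in input order.
    """
    parent = list(range(len(entries)))

    def find(i):
        # Follow parent links to the group root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Keep the smaller index as root so IDs follow input order.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}  # (database, identifier) -> index of first entry
    for index, links in enumerate(entries):
        for key in keys:
            value = links.get(key)
            if value is None:
                continue
            token = (key, value)
            if token in first_seen:
                union(index, first_seen[token])
            else:
                first_seen[token] = index

    generated = {}  # group root -> generated compound ID
    result = []
    for index in range(len(entries)):
        root = find(index)
        if root not in generated:
            generated[root] = f"CMP{len(generated) + 1:05d}"
        result.append(generated[root])
    return result
```

Entries without any of the listed identifiers end up as their own compound in this sketch; whether that is acceptable depends on how many such entries the data contains.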

Next we need to read the full data to check how to best reduce the information:

Does every entry have an InChIKey?
Is an external identifier, e.g. PubChem, present in all entries?
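
Both questions come down to counting, per identifier type, the share of entries that carry it. A minimal Python sketch (illustration only), assuming the "CH$LINK" lines of each entry have already been read into a dict; `link_coverage` is a hypothetical helper name:

```python
from collections import Counter


def link_coverage(entries):
    """Return, per database name, the fraction of entries that have it.

    `entries` is a list of dicts mapping database name -> identifier.
    """
    counts = Counter()
    for links in entries:
        # Count each database at most once per entry.
        counts.update(set(links))
    total = len(entries)
    return {database: counts[database] / total for database in counts}
```

An identifier type with a coverage of 1.0 is present in every entry and would be a candidate for defining unique compounds on its own.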

michaelwitting commented 4 years ago

Import should now be possible with https://github.com/michaelwitting/MsBackendMassbank. It's working fine so far.

michaelwitting commented 4 years ago

Do you need the field names exactly as you named them? I could adjust them in MsBackendMassbank. Should the mass field hold the adduct mass or the neutral mass?

jorainer commented 4 years ago

Regarding the field names: for the compounds table, if your names differ we can try to agree on common ones. For the msms_spectrum table, ideally keep them the way I named them. I took the name of the attribute in Spectra (such as precursorMz) and replaced each capital letter with an underscore followed by its lower-case form, i.e. precursorMz -> precursor_mz.
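
The renaming rule can be written as a one-line regular expression. This is a Python sketch for illustration only, and `to_column_name` is a hypothetical helper, not a function of CompoundDb or Spectra:

```python
import re


def to_column_name(spectra_variable):
    """Replace each capital letter with '_' plus its lower-case form."""
    return re.sub(r"[A-Z]",
                  lambda match: "_" + match.group(0).lower(),
                  spectra_variable)


print(to_column_name("precursorMz"))      # precursor_mz
print(to_column_name("collisionEnergy"))  # collision_energy
```

Names without capital letters pass through unchanged.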

And the mass field should ideally contain the neutral monoisotopic mass. The adduct mass (m/z) should then be calculated with mass2mz, or vice versa.
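
For reference, the relationship between the neutral mass and the adduct m/z can be sketched as follows. This is a rough Python illustration of the arithmetic and not the implementation of mass2mz; the function names and the small adduct table are assumptions, with each mass shift given as the added or removed species corrected for the electron mass.

```python
# adduct -> (multiplier for the neutral mass, mass shift in Da, |charge|)
# Illustrative subset for singly charged adducts only.
ADDUCTS = {
    "[M+H]+": (1, 1.007276, 1),
    "[M+Na]+": (1, 22.989220, 1),
    "[M-H]-": (1, -1.007276, 1),
}


def mass_to_mz(mass, adduct="[M+H]+"):
    """Neutral monoisotopic mass -> m/z of the given adduct."""
    multiplier, shift, charge = ADDUCTS[adduct]
    return (multiplier * mass + shift) / charge


def mz_to_mass(mz, adduct="[M+H]+"):
    """m/z of the given adduct -> neutral monoisotopic mass."""
    multiplier, shift, charge = ADDUCTS[adduct]
    return (mz * charge - shift) / multiplier
```

Storing only the neutral mass keeps the compound table independent of ionization mode, since any adduct m/z can be derived from it on demand.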

Let me know if something is unclear or we need to adapt.