Open jorainer opened 5 years ago
As with MoNa, we will have to reduce redundancies in the compound table here too. At first glance, however, it seems we can use the "CH$LINK" entries (e.g. those providing PubChem identifiers) to define unique compounds and link the MS2 spectra to those.
Compound information we can extract with the corresponding field name:
compound_id: no explicit compound name here, but we could use one of the external database links.
compound_name: we can have multiple "CH$NAME: " entries; use one here, the others go down into synonyms.
inchi: "CH$IUPAC: ".
formula: "CH$FORMULA: ".
mass: "CH$EXACT_MASS: ".
synonyms: "CH$NAME: ".
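The mapping above could be sketched roughly as follows. This is plain Python rather than the R packages discussed in this thread, `parse_compound` is a hypothetical helper, and the record lines are typed by hand for illustration (the accession is a placeholder), not copied from an actual MassBank file. Using the PubChem link as `compound_id` is just one of the options mentioned above.

```python
# Map single-valued MassBank tags to the compound fields listed above.
FIELD_TAGS = {
    "CH$IUPAC": "inchi",
    "CH$FORMULA": "formula",
    "CH$EXACT_MASS": "mass",
}

# Hand-typed example lines (assumption: not a real MassBank record).
RECORD = """\
ACCESSION: XX000001
CH$NAME: Caffeine
CH$NAME: 1,3,7-Trimethylxanthine
CH$FORMULA: C8H10N4O2
CH$EXACT_MASS: 194.08038
CH$IUPAC: InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
CH$LINK: PUBCHEM CID:2519
"""


def parse_compound(text):
    """Extract the compound fields from the text of one record."""
    names = []    # all CH$NAME entries, in file order
    links = {}    # external database name -> identifier
    single = {}   # single-valued fields
    for line in text.splitlines():
        tag, sep, value = line.partition(": ")
        if not sep:
            continue  # not a "TAG: value" line
        value = value.strip()
        if tag == "CH$NAME":
            names.append(value)
        elif tag == "CH$LINK":
            database, _, identifier = value.partition(" ")
            links[database] = identifier.strip()
        elif tag in FIELD_TAGS:
            single[FIELD_TAGS[tag]] = value
    return {
        # Assumption: use the PubChem link as the compound identifier.
        "compound_id": links.get("PUBCHEM"),
        # First name is the compound name, the rest become synonyms.
        "compound_name": names[0] if names else None,
        "inchi": single.get("inchi"),
        "formula": single.get("formula"),
        "mass": float(single["mass"]) if "mass" in single else None,
        "synonyms": names[1:],
    }


compound = parse_compound(RECORD)
print(compound["compound_name"], compound["synonyms"])
```

Records without a PubChem link would end up with `compound_id` set to `None` here, so a fallback to another link or a self-generated identifier would still be needed.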
Additional fields we might want to get:
inchi_key: "CH$LINK: INCHIKEY ".
"CH$LINK: CHEBI ", "CH$LINK: KEGG ", "CH$LINK: PUBCHEM ", "CH$LINK: CHEMSPIDER ".
We could use a self-generated identifier and collapse entries that share any one of the identifiers above.
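One possible reading of "collapse entries with the same identifier" is sketched below in plain Python: records that share a value in any of the link fields get the same self-generated identifier, and the grouping is transitive. The function name, the lowercase field names and the `CMP00001` identifier format are assumptions made for this example only.

```python
LINK_FIELDS = ("inchi_key", "chebi", "kegg", "pubchem", "chemspider")


def collapse_compounds(records):
    """Return one self-generated compound id per record.

    Records sharing a value in any link field end up in the same group
    (union-find, so A~B and B~C implies A~C).
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (field, value) -> index of first record carrying it
    for idx, record in enumerate(records):
        for field in LINK_FIELDS:
            value = record.get(field)
            if not value:
                continue
            key = (field, value)
            if key in first_seen:
                a, b = find(idx), find(first_seen[key])
                if a != b:
                    # Keep the smaller index as root for stable numbering.
                    parent[max(a, b)] = min(a, b)
            else:
                first_seen[key] = idx

    ids = {}
    result = []
    for idx in range(len(records)):
        root = find(idx)
        if root not in ids:
            ids[root] = "CMP%05d" % (len(ids) + 1)
        result.append(ids[root])
    return result


# Placeholder identifiers, only to show the grouping behaviour.
example = [
    {"inchi_key": "KEY-A", "pubchem": "CID:1"},
    {"pubchem": "CID:1", "kegg": "C1"},
    {"kegg": "C1"},
    {"inchi_key": "KEY-B"},
]
compound_ids = collapse_compounds(example)
print(compound_ids)
```

Transitive merging can chain together entries that only share a coarse identifier, so whether all five link types should be allowed to trigger a merge would need checking against the full data.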
Next we need to read the full data to check how best to reduce the information.
Import should now be possible with https://github.com/michaelwitting/MsBackendMassbank. It's working fine so far.
Do you need the field names exactly as you named them? I could adjust them in MsBackendMassbank. Do you want the adduct mass in mass, or the neutral mass?
Regarding the field names: for the compounds table, if you have different names we can try to find common ground. For the msms_spectrum table, ideally use the names I chose. I took the name of the attribute in Spectra (such as precursorMz) and replaced each capital letter with _<lower case>, i.e. precursorMz -> precursor_mz.
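For illustration, the renaming rule can be written as a one-line substitution. This is a Python sketch with a hypothetical function name, not code from either package:

```python
import re


def to_column_name(attribute):
    """Replace each capital letter with an underscore plus its lower case."""
    return re.sub(r"[A-Z]", lambda m: "_" + m.group(0).lower(), attribute)


print(to_column_name("precursorMz"))
```

A name that starts with a capital letter would get a leading underscore under this rule, so such names would need special handling.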
And mass should ideally contain the neutral monoisotopic mass. The adduct mass (m/z) should then be calculated with mass2mz, or vice versa.
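To make the relation between the two values concrete, here is a rough Python sketch of the conversion for a few singly charged adducts. It is not the implementation of mass2mz; the function names and the small adduct table are assumptions for this example, and the mass shifts are rounded values (proton about 1.007276, sodium minus one electron about 22.989221).

```python
# Mass shift in Da for some singly charged adducts (rounded, illustrative).
ADDUCT_SHIFT = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989221,
    "[M-H]-": -1.007276,
}


def mass_to_mz(mass, adduct="[M+H]+"):
    """m/z of a singly charged adduct from the neutral monoisotopic mass."""
    return mass + ADDUCT_SHIFT[adduct]


def mz_to_mass(mz, adduct="[M+H]+"):
    """Neutral monoisotopic mass from the m/z of a singly charged adduct."""
    return mz - ADDUCT_SHIFT[adduct]


neutral = 194.08038  # example neutral mass (C8H10N4O2)
mz = mass_to_mz(neutral, "[M+H]+")
print(round(mz, 6))
```

Storing the neutral mass keeps the compound table independent of the adduct, since any m/z can be derived from it when needed.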
Let me know if something is unclear or we need to adapt.
Import open data from MassBank (https://github.com/MassBank/MassBank-data).
The data seems to be in nicely structured txt files, so import should be straightforward.