rformassspectrometry / CompoundDb

Creating and using (chemical) compound databases
https://rformassspectrometry.github.io/CompoundDb/index.html

Add code/description how to create a CompDb from MassBank #66

Open jorainer opened 3 years ago

jorainer commented 3 years ago

MassBank releases its database at regular intervals and shares the data under a rather open license, which makes it an ideal candidate for annotation databases that could be distributed via Bioconductor's AnnotationHub.

Explanation: I'm building so-called EnsDb databases for all species for each release of Ensembl. These databases are self-contained SQLite files with gene, transcript, exon and protein annotations, and they can be fetched from AnnotationHub. This is very convenient for the user.

CompDb databases could be distributed in a similar fashion.

Next, I will try to define simple scripts that import data from the MassBank MySQL database into a CompDb database.

stanstrup commented 3 years ago

Is there an advantage to this compared to using the SDF from MoNA?

jorainer commented 3 years ago

I cannot speak to the content. What I like about MassBank is that a) the license is pretty clear, so the data can be (re)shared, b) MassBank makes releases, which allows the data to be "frozen" (important for reproducible research), and c) extracting the data directly from their database is easier than importing from text files (SDF and/or JSON).

jorainer commented 3 years ago

OMG - did not expect that. So, MassBank has one compound for each spectrum. Far from being a normalized database :(

michaelwitting commented 3 years ago

Yes, and the IDs differ between the contributing labs. The only common field that could be used to cross-map compounds is the InChIKey, but I have never tried that so far.

jorainer commented 3 years ago

The problem is that not all compounds have an InChIKey, which makes this really tricky. Well, for now I will import the data as is.

michaelwitting commented 3 years ago

Do all of them have a SMILES? Then the InChIKey could be calculated with this one: https://github.com/CDK-R/rinchi

jorainer commented 3 years ago

Indeed - it seems that all of them have SMILES. Good point - maybe you could chime in here too: https://github.com/MassBank/MassBank-web/issues/266
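
For illustration, the cross-mapping idea discussed above could be sketched roughly as follows. This is not part of CompoundDb or MassBank, and it is written in Python only to keep the example self-contained. The record fields (`spectrum_id`, `name`, `inchikey`, `smiles`) and both function names are hypothetical and do not reflect the actual MassBank schema. The sketch collapses the one-compound-per-spectrum records into unique compounds keyed on the InChIKey and falls back to the raw SMILES string when no InChIKey is present. Note that SMILES strings are not canonical, so the fallback only merges records with identical strings; computing an InChIKey from the SMILES (e.g. with a tool such as rinchi) would be the more robust option.

```python
# Hypothetical per-spectrum records: one compound entry per spectrum.
# Field names are assumptions for illustration, not the MassBank schema.
records = [
    {"spectrum_id": "S1", "name": "Caffeine",
     "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N",
     "smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"},
    {"spectrum_id": "S2", "name": "caffeine",
     "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N",
     "smiles": "Cn1cnc2c1c(=O)n(C)c(=O)n2C"},
    {"spectrum_id": "S3", "name": "Ethanol", "inchikey": None, "smiles": "CCO"},
    {"spectrum_id": "S4", "name": "Ethanol", "inchikey": "", "smiles": "CCO"},
]


def compound_key(record):
    """Return the key used to decide whether two records are the same compound.

    Prefers the InChIKey, falls back to the raw SMILES string, and as a last
    resort uses the spectrum ID so that the record stays its own compound.
    """
    inchikey = (record.get("inchikey") or "").strip()
    if inchikey:
        return ("inchikey", inchikey)
    smiles = (record.get("smiles") or "").strip()
    if smiles:
        return ("smiles", smiles)
    return ("spectrum", record["spectrum_id"])


def normalize(records):
    """Collapse per-spectrum compound records into unique compounds.

    Returns a list of unique compounds and a dict that maps each spectrum ID
    to the ID of its compound.
    """
    compounds = {}
    spectrum_to_compound = {}
    for rec in records:
        key = compound_key(rec)
        if key not in compounds:
            # The first record seen for a key provides the compound name.
            compounds[key] = {
                "compound_id": f"CMP{len(compounds) + 1:04d}",
                "name": rec["name"],
                "key": key,
            }
        spectrum_to_compound[rec["spectrum_id"]] = compounds[key]["compound_id"]
    return list(compounds.values()), spectrum_to_compound


compounds, mapping = normalize(records)
for spectrum_id, compound_id in mapping.items():
    print(spectrum_id, "->", compound_id)
```

With such a mapping, a normalized database would hold one row per compound plus a spectrum table that references it, instead of one compound row per spectrum.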