Add generate_hmdb_tbl - Githubissues

jorainer commented 6 years ago

Add the generate_hmdb_tbl function to create a simple compound tibble from an xml file from HMDB.
Add related test files and unit tests.
Add documentation.

stanstrup commented 6 years ago

Is there an advantage to parsing the XML instead of the SDF? ChemmineR::datablock2ma is very convenient for getting all the info.

jorainer commented 6 years ago

The reason was that I had a script to retrieve all compounds individually from HMDB, because the release files were not really up to date. If you query them online (e.g. http://www.hmdb.ca/metabolites/HMDB0000001.xml) you get XML; that's basically why.
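For readers who want to see what turning one such per-compound XML record into a table row involves, here is a minimal sketch. It is written in Python rather than R and is not the code behind generate_hmdb_tbl. The element names (accession, name, chemical_formula), the trimmed sample record and the helper names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative, heavily trimmed stand-in for one metabolite record.
# The element names are assumptions, not taken from this thread.
SAMPLE_XML = """
<metabolite>
  <accession>HMDB0000001</accession>
  <name>1-Methylhistidine</name>
  <chemical_formula>C7H11N3O2</chemical_formula>
</metabolite>
"""

# Columns we want in the resulting row (hypothetical selection).
FIELDS = ("accession", "name", "chemical_formula")


def local_name(tag):
    """Drop a '{namespace}' prefix so lookups work with or without xmlns."""
    return tag.rsplit("}", 1)[-1]


def metabolite_to_row(xml_text):
    """Parse one metabolite record into a dict with one key per field."""
    root = ET.fromstring(xml_text.strip())
    # Only direct children of the root are considered in this sketch.
    values = {local_name(child.tag): (child.text or "").strip() for child in root}
    # Fields missing from the record become None instead of raising.
    return {field: values.get(field) for field in FIELDS}


row = metabolite_to_row(SAMPLE_XML)
print(row)
```

A list of such rows is already close to a compound table: one row per compound, one column per field.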

jorainer commented 6 years ago

After implementing the SDF parse function too, one advantage of the XML parsing is speed. But in the end it's good to have both in place. Thanks for the suggestion!

jorainer commented 6 years ago

OK, HMDB SDF support is in. generate_hmdb_tbl can now be used with file being the name of an HMDB file in either XML or SDF format.
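To make the SDF side concrete, here is a rough Python sketch of reading the data items of an SDF file: records end with a "$$$$" line, and each data item is a "> <FIELD>" header followed by value lines and a blank line. This is not the package's R implementation. The field names DATABASE_ID and GENERIC_NAME and the sample record are assumptions for illustration, and the structure block is ignored.

```python
import re

# A data header looks like "> <FIELD_NAME>".
HEADER = re.compile(r"^>\s*<([^>]+)>")

# Illustrative record; the field names are assumptions, and the
# structure (molblock) part is replaced by a placeholder line.
SAMPLE_SDF = """HMDB0000001
  (structure block omitted in this sketch)

M  END
> <DATABASE_ID>
HMDB0000001

> <GENERIC_NAME>
1-Methylhistidine

$$$$
"""


def parse_sdf_data(sdf_text):
    """Return one dict of data items per SDF record."""
    records = []
    for block in sdf_text.split("$$$$"):
        fields = {}
        current = None  # name of the data item being read, if any
        for line in block.splitlines():
            match = HEADER.match(line)
            if match:
                current = match.group(1)
                fields[current] = []
            elif current is not None:
                if line.strip() == "":
                    current = None  # a blank line ends the data item
                else:
                    fields[current].append(line.strip())
        if fields:  # skip the empty remainder after the last "$$$$"
            records.append({key: "\n".join(val) for key, val in fields.items()})
    return records


records = parse_sdf_data(SAMPLE_SDF)
print(records)
```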

jorainer commented 6 years ago

I also added some first code to write the tbl into a SQLite database (including metadata). Next steps will be:

1) Implement the CompoundDb object that can be used to interface with the database.
2) Implement the code to create an R package containing the annotation.
3) Implement all required methods to use the CompoundDb. The simplest one will be to extract all data in the form of a tbl so it can be used directly with your code.
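As an illustration of storing a compound table in SQLite together with metadata, here is a minimal Python sketch using the standard sqlite3 module. The table layout (compound and metadata), the column names, the metadata keys and the function name write_compound_db are all assumptions; the actual database schema is not described in this thread.

```python
import sqlite3


def write_compound_db(path, compounds, metadata):
    """Write compound rows plus key/value metadata into a SQLite database.

    `compounds` is a list of dicts, `metadata` a dict of strings.
    Table and column names are hypothetical.
    """
    con = sqlite3.connect(path)
    with con:  # commit on success, roll back on error
        con.execute(
            "CREATE TABLE compound ("
            "compound_id TEXT PRIMARY KEY, name TEXT, formula TEXT)"
        )
        con.executemany(
            "INSERT INTO compound VALUES (:compound_id, :name, :formula)",
            compounds,
        )
        con.execute("CREATE TABLE metadata (name TEXT PRIMARY KEY, value TEXT)")
        con.executemany("INSERT INTO metadata VALUES (?, ?)", metadata.items())
    return con


con = write_compound_db(
    ":memory:",  # in-memory database, so nothing is written to disk here
    [{"compound_id": "HMDB0000001", "name": "1-Methylhistidine",
      "formula": "C7H11N3O2"}],
    {"source": "HMDB", "source_version": "4.0"},  # illustrative values
)
query = "SELECT value FROM metadata WHERE name = 'source_version'"
print(con.execute(query).fetchone()[0])
```

Keeping the source name and version next to the data means any later consumer can check which release the table was built from.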

jorainer commented 6 years ago

Right, no need to make the pull request right now - better to wait, but good that you start looking at the code, otherwise it will be too much to look at ;)

jorainer commented 6 years ago

OK, now I have all of the core stuff in place:

CompoundDb S4 object to provide access to the related SQLite database files.
createCompoundDb function to create a CompoundDb SQLite database file from a compound tbl (such as created by e.g. generate_hmdb_tbl).
createCompoundDbPackage function to create an annotation package containing the CompoundDb SQLite database file (see e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0 for such a package).
compounds function to extract compound data from a CompoundDb object.
src_compdb function to allow accessing the CompoundDb data in dplyr style (see ?src_compdb).
A vignette describing how to use the createCompoundDb and createCompoundDbPackage functions.
Unit tests for all internal and exported functions.
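The idea of an object that wraps the SQLite file and hands back compound data can be sketched as follows. This is a hypothetical Python stand-in, not the CompoundDb S4 class or its compounds function: the class name CompoundStore, the compound table layout and the sample rows are invented for illustration.

```python
import sqlite3


class CompoundStore:
    """Hypothetical read-only wrapper around a compound SQLite database."""

    def __init__(self, connection):
        self._con = connection

    def columns(self):
        """Column names of the (assumed) compound table."""
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
        return [row[1] for row in self._con.execute("PRAGMA table_info(compound)")]

    def compounds(self, columns=None):
        """Return all compounds as a list of dicts, optionally a column subset."""
        available = self.columns()
        selected = list(columns) if columns else available
        # Validate against the real columns before building the SQL string.
        unknown = [col for col in selected if col not in available]
        if unknown:
            raise ValueError(f"unknown columns: {unknown}")
        query = "SELECT {} FROM compound ORDER BY compound_id".format(
            ", ".join(selected)
        )
        return [dict(zip(selected, row)) for row in self._con.execute(query)]


# Made-up rows, only to exercise the wrapper.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE compound (compound_id TEXT, name TEXT, formula TEXT)")
con.executemany(
    "INSERT INTO compound VALUES (?, ?, ?)",
    [("CMP2", "compound B", "C2H6O"), ("CMP1", "compound A", "CH4")],
)
store = CompoundStore(con)
print(store.compounds(["compound_id", "formula"]))
```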

Have a look at it @stanstrup and let me know if it's OK or if you'd like changes.

stanstrup commented 6 years ago

Thanks! This is awesome. And gulp! There is a lot to look at. It might take me some days.

Just a few Qs for now:

1) Is it better to have one package for each DB rather than one with a collection?
2) Are you sure about the HMDB license, that you can put up the db? On the website it is indicated that you need permission. I have contacted them to hear what we can do.

jorainer commented 6 years ago

Yes, sorry that I added so much ;) - I just wanted to make sure it is at a stage where it might be useful. And for the largest part it's documentation, comments and unit tests.

Re Qs: 1) I think yes. Reasons to keep the resources separate are:

Easier to maintain, generate and provide.
Different resources will also require different licenses, and one package can only have one license.
Different resources will have different release cycles. Having each in a separate package allows tagging them with the correct version of the original resource. That is crucial for reproducible research. That's also the way people do it for gene annotations: you can have NCBI, Ensembl or UCSC based annotations, but each is provided in its own package.

2) No, I'm pretty sure that's not the correct license, but as long as the data is not used for commercial purposes it should be OK. They state:

"Use and re-distribution of the data, in whole or in part, for commercial purposes requires explicit permission of the authors and explicit acknowledgment of the source material (HMDB) and the original publication (see below)"

Once they reply I will have to fix the license. In the end we will probably place a license file specific to each annotation resource into the annotation package.

stanstrup / PeakABro

Add generate_hmdb_tbl #10