Open jorainer opened 6 years ago
Is there an advantage to parsing the XML instead of the SDF? `ChemmineR::datablock2ma` is very convenient for getting all the info.
The reason was that I had a script to retrieve all compounds individually from HMDB, because the release files were not really up to date. If you query them online (e.g. http://www.hmdb.ca/metabolites/HMDB0000001.xml) you get the XML; that's basically why.
Having now implemented the SDF parsing function too, one advantage of the XML parsing is speed. But in the end it's good to have both in place. Thanks for the suggestion!
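For readers unfamiliar with the SDF route: it comes down to reading the tagged data block that follows each structure record. Below is a minimal sketch of that idea in Python, using only the standard library rather than R/ChemmineR. The function name, the sample records and their field names are made up for illustration and are not taken from an HMDB release or from this package.

```python
def parse_sdf_data_blocks(text):
    """Return one dict of data-block fields per SDF record."""
    records = []
    # Records in an SD file are terminated by a line containing "$$$$".
    for raw in text.split("$$$$"):
        fields = {}
        tag = None
        for line in raw.splitlines():
            line = line.rstrip()
            if line.startswith(">") and "<" in line:
                # Header line such as "> <DATABASE_ID>": keep the name between < and >.
                tag = line[line.index("<") + 1 : line.rindex(">")]
                fields[tag] = []
            elif tag is not None:
                if line == "":
                    tag = None  # a blank line closes the current data item
                else:
                    fields[tag].append(line)
        if fields:
            records.append({k: "\n".join(v) for k, v in fields.items()})
    return records


# Hypothetical two-record input; the structure blocks are left empty on purpose.
SAMPLE = """\
example-1
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <DATABASE_ID>
EX0000001

> <GENERIC_NAME>
Example compound one

$$$$
example-2
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <DATABASE_ID>
EX0000002

> <GENERIC_NAME>
Example compound two

$$$$
"""

records = parse_sdf_data_blocks(SAMPLE)
```

Each element of `records` is a plain dictionary keyed by field name, which is roughly the shape a per-compound table row would be built from.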
OK, HMDB SDF support is in. `generate_hmdb_tbl` can now be used with `file` being the file name of an HMDB file in either XML or SDF format.
I also added some first code to write the `tbl` into a SQLite database (including metadata). Next things will be:
1) Implement the `CompoundDb` object, which can be used to interface with the database.
2) Implement the code to create an R package containing the annotation.
3) Implement all required methods to use the `CompoundDb`. The simplest one will be to extract all data in the form of a `tbl` so it can be used directly with your code.
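To make the "compound table plus metadata in one SQLite file" idea concrete, here is a rough sketch in Python using the standard `sqlite3` module. It is not the package's R code: the table layout, the column names, the `write_compound_db` helper and the example rows are all assumptions chosen for illustration.

```python
import sqlite3


def write_compound_db(path, compounds, metadata):
    """Write compound rows and key/value metadata into one SQLite database."""
    con = sqlite3.connect(path)
    with con:  # commit on success, roll back on error
        con.execute(
            "CREATE TABLE compound ("
            "compound_id TEXT PRIMARY KEY, name TEXT, formula TEXT, mass REAL)"
        )
        con.executemany("INSERT INTO compound VALUES (?, ?, ?, ?)", compounds)
        # Metadata (source, version, ...) travels with the data in the same file.
        con.execute("CREATE TABLE metadata (name TEXT PRIMARY KEY, value TEXT)")
        con.executemany(
            "INSERT INTO metadata VALUES (?, ?)", list(metadata.items())
        )
    return con


# Placeholder rows and metadata; the identifiers are invented for this example.
example_compounds = [
    ("EX0000001", "Water", "H2O", 18.010565),
    ("EX0000002", "Carbon dioxide", "CO2", 43.989829),
]
example_metadata = {"source": "example", "source_version": "0"}

con = write_compound_db(":memory:", example_compounds, example_metadata)
n_compounds = con.execute("SELECT COUNT(*) FROM compound").fetchone()[0]
source = con.execute(
    "SELECT value FROM metadata WHERE name = 'source'"
).fetchone()[0]
```

Keeping the metadata in its own table means an object that wraps the database can report where the annotation came from without touching the compound rows.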
Right, no need to make the pull request right now - better to wait, but good that you start looking at the code, otherwise it will be too much to look at ;)
OK, now I have all of the core stuff in place:
- `CompoundDb` S4 object to provide access to the related SQLite database files.
- `createCompoundDb` function to create a `CompoundDb` SQLite database file from a compound `tbl` (such as created by e.g. `generate_hmdb_tbl`).
- `createCompoundDbPackage` function to create an annotation package containing the `CompoundDb` SQLite database file (see e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0 for such a package).
- `compounds` function to extract compound data from a `CompoundDb` object.
- `src_compdb` function to allow accessing the `CompoundDb` data in `dplyr` style (see `?src_compdb`).
- `createCompoundDb` and `createCompoundDbPackage` functions.

Have a look at it @stanstrup and let me know if it's OK or if you'd like changes.
Thanks! This is awesome. And gulp! There is a lot to look at. It might take me some days.
Just a few Qs for now:
1) Is it better to have one package for each DB rather than one with a collection?
2) Are you sure about the HMDB license, i.e. that you can put up the DB? On the website it is indicated that you need permission. I have contacted them to hear what we can do.
Yes, sorry that I added so much ;) - I just wanted to make sure it is at a stage where it might be useful. And for the largest part it's documentation, comments and unit tests.
Re Qs: 1) I think yes. Reasons to keep the resources separate are:
2) No, I'm pretty sure that's not the correct license, but as long as the data is not used for commercial use it should be OK (they state: "Use and re-distribution of the data, in whole or in part, for commercial purposes requires explicit permission of the authors and explicit acknowledgment of the source material (HMDB) and the original publication (see below)"). Once they reply I have to fix the license. In the end we will probably place a license file specific to each annotation resource into the annotation package.