stanstrup opened this issue 7 years ago
From @jotsetung on October 19, 2017 17:01
Are you planning to add each resource (i.e. its data) to the package?
That was my plan, if it is feasible without violating licenses. Parsing, for example, the JSON from LipidBlast is very slow, and people need to download a 1.6 GB file, whereas the parsed table is only 1-2 MB in RDS format.
It is not very clear to me what the license situation is. As far as I know, simple data cannot be copyrighted; a simple table from a paper, for example, should always be copyright-free. But I am not sure what applies here.
From @jotsetung on October 20, 2017 6:49
The idea is to match compounds by (adduct) m/z, right?
So you'll have some columns (like mass, id and name) that are common and have to be present in all data resources, and you might have some data resource specific columns.
In that case I would change from a `data.frame` approach to an S4 class approach (see also issue #6). This would also hide internals (like the actual column names etc.) from the user.
Right. I was hoping not to have DB-specific columns, though, to be able to easily mix and match. What do you mean by hide internals?
From @jotsetung on October 20, 2017 7:55
Example to explain the hide-the-internals idea: this is the concept we were following for/in the `AnnotationFilter` and `ensembldb` packages. The user creates e.g. a `GenenameFilter` to search for entries matching a certain gene name. The methods that access the data in a database have to translate it to the correct column name, so it does not matter whether the name of the column in the database table is `gene_name`, `GeneName`, `genename` etc. The user doesn't have to bother what the name of the column might be or use different column names across different databases. An example here would be to have something like an `InchiFilter` that can be used to search for InChIs in the database...
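For illustration, a toy sketch of the idea (this is not the actual `AnnotationFilter` implementation; `InchiFilter` and `as_sql_condition` are made-up names here):

```r
## A filter object carries only a value and a condition; each backend
## translates it to whatever its own column happens to be called.
setClass("InchiFilter",
         representation(value = "character", condition = "character"),
         prototype(condition = "=="))

InchiFilter <- function(value, condition = "==")
    new("InchiFilter", value = value, condition = condition)

as_sql_condition <- function(filter, column = "inchi") {
    ## map the R-style condition to its SQL equivalent
    cond <- if (filter@condition == "==") "=" else filter@condition
    paste0(column, " ", cond, " '", filter@value, "'")
}

as_sql_condition(InchiFilter("InChI=1S/CH4/h1H4"))
## [1] "inchi = 'InChI=1S/CH4/h1H4'"
```

The user never sees the column name; a backend with a differently named column would just call `as_sql_condition(filter, column = "InChI")`.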
From @jotsetung on October 20, 2017 8:08
Regarding HMDB parsing: I did implement a simple parser to extract fields from HMDB's XML file(s): https://github.com/jotsetung/xcmsExtensions/blob/master/R/hmdb-utils.R. Use whatever you want/need.
Yeah, I saw that just now when poking around your repo. I actually also wrote one some years ago that I was planning to add. I have a suspicion that yours is smarter though... So should I eventually import from your package or copy?
If you don't enforce column names in the databases, won't it become difficult to mix them if, for example, you want data from both HMDB and LipidMaps for the annotation?
From @jotsetung on October 20, 2017 9:12
Re: code from `xcmsExtensions`: please copy what you need - I won't update/use that package anymore. Yours will be much better!
Re: enforcing column names - let's wait for your use case. I agree that common column names should be used.
From @jotsetung on October 20, 2017 12:43
Are you already working on the HMDB import function? Otherwise I could do that to start getting my hands dirty...
Nope, I am working on PubChem so that would be great.
@wilsontom Thanks! But you don't supply functions to actually generate the HMDB table? I'd like to have that in the package too so it is easy to update.
I basically have PubChem working. Trying to generate the table now. Takes a while though since PubChem is enormous. I wonder what the final size is gonna look like.
From @jotsetung on October 20, 2017 17:03
HMDB parsing is also on its way - I've just updated it to use the `xml2` package instead of the `XML` package.
I have been trying to write something feasible to handle PubChem using SQLite intermediates. The problem is that it is enormous: supposedly 130 million structures. Holding the final table in memory requires about 60 GB by my approximations. An RDS file would be ~7.5 GB, an SQLite file ~40 GB.
So three problems: 1) Do any of the usual hosting solutions even allow such a large file? 2) People cannot use it on a regular computer without loads of memory. 3) Expanding it to adducts would balloon it even more.
With an SQLite file, as far as I understand, you could subset it before it is read into R. I guess that might make it useful for something. I still don't know if that is feasible, and I wouldn't know where to host a 40 GB SQLite file.
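For illustration, a sketch of that kind of pre-subsetting with DBI/RSQLite (the file, table and column names are assumptions); only the matching rows ever reach R's memory:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "pubchem_compounds.sqlite")
## pull only compounds within a mass window instead of the whole table
hits <- dbGetQuery(con,
                   "SELECT cid, name, exactmass FROM compounds
                    WHERE exactmass BETWEEN :lo AND :hi",
                   params = list(lo = 194.05, hi = 194.11))
dbDisconnect(con)
```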
Thoughts?
From @jotsetung on October 23, 2017 3:50
The HMDB is added (see https://github.com/stanstrup/PeakABro/pull/10).
From @jotsetung on October 23, 2017 3:59
Re PubChem: for these large files a tibble-based approach is not feasible. SQL might do it - or on-the-fly access? Do they have a web API that could be queried? The approach I have in mind might also work here: define a `CompoundDb` S4 object and implement all of the required methods (`select` etc.) for it. For smaller databases these can access the internal SQLite database. We could then also implement a `PubChemDb` class that extends `CompoundDb`, and its `select` method could e.g. query the database online (if they provide an API) and return the results.
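A minimal sketch of that class hierarchy (all names tentative, and `compounds` stands in for whatever accessor is eventually chosen): the same generic call is answered by a local SQLite backend or, for `PubChemDb`, could be answered online.

```r
library(DBI)
library(RSQLite)

setClass("CompoundDb", representation(dbfile = "character"))
setClass("PubChemDb", contains = "CompoundDb")

setGeneric("compounds", function(x, ...) standardGeneric("compounds"))

## local SQLite backend
setMethod("compounds", "CompoundDb", function(x, ...) {
    con <- dbConnect(SQLite(), x@dbfile)
    on.exit(dbDisconnect(con))
    dbGetQuery(con, "SELECT * FROM compounds")
})

## online backend: same interface, different implementation
setMethod("compounds", "PubChemDb", function(x, ...) {
    stop("querying PubChem's PUG REST API is not implemented in this sketch")
})
```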
Regarding the adducts - I had a thought about that too: I wouldn't create all adducts for all compounds in the database, but rather go the other way round: calculate adducts from the identified chromatographic peaks instead. That would be more efficient, because there are presumably always fewer peaks to annotate than compounds in the database.
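For illustration, a minimal sketch of that peak-side approach (the function name is made up; the adduct mass shifts are the standard positive-mode values): compute the candidate neutral mass implied by each peak/adduct combination, then match those against the database's single exact-mass column.

```r
adducts <- c("[M+H]+"   = 1.007276,
             "[M+Na]+"  = 22.989218,
             "[M+NH4]+" = 18.033823)

## one row per peak/adduct combination with the implied neutral mass
candidate_masses <- function(mz, adducts) {
    data.frame(mz           = rep(mz, each = length(adducts)),
               adduct       = rep(names(adducts), times = length(mz)),
               neutral_mass = rep(mz, each = length(adducts)) - adducts)
}

candidate_masses(c(195.0877, 217.1045), adducts)
```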
Thanks for HMDB.
Re PubChem: PubChem does have an API: https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST_Tutorial.html#_Toc458584424. But it will be way too slow for this purpose, it seems to me: it would mean thousands of compound look-ups if you attempt to annotate a whole peak list. I wanted specifically to get away from the whole "look up one at a time" approach, so that once you have created your annotated peak list you can just browse around and see everything. I suggest we change to SQLite databases in general, such that larger databases can be accommodated in the same framework. I say we supply the function to generate the PubChem SQLite file but don't host it anywhere. To me, annotating with all of PubChem is not very useful anyway: you always get too many irrelevant hits.
Re: `CompoundDb`: I think it makes sense to have such an object.
Do you know if it is possible to cache generated data in the installed package folder?
What would be nice is if there was:

```r
CompoundDb <- generate_CompoundDb(dbs = c("HMDB", "LipidBlast"))
#> The LipidBlast database has not been generated (initialized is a better word?) yet.
#> Please run generate_db_lipidblast to create a cached database.
```
`generate_CompoundDb` would read the included SQLite files if they exist. If `generate_db_lipidblast` and friends could simply add the SQLite file for the specific database to the package folder, you'd need to generate each only once.
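A sketch of that proposed behaviour (function names from above; the cache path is a made-up placeholder): look for previously generated SQLite files and point to the generator function when one is missing.

```r
generate_CompoundDb <- function(dbs = c("HMDB", "LipidBlast"),
                                cache_dir = path.expand("~/.PeakABro")) {
    files <- file.path(cache_dir, paste0(dbs, ".sqlite"))
    missing <- dbs[!file.exists(files)]
    if (length(missing) > 0)
        stop("The ", paste(missing, collapse = ", "),
             " database(s) have not been generated yet. Please run ",
             "generate_db_", tolower(missing[1]),
             "() to create a cached database.")
    files  # the real function would open these and build the CompoundDb
}
```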
Re adducts: Yes you are right. That makes much more sense.
From @jotsetung on October 25, 2017 14:22
Re `CompoundDb` and caching - no, I don't think it's possible to cache anything in the package folder. I would keep the annotation data separate from `PeakABro`. What I would propose is the following: in the initial phase, provide some `CompoundDb` objects/SQLite databases within dedicated annotation packages (e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0). In the longer run, distribute them via `AnnotationHub`; check the following:
```r
> library(AnnotationHub)
> ah <- AnnotationHub()
updating metadata: retrieving 1 resource
  |======================================================================| 100%
snapshotDate(): 2017-10-24
> ## Look for a specific resource, like gene annotations from Ensembl; in our
> ## case we could then search e.g. for "CompoundDb", "HMDB"
> query(ah, "EnsDb.Hsapiens.v90")
AnnotationHub with 1 record
# snapshotDate(): 2017-10-24
# names(): AH57757
# $dataprovider: Ensembl
# $species: Homo Sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2017-08-31
# $title: Ensembl 90 EnsDb for Homo Sapiens
# $description: Gene and protein annotations for Homo Sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein",
#   "Annotation", "90", "AHEnsDbs")
# retrieve record with 'object[["AH57757"]]'
> ## retrieve the resource:
> edb <- ah[["AH57757"]]
require("ensembldb")
loading from cache '/Users/jo//.AnnotationHub/64495'
```
This means users could fetch the resource they want from `AnnotationHub`, and it will be cached locally. Does that make sense?
Now, I'd also like to keep separate `CompoundDb` objects/databases for different resources (e.g. HMDB, LipidBlast). Reason: that way you can version the resources, respectively the packages (see e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0). Different resources will never have the same release cycles, and versioning annotation resources is key to reproducible research.
This also means that you can't query multiple resources at the same time, but that shouldn't be a problem, should it?
That sounds very reasonable. It is a bit of a learning curve for me with the S4 objects, so I hope you have patience with me while I try to wrap my head around that.
I am wondering if it could also make sense to split the database stuff from the peak browser. It is getting more comprehensive than originally envisioned.
For the last question: it would probably be nice to be able to annotate with multiple databases at the same time. The objective is the browser in the end, where you'd want a single table with all the suggested annotations.
Any idea what to do with the very big databases? The PubChem SQLite file ended up being 43 GB.
From @jotsetung on October 25, 2017 14:46
Re annotation with multiple databases: one could annotate with a `CompoundDb` for each resource and `bind_rows` the results. Then you'll have the final table.
Re very big database: the only thing I could think of here is to use a central MySQL server hosted somewhere (eventually I could do that, not sure though). And here comes the power of the S4 objects: we simply define a `PubChemDb` object that extends `CompoundDb`. We would only have to implement the `compounds` or `src_cmpdb` (or the annotating function/method) accordingly. For the user it would be just like using a simple local SQLite-based `CompoundDb` object.
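A sketch of that flow, assuming a hypothetical `annotate_peaklist()` that returns one tibble of candidate annotations per `CompoundDb`; the `.id` column records which resource each hit came from:

```r
library(dplyr)

res_hmdb  <- annotate_peaklist(peaks, cmpdb_hmdb)
res_lipid <- annotate_peaklist(peaks, cmpdb_lipidblast)
annotated <- bind_rows(HMDB = res_hmdb, LipidBlast = res_lipid, .id = "source")
```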
Ah ok. If nothing prevents `bind_rows` then it is all good. EDIT: now I understand, bind the results. Yes, that works too.
Re very big database: I guess we can put that on the back burner for now and just provide the parser. PubChem is rarely really useful for annotation anyway.
From @jotsetung on October 26, 2017 9:19
> I am wondering if it could also make sense to split the database stuff from the peak browser. It is getting more comprehensive than originally envisioned.
Thinking it all over - eventually that might not be too bad an idea. I could focus on the database/data import stuff (with your help) and you could focus on the annotation, matching and browsing stuff.
Pros for splitting:
- it avoids function names like `create_CompoundDb`, i.e. mixing CamelCase with snake_case.
- `PeakABro` will become very slim (is that a con?)

@stanstrup, what do you think?
In the end this is probably the most efficient way to do this, so go ahead if you want.
From @jotsetung on October 27, 2017 7:36
OK, I'll make a repo and add you as a collaborator.
Do you want to move this issue and the other db-related ones using https://github-issue-mover.appspot.com?
From @jotsetung on October 27, 2017 12:30
Or should we just link to this issue? Whatever you prefer.
From @stanstrup on October 19, 2017 13:11
- Functions added to package
- License situation clarified

Please suggest.
Copied from original issue: stanstrup/PeakABro#2