stanstrup opened this issue 7 years ago
From @jotsetung on October 19, 2017 17:01
Are you planning to add each resource (i.e. its data) to the package?
That was my plan, if it is feasible without violating licenses. Parsing, for example, the JSON from LipidBlast is very slow, and people need to download a 1.6 GB file, whereas the parsed table is only 1-2 MB in RDS format.
It is not very clear to me what the license situation is. As far as I know, simple data cannot be copyrighted; a simple table from a paper, for example, should always be copyright-free. But I am not sure what applies here.
From @jotsetung on October 20, 2017 6:49
The idea is to match compounds by (adduct) m/z, right?
So you'll have some columns (like mass, id and name) that are common and have to be present in all data resources, and you might have some data resource specific columns.
In that case I would change from a `data.frame` approach to an S4 class approach (see also issue #6). This would also hide internals (like the actual column names etc.) from the user.
Right. I was hoping not to have DB-specific columns, though, to be able to easily mix and match. What do you mean by hide internals?
From @jotsetung on October 20, 2017 7:55
Example to explain the hide-the-internals idea: this is the concept we were following for/in the `AnnotationFilter` and `ensembldb` packages. The user creates e.g. a `GenenameFilter` to search for entries matching a certain gene name. The methods that access the data in a database have to translate it to the correct column name, so it does not matter whether the name of the column in the database table is `gene_name`, `GeneName`, `genename` etc. The user doesn't have to bother what the name of the column might be or use different column names across different databases. An example here would be to have something like an `InchiFilter` that can be used to search for InChIs in the database...
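For illustration, a toy sketch of the idea (this is not the actual `AnnotationFilter` implementation; `InchiFilter` and `as_sql_condition` are made-up names here):

```r
## A filter object carries only a value and a condition; each backend
## translates it to whatever its own column happens to be called.
setClass("InchiFilter",
         representation(value = "character", condition = "character"),
         prototype(condition = "=="))

InchiFilter <- function(value, condition = "==")
    new("InchiFilter", value = value, condition = condition)

as_sql_condition <- function(filter, column = "inchi") {
    ## map the R-style condition to its SQL equivalent
    cond <- if (filter@condition == "==") "=" else filter@condition
    paste0(column, " ", cond, " '", filter@value, "'")
}

as_sql_condition(InchiFilter("InChI=1S/CH4/h1H4"))
## [1] "inchi = 'InChI=1S/CH4/h1H4'"
```

The user never sees the column name; a backend with a differently named column would just call `as_sql_condition(filter, column = "InChI")`.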
From @jotsetung on October 20, 2017 8:08
Regarding HMDB parsing: I did implement a simple parser to extract fields from HMDB's XML file(s): https://github.com/jotsetung/xcmsExtensions/blob/master/R/hmdb-utils.R. Use whatever you want/need.
Yeah, I saw that just now when poking around your repo. I actually also wrote one some years ago that I was planning to add. I have a suspicion that yours is smarter though... So should I eventually import from your package or copy?
If you don't enforce column names in the databases, won't it become difficult to mix them if, for example, you want data from both HMDB and LipidMaps for the annotation?
From @jotsetung on October 20, 2017 9:12
Re: code from `xcmsExtensions`: please copy what you need - I won't update/use that package anymore. Yours will be much better!
Re: enforcing column names - let's wait for your use case. I agree that common column names should be used.
From @jotsetung on October 20, 2017 12:43
Are you already working on the HMDB import function? Otherwise I could do that to start getting my hands dirty...
Nope, I am working on PubChem so that would be great.
@wilsontom Thanks! But you don't supply functions to actually generate the HMDB table? I'd like to have that in the package too so it is easy to update.
I basically have PubChem working. Trying to generate the table now. Takes a while though since PubChem is enormous. I wonder what the final size is gonna look like.
From @jotsetung on October 20, 2017 17:03
HMDB parsing is also on its way - I've just updated it to use the `xml2` package instead of the `XML` package.
I have been trying to write something feasible to handle PubChem using SQLite intermediates. The problem is that it is enormous: supposedly 130 million structures. Holding the final table in memory requires about 60 GB by my approximations. An RDS file would be ~7.5 GB, an SQLite file ~40 GB.
So three problems: 1) Do any of the usual hosting solutions even allow such a large file? 2) People cannot use it on a regular computer without loads of memory. 3) Expanding it to adducts would balloon it even more.
With an SQLite file, as far as I understand, you could subset it before it is read into R. I guess that might make it useful for something. I still don't know if that is feasible, and I wouldn't know where to host a 40 GB SQLite file.
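For illustration, a sketch of that kind of pre-subsetting with DBI/RSQLite (the file, table and column names are assumptions); only the matching rows ever reach R's memory:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "pubchem_compounds.sqlite")
## pull only compounds within a mass window instead of the whole table
hits <- dbGetQuery(con,
                   "SELECT cid, name, exactmass FROM compounds
                    WHERE exactmass BETWEEN :lo AND :hi",
                   params = list(lo = 194.05, hi = 194.11))
dbDisconnect(con)
```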
Thoughts?
From @jotsetung on October 23, 2017 3:50
The HMDB is added (see https://github.com/stanstrup/PeakABro/pull/10).
From @jotsetung on October 23, 2017 3:59
Re PubChem: for these large files a tibble-based approach is not feasible. SQL might do it - or on-the-fly access? Do they have a web API that could be queried? The approach I have in mind might also work here: define a `CompoundDb` S4 object and implement all of the required methods (`select` etc.) for it. For smaller databases these can access the internal SQLite database. We could then also implement a `PubChemDb` class that extends `CompoundDb`, and its `select` method could e.g. query the database online (if they provide an API) and return the results.
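A minimal sketch of that class hierarchy (all names tentative, and `compounds` stands in for whatever accessor is eventually chosen): the same generic call is answered by a local SQLite backend or, for `PubChemDb`, could be answered online.

```r
library(DBI)
library(RSQLite)

setClass("CompoundDb", representation(dbfile = "character"))
setClass("PubChemDb", contains = "CompoundDb")

setGeneric("compounds", function(x, ...) standardGeneric("compounds"))

## local SQLite backend
setMethod("compounds", "CompoundDb", function(x, ...) {
    con <- dbConnect(SQLite(), x@dbfile)
    on.exit(dbDisconnect(con))
    dbGetQuery(con, "SELECT * FROM compounds")
})

## online backend: same interface, different implementation
setMethod("compounds", "PubChemDb", function(x, ...) {
    stop("querying PubChem's PUG REST API is not implemented in this sketch")
})
```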
Regarding the adducts - I had a thought about that too: I wouldn't create all adducts for all compounds in the database, but rather go the other way round: calculate adducts from the identified chromatographic peaks instead. That would be more efficient, because there are presumably always fewer peaks to annotate than compounds in the database.
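For illustration, a minimal sketch of that peak-side approach (the function name is made up; the adduct mass shifts are the standard positive-mode values): compute the candidate neutral mass implied by each peak/adduct combination, then match those against the database's single exact-mass column.

```r
adducts <- c("[M+H]+"   = 1.007276,
             "[M+Na]+"  = 22.989218,
             "[M+NH4]+" = 18.033823)

## one row per peak/adduct combination with the implied neutral mass
candidate_masses <- function(mz, adducts) {
    data.frame(mz           = rep(mz, each = length(adducts)),
               adduct       = rep(names(adducts), times = length(mz)),
               neutral_mass = rep(mz, each = length(adducts)) - adducts)
}

candidate_masses(c(195.0877, 217.1045), adducts)
```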
Thanks for HMDB.
Re PubChem: PubChem does have an API: https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST_Tutorial.html#_Toc458584424. But it will be way too slow for this purpose, it seems to me: it would mean thousands of compound look-ups if you attempt to annotate a whole peak list. I wanted specifically to get away from the whole "look up one at a time" approach, so that once you have created your annotated peak list you can just browse around and see everything. I suggest we change to SQLite databases in general, such that larger databases can be accommodated in the same framework. I say we supply the function to generate the PubChem SQLite file but don't host it anywhere. To me, annotating with all of PubChem is not very useful anyway: you always get too many irrelevant hits.
Re: `CompoundDb`: I think it makes sense to have such an object.
Do you know if it is possible to cache generated data in the installed package folder?
What would be nice is if there was:

```r
CompoundDb <- generate_CompoundDb(dbs = c("HMDB", "LipidBlast"))
#> The LipidBlast database has not been generated (initialized is a better word?) yet.
#> Please run generate_db_lipidblast to create a cached database.
```
`generate_CompoundDb` would read the included SQLite files if they exist. If `generate_db_lipidblast` and friends could simply add the SQLite file for the specific database to the package folder, you'd need to generate each only once.
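A sketch of that proposed behaviour (function names from above; the cache path is a made-up placeholder): look for previously generated SQLite files and point to the generator function when one is missing.

```r
generate_CompoundDb <- function(dbs = c("HMDB", "LipidBlast"),
                                cache_dir = path.expand("~/.PeakABro")) {
    files <- file.path(cache_dir, paste0(dbs, ".sqlite"))
    missing <- dbs[!file.exists(files)]
    if (length(missing) > 0)
        stop("The ", paste(missing, collapse = ", "),
             " database(s) have not been generated yet. Please run ",
             "generate_db_", tolower(missing[1]),
             "() to create a cached database.")
    files  # the real function would open these and build the CompoundDb
}
```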
Re adducts: Yes you are right. That makes much more sense.
From @jotsetung on October 25, 2017 14:22
Re `CompoundDb` and caching - no, I don't think it's possible to cache anything in the package folder. I would keep the annotation data separate from `PeakABro`. What I would propose is the following: in the initial phase, provide some `CompoundDb` objects/SQLite databases within dedicated annotation packages (e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0). In the longer run, distribute them via `AnnotationHub`; check the following:
```r
> library(AnnotationHub)
> ah <- AnnotationHub()
updating metadata: retrieving 1 resource
  |======================================================================| 100%
snapshotDate(): 2017-10-24
> ## Look for a specific resource, like gene annotations from Ensembl; in our
> ## case we could then search e.g. for "CompoundDb", "HMDB"
> query(ah, "EnsDb.Hsapiens.v90")
AnnotationHub with 1 record
# snapshotDate(): 2017-10-24
# names(): AH57757
# $dataprovider: Ensembl
# $species: Homo Sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2017-08-31
# $title: Ensembl 90 EnsDb for Homo Sapiens
# $description: Gene and protein annotations for Homo Sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein",
#   "Annotation", "90", "AHEnsDbs")
# retrieve record with 'object[["AH57757"]]'
> ## retrieve the resource:
> edb <- ah[["AH57757"]]
require("ensembldb")
loading from cache '/Users/jo//.AnnotationHub/64495'
```
This means users could fetch the resource they want from `AnnotationHub`, and it will be cached locally. Does that make sense?
Now, I'd also like to keep separate `CompoundDb` objects/databases for different resources (e.g. HMDB, LipidBlast). Reason: that way you can version the resources, respectively the packages (see e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0). Different resources will never have the same release cycles, and versioning annotation resources is key to reproducible research.
This also means that you can't query multiple resources at the same time, but that shouldn't be a problem, should it?
That sounds very reasonable. It is a bit of a learning curve for me with the S4 objects, so I hope you have patience with me while I try to wrap my head around that.
I am wondering if it could also make sense to split the database stuff from the peak browser. It is getting more comprehensive than originally envisioned.
For the last question: it would probably be nice to be able to annotate with multiple databases at the same time. The objective is the browser in the end, where you'd want a single table with all the suggested annotations.
Any idea what to do with the very big databases? The PubChem SQLite file ended up being 43 GB.
From @jotsetung on October 25, 2017 14:46
Re annotation with multiple databases: one could annotate with a `CompoundDb` for each resource and `bind_rows` the results. Then you'll have the final table.
Re very big database: the only thing I could think of here is to use a central MySQL server hosted somewhere (eventually I could do that, not sure though). And here comes the power of the S4 objects: we simply define a `PubChemDb` object that extends `CompoundDb`. We would only have to implement the `compounds` or `src_cmpdb` (or the annotating function/method) accordingly. For the user it would be just like using a simple local SQLite-based `CompoundDb` object.
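A sketch of that flow, assuming a hypothetical `annotate_peaklist()` that returns one tibble of candidate annotations per `CompoundDb`; the `.id` column records which resource each hit came from:

```r
library(dplyr)

res_hmdb  <- annotate_peaklist(peaks, cmpdb_hmdb)
res_lipid <- annotate_peaklist(peaks, cmpdb_lipidblast)
annotated <- bind_rows(HMDB = res_hmdb, LipidBlast = res_lipid, .id = "source")
```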
Ah ok. If nothing prevents `bind_rows` then it is all good. EDIT: now I understand, bind the results. Yes, that works too.
Re very big database: I guess we can put that on the back burner for now and just provide the parser. PubChem is rarely really useful for annotation anyway.
From @jotsetung on October 26, 2017 9:19
> I am wondering if it could also make sense to split the database stuff from the peak browser. It is getting more comprehensive than originally envisioned.
Thinking it all over - eventually that might not be too bad an idea. I could focus on the database/data import stuff (with your help) and you could focus on the annotation, matching and browsing stuff.
Pros for splitting:
- it avoids function names like `create_CompoundDb`, i.e. mixing CamelCase with snake_case.
- `PeakABro` will become very slim (is that a con?)

@stanstrup, what do you think?
In the end this is probably the most efficient way to do this, so go ahead if you want.
From @jotsetung on October 27, 2017 7:36
OK, I'll make a repo and add you as a collaborator.
Do you want to move this issue and the other db-related ones using https://github-issue-mover.appspot.com?
From @jotsetung on October 27, 2017 12:30
Or should we just link to this issue? Whatever you prefer.
From @stanstrup on October 19, 2017 13:11
- Functions added to package
- License situation clarified

Please suggest.
Copied from original issue: stanstrup/PeakABro#2