refactor: add RaMP class and utilities to download dbs

jorainer commented 1 year ago

Add the RaMP class and constructor function.
Add listRaMPVersions function.
Add utility functions to cache and manage RaMP databases.

This PR contains a suggestion how to maybe manage different versions/releases of RaMP-DB:

The RaMP (S4) object is thought to contain the connection to a RaMP database.
The RaMP constructor function allows to either (download and) connect to a cached (SQLite) database, or connect to a e.g. MySQL database (if the connection information is provided with parameter dbcon).
Downloading of the database does no longer happen automatically when the user loads the package, but the first time the user calls RaMP().
The RaMP object could be passed to each function that can then use the provided SQL connection to query the database.

Example use case:

> library(RaMP)
Loading Ramp version: 2.3.2
No cached RaMP databases present. Use 'RaMP()' to download the most recent release. See '?RaMP' for more information.
> ## List available locally cached RaMP database versions
> listRaMPVersions(local = TRUE)
charater()
> ## List available remote RaMP database versions
> listRaMPVersions()
[1] "2.3.0"
> ## Initialize the RaMP object; without providing a version the latest version will be loaded/downloaded by default
> db <- RaMP()
Downloading RaMP-DB version 2.3.0
  |==================================================================================| 100%
> db
RaMP release 2.3.0 
> listRaMPVersions(local = TRUE)
[1] "2.3.0"
> ## Calling it again will load it from cache
> db <- RaMP()
Loading RaMP-DB version 2.3.0 from cache.

The actual analysis functions could then have an additional parameter ramp = RaMP() and they could use the provided database connection (in slot @dbcon) to query the database. A call could thus be:

getAnalyteFromPathway(ramp = db, pathway = "sphingolipid metabolism")

Internally, getAnalyteFromPathway could use the ramp@dbcon SQL connection to query the database.

I know this would be quite some change to the current code in the package - but IMHO it would provide users some more control over which RaMP database and version they want to use. If more RaMP-DB releases become available it would allow them for example to reproduce older analyses by specifically set the RaMP-DB version to use with db <- RaMP("2.3.0").

Happy to discuss further :)

jorainer commented 1 year ago

Before actually hacking into the other functions I wanted to first get your general thoughts on this setup/approach.

johnbraisted commented 1 year ago

Hi Johannes,

Nice work. Thank you. I had simultaneously removed the DB fetch from the .onLoad(...) in zzz.R and have about 8 commits since you forked. I think it's OK. I have the check and fetch in setRampConnection(...). But, having the class to manage this seems like a good option to me. The use of a DB object reminds of a 'mart' in biomaRt or Seurat or DESeq objects in those packages.

I have a runQuery(sql) method that does the actual query. We could pass the ramp db object into there. I kind of like having the ramp db connection information in the RaMP object as opposed to the connection. Open a connection doesn't have that much overhead, in the scheme of things. Our API uses the package and will continue to use MySQL, I think. MySQL might be best for the API (and web UI that uses the API) for concurrent users.

I like having the RaMP object. Currently we have pkg.globals global variable (environment). RaMP could serve that purpose. I'm not sure if there is a benefit or cost to either approach.

Right now the db_version table holds the figshare url, just as a reference so that we can refer to it on our website. The DESCRIPTION file holds the URL to the github LFS sqlite db. *I pushed a new one out with just a slight change.

We also used to have Rdata objects in the package that had to be synchronized with DB version for compatibility. Those objects are now retrieved from the DB during setConnectionToRaMP, if they are not already set. I think they used be be instantiated from sysdata.rdb, which is now removed. Those data frames could be accessed by 'RaMP:::'. My current implementation has these as global variables, but I guess that they could be slots in the RaMP object as a container.

I'll connect by email. Overall I like the changes. I can't merge them now. Maybe you could do a 'pull' on your side if possible. It may just be that I don't have zzz.R right now, but reporting on the RaMP package and DB versioning is something to keep in there.

Thanks again.

jorainer commented 1 year ago

Thanks for the feedback!

For RaMP object vs pkg.globals - coming from the Bioconductor worls the RaMP object feels more natural. That's also how things are done in GenomicFeatures (-> TxDb objects providing gene/transcript annotations) or similar. To me the main advantage would be that the user could have eventually two RaMP objects each pointing to a different version of RaMP-DB (will not be the most common use case though).

I will change the RaMP object to contain just the connection information and not the connection itself and update the PR. But maybe better to catch up also via email to see if my suggestion is actually worth to follow.

johnbraisted commented 1 year ago

Thanks for your input on pkg.global (environments) vs the RaMP class as a container. It's helpful to have your perspective on standard practices for Bioconductor. I'll write to you on email and 'cc Ewy, Andy Patt and Tim Sheils who develop aspects of RaMP. Tim developed and maintains our web client and API. I think those can be easily adapted. Andy Patt has worked on RaMP for a while, since back at Ohio State University. He's primarily leveraging RaMP for a number of analytical approaches that are not currently in RaMP, kind of like extensions. I want to keep them in the loop as things progress. Thanks.

jorainer commented 1 year ago

@johnbraisted I refactored the class to store the credential information in the object (instead of the database connection). For your API and the web instance it would be possible to create a RaMP object e.g. using (note the .RaMP is a package internal function not supposed to be used by the standard user - in contrast to the RaMP function (without the .)):

library(RMariaDB)
db <- RaMP:::.Ramp(drv = MariaDB(), dbname = ...)

For the standard usage (using the SQLite databases) it works at the moment like this:

> library(RaMP)
> listRaMPVersions()
[1] "2.3.0"
> db <- RaMP("2.3.0")
Downloading RaMP-DB version 2.3.0
  |==================================================================================| 100%

> db
RaMP 

> runQuery("select distinct ramp_version from db_version", db)
  ramp_version
1        2.3.0
> db2 <- RaMP()
Loading RaMP-DB version 2.3.0 from cache.
>

The runQuery gained a second parameter ramp that allows to pass the RaMP object. Would be good that also all other functions that retrieve data from the database have this parameter. Generally, it would be cleaner to have this parameter as a first parameter, but I did not want to eventually break the current usage.

Below I discuss/describe/explain some of the other changes I made. Feel free to change/adapt/delete any of my code.

johnbraisted commented 1 year ago

Accepted changes. Will refactor connection parts of ramp object and start to update methods to use the ramp object.

jorainer commented 1 year ago

Thanks John. Please let me know if I can provide/contribute anything else!

johnbraisted commented 1 year ago

Hi Johannes. Thanks. I'm refactoring the methods to add the ramp object. So far testing is good. I'm testing mysql and sqlite side-by-side. I'm also adding the method to capture those Rdata objects that support enrichment, storing within the RaMP object. I'll just take care of those, test and make sure it's stable. I'll send an email when things settle.

ncats / RaMP-DB

refactor: add RaMP class and utilities to download dbs #57