ropensci / taxizedb

Tools for Working with Taxonomic SQL Databases

Docker image for taxize bundled with SQL Dbs #1

Closed sckott closed 4 years ago

sckott commented 8 years ago

perhaps we can provide two docker images.

Requires the user to be familiar with Docker, but if they are, then they don't have to deal with downloading/loading up each SQL DB.

moved from ropensci/taxize#408

rossmounce commented 7 years ago

Cool, this would work for me, although doing it natively would be preferred. Anything that works, though!

sckott commented 7 years ago

what do you mean by natively?

rossmounce commented 7 years ago

not having to use Docker. Just R + DB?

sckott commented 7 years ago

Sure, that's what this pkg will do without Docker. This issue is about making Docker another option for those who want to use it.

cboettig commented 7 years ago

Cool beans. I think all this would take is writing a docker-compose.yml file. It would launch containers for postgres & mysql with a command telling them to import the relevant SQL dumps. It would also launch a rocker container (e.g. rocker/ropensci, though rocker/tidyverse might be sufficient) and link it to the DB containers.

From the user's perspective, if they had Docker installed they would just run docker-compose up -d and visit localhost:8787 to access RStudio; no need to know anything more about Docker, and no need to know anything at all about installing and deploying SQL databases. Those databases should then be accessible to RStudio in the usual way.

cboettig commented 7 years ago

e.g. I think it's easier/more flexible to maintain a single docker-compose.yml file than a custom Dockerfile, but maybe not. The idea of building a single container that has the databases, the data files, and R/RStudio all in one spot does have its appeal.

sckott commented 7 years ago

I think all this would take is writing a docker-compose.yml

sorry for my ignorance, but does that mean not a docker container then? If so, just put the docker-compose.yml file in a repo somewhere and point people to it?

cboettig commented 7 years ago

True, though that location could just be inst/docker/docker-compose.yml of this package to keep it all together. Might look something like:

version: "3"
services:
  mysql:
    image: mysql
    volumes:
      - $HOME/data/:/var/lib/mysql
    environment:
      - MYSQL_ROOT_PASSWORD=root
    restart: always

  rstudio:
    image: rocker/ropensci
    links:
      - mysql

Because the image lines all refer to containers already available on Docker Hub, there's no need for a custom Dockerfile / custom container. The only work is in linking up the containers.

This does leave open just how you want to handle accessing the SQL dumps. e.g. have the user download them to a working dir? Alternatively, you could use a custom "data" container for that, where the dumps have all been imported into the appropriate mysql / postgres databases. Then you would have something like:

version: "3"

services:
  mysql:
    image: mysql
    volumes:
      - data-volume:/var/lib/mysql
  rstudio:
    image: rocker/ropensci
    link: db

volumes:
  data-volume:
    image: ropensci/custom-data:1.0

Again, the only real point of this approach is to have the software come from 'standard' containers; the only added work is linking everything up.

You could also drop the rocker/rstudio part, and just have the databases expose ports to localhost. That might be the cleanest, since then a user who already has R installed could call docker-compose from R to deploy the databases, and taxize could connect to them over localhost rather than over the link. Maybe that's the most sensible thing?
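
e.g., a minimal sketch of that localhost variant (the port mapping and credentials here are just illustrative):

version: "3"
services:
  mysql:
    image: mysql
    ports:
      - "3306:3306"   # expose MySQL on the host's localhost:3306
    environment:
      - MYSQL_ROOT_PASSWORD=root

and then from R, something like:

# illustrative connection over localhost (DBI and RMySQL assumed installed)
con <- DBI::dbConnect(RMySQL::MySQL(),
                      host = "127.0.0.1", port = 3306,
                      user = "root", password = "root")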

sckott commented 7 years ago

thanks for this! I'll try this out soon.

Agree, including it in the package makes the most sense.

You could also drop the rocker/rstudio part, and just have the databases expose ports to localhost.

I like that. Though in the end I might want to support both that and the rocker/rstudio approach, for those that want it.

sckott commented 7 years ago

@cboettig for the databases themselves, what's the best approach in the docker-compose context?

Here I'm thinking of going the route of exposing to localhost instead of spinning up RStudio.

cboettig commented 7 years ago

Good questions. Right, I think that's the best strategy. These two things can really be two sides of the same coin: you have a Dockerfile based on the relevant database docker image (e.g. mysql) which reads in the data on build. Probably use one of these for all the mysql-based dbs, another for all the postgres dbs, etc.
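
As a rough sketch (the dump file name is a placeholder): the official mysql image imports any .sql files placed in /docker-entrypoint-initdb.d when the container first initializes its data directory, so the Dockerfile could be as simple as:

FROM mysql:5.7
# illustrative credentials/database name
ENV MYSQL_ROOT_PASSWORD=root
ENV MYSQL_DATABASE=itis
# dumps copied here are loaded automatically on first startup
COPY itis.sql /docker-entrypoint-initdb.d/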

Then in your docker-compose you can either have it build the images locally (e.g. the build directive in compose), or you can point Docker Hub at the Dockerfiles on GitHub and have it build them ahead of time, so the user just has to download. (Given that download time is probably the main element of build time anyhow, this probably doesn't save the user much time, so maybe building locally is best, since it ensures a more up-to-date solution.)
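
In compose terms, the two options look something like this (paths and image names illustrative):

services:
  itis:
    build: ./docker/itis        # option 1: build locally from the Dockerfile
    # image: ropensci/db-itis   # option 2: pull a pre-built image instead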

It's possible to go a different route and build a data-only container ahead of time, and then just link vanilla database containers (with no extra build) to it, but I think this might actually be more clumsy: e.g. you'd need a bash script to create the data container, because it would be using separate mysql/postgres containers to populate the volume, and the data container itself would have basically no software on it (e.g. FROM busybox). I think this can also create trouble if dbs are a different version at runtime than what you used to build, so the above is probably easier.

sckott commented 6 years ago

@arendsee What are your thoughts on providing docker as an option for some or all of the databases? I've started to play with docker images for ITIS https://hub.docker.com/r/ropensci/db-itis/ and GBIF https://hub.docker.com/r/ropensci/db-gbif/

Seems like a docker option is especially useful for data sources that require mysql or postgres, since they're harder to set up than sqlite. But even with sqlite, that's another thing a user has to install outside of R (or maybe sqlite comes bundled with RSQLite?).

If we want to do this, perhaps a new set of fxns, or perhaps just parameters on the download and load functions to toggle using Docker or not. Docker would also make usernames/passwords less of a barrier, since we'd just have those set in the images (e.g., user postgres and pwd postgres for a postgres db).
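
e.g., a purely hypothetical sketch of the parameter route (the docker argument and the docker invocation are illustrative, not an actual API):

# hypothetical sketch, not the actual taxizedb API
db_download_itis <- function(docker = FALSE) {
  if (docker) {
    # run the pre-built image, with credentials baked in
    system("docker run -d -p 5432:5432 ropensci/db-itis")
  } else {
    # existing behavior: download the SQL dump and load it locally
  }
}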

arendsee commented 6 years ago

@sckott I've heard a lot about Docker, but haven't used it. As far as dependencies go, the user would still have to set up Docker on their system, though this is probably easier than setting up mysql and postgres. I've worked with several people on my end who use the NCBI functions of taxizedb, and they have not had issues.

So, I think NCBI is fine without Docker, but your reasons for using Docker for the other databases seem solid.

sckott commented 6 years ago

Yeah, the user still has to install Docker. And yes, I think (?) Docker will be easier to set up than mysql/postgres (@cboettig do you agree?)

Glad to hear there have been no problems so far with the NCBI stuff.

cboettig commented 6 years ago

Yeah, I think that's true; not sure if the docker gotchas on recent windows-ce are resolved, though.

Maybe we want to explore other options as well. How big are the databases? Maybe it's easier to export them into an sqlite file instead, or fst if they'll fit in memory?

arendsee commented 6 years ago

The NCBI SQLite database is 1.4G

sckott commented 6 years ago

Some are pretty small, but I think COL is like 5 or 6 GB.

cboettig commented 6 years ago

At those sizes it might make sense to offer the files in fst and sqlite format, since it could simplify working with them (compression in fst is also impressive for file transfers). But being able to plug directly into postgres/mysql makes sense as well.
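
e.g., reading/writing fst from R is just (taxa_df here is a placeholder data frame):

library(fst)
# compress = 100 gives maximum compression, handy for file transfers
write_fst(taxa_df, "ncbi_taxa.fst", compress = 100)
taxa <- read_fst("ncbi_taxa.fst")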

FWIW, I've just added opt-in support to rdflib to use postgres, mysql, or sqlite for the backend data storage, and put together a little docker-compose.yml to facilitate testing it (both on circle-ci and locally). Check out https://github.com/ropensci/rdflib/blob/master/docker-compose.yml (also has virtuoso, but that's probably not as relevant here). Locally you can just run:

docker-compose run rdflib R

to drop into an R console connected to each of the databases running in external containers.

sckott commented 6 years ago

nice, thanks @cboettig