pachadotdev / analogsea

Digital Ocean R client
https://pacha.dev/analogsea/
Apache License 2.0

RStudio project to DO #17

Closed sckott closed 9 years ago

sckott commented 10 years ago

@hadley mentioned possibly sending the current RStudio project to DO via this pkg, maybe using packrat to get the dependencies.

Thoughts, @kevinushey?

kevinushey commented 10 years ago

What level of granularity do you need? Do you want to send up the project + a manifest file that describes where the package sources came from (and how they can be retrieved), or a project + R package sources (which could then be installed from source on the target machine)?

Either way, the workflow using packrat would be something like:

  1. Call packrat::snapshot() to generate the packrat lockfile, as well as download package sources,
  2. Send the bundle up to DO (could generate using packrat::bundle(), or just plain old tar),
  3. Call packrat::unbundle() (or just untar) to unbundle the project,
  4. Use packrat::restore() to rebuild the library the project is using.

But depending on how much machinery within packrat you want to use, you might only want packrat::snapshot() or even packrat:::appDependencies() (which just returns a vector of package names).
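For concreteness, that round trip might look something like this in R (a sketch only; "~/myproject" is a placeholder path, and the project is assumed to already be packrat-managed via packrat::init()):

# on the local machine
library(packrat)
packrat::snapshot("~/myproject")   # step 1: write packrat.lock and fetch package sources
packrat::bundle(project = "~/myproject", file = "~/myproject.tar.gz")   # step 2

# copy the bundle to the droplet (e.g. with scp), then on the droplet:
packrat::unbundle("~/myproject.tar.gz", where = "~")   # step 3: recreate the project
packrat::restore("~/myproject")    # step 4: a no-op if unbundle already rebuilt the library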

kevinushey commented 10 years ago

We'd definitely be interested in making packrat usable for something like this, so if you think packrat is missing a necessary piece here, we can try to implement something.

sckott commented 10 years ago

@kevinushey Thanks for your thoughts.

I hadn't thought much about granularity yet, but I imagine some flexibility in what a user sends would be best.

I'll start trying the integration, and get back to you if I have any questions.

cboettig commented 10 years ago

@sckott Just saw you have install capabilities on this, that's really cool.

I'm curious if you'd consider the docker approach to getting this going. I think that would have several advantages: (1) It would require way less code, (2) it would be way faster because nothing would have to compile, and (3) it would avoid having the provisioning details baked into the R functions.

Running these two shell commands on any DO machine will give you RStudio Server at port 8787 with the username and password specified:

curl -sSL https://get.docker.io/ubuntu/ | sudo sh
sudo docker run -d -p 8787:8787 -e USER=<username> -e PASSWORD=<password> cboettig/rstudio

This will install git as well. The cboettig/ropensci image adds pandoc + LaTeX and the hadley packages that take a long time to install from source, like dplyr with all of its Suggests.

Someone could extend this for their own packages using packrat (or with a custom docker image); the same flow is also scriptable from R, as sketched below.
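A rough R-side sketch using the docklet_* helpers that grew out of the docker discussion (see #18 below; the function arguments shown here are illustrative, not gospel):

library(analogsea)

d <- docklet_create()                    # create a droplet with docker pre-installed
d %>% docklet_rstudio(user = "rstudio",  # pull and run an RStudio image,
                      password = "secret")  # served on port 8787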

Anyway this is really cool, nice work.

sckott commented 10 years ago

@cboettig right, working on installation scripts, but I still need to fix the swapping issue #15 that is preventing packages from installing correctly.

Yeah, I was gonna ask on Twitter if you thought we should have a docker integration here.

Opened a new issue for docker... #18

karthik commented 10 years ago

@cboettig all this would happen on DO once the droplet is set up? Because none of this is possible on boot2docker (the OS X port).

cboettig commented 10 years ago

@sckott Yeah, you're hitting swap/memory issues compiling code. You'll bypass all of those issues with docker, because you just install binaries. If you launch the droplet with the docker application already installed, then you just need the second line.

@karthik Once the droplet is created, yes. I just ran this on the smallest droplet; it took about 2 minutes (almost all of that is downloading the 3+ GB image, but the DO network seems super snappy).

I'm really curious what happens with your boot2docker. Where does it fail? Scott & I were able to get this up and running on his Mac. boot2docker is mostly just a lightweight VirtualBox VM, but it's possible the minimal VirtualBox provisioning is too restrictive.

I created a Vagrantfile which uses the boot2docker image, to give me a bit more control over RAM and CPU and to let me test on Linux. That seems to work in my tests, so you might try it: https://github.com/ropensci/docker/tree/master/vagrant

cboettig commented 10 years ago

@karthik The default memory on boot2docker is 512 MB, the same as on the smallest DigitalOcean droplet. It looks like RStudio eats most of that, so things won't work well.

Like Scott suggests in #15 for DigitalOcean, you can just add a swap file if you don't want to allocate more memory to VirtualBox (though this will be slower than adding more memory).

@sckott Just tested running some install.packages() calls with my docker image, and yeah, I hit memory limits too (e.g. running install.packages("taxize")). Turning on swap on the droplet fixed this, and the result was a lot snappier than I expected:

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Now install.packages() runs happily on the smallest droplet.
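The same setup can also be driven from R with analogsea's droplet_ssh() (a sketch; d is assumed to be an existing droplet object reachable with your ssh key, and note that swapon alone does not persist across reboots without an /etc/fstab entry):

library(analogsea)

# droplet_ssh() runs commands as root by default, so no sudo is needed
d %>% droplet_ssh(
  "fallocate -l 4G /swapfile",
  "chmod 600 /swapfile",
  "mkswap /swapfile",
  "swapon /swapfile"
)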

karthik commented 10 years ago

I'm really curious what happens with your boot2docker. Where does it fail? Scott & I were able to get this up and running on his Mac. boot2docker is mostly just a lightweight VirtualBox VM, but it's possible the minimal VirtualBox provisioning is too restrictive.

Somewhat surprised that Scott got this to work. Sure this wasn't on your Linux box @sckott?

sckott commented 10 years ago

@cboettig did you get other folks with Macs to try the docker workflow? I can check again if it only worked for me.

sckott commented 10 years ago

@cboettig Nice, thanks for the tips on swap memory, good to know it works

cboettig commented 10 years ago

@emhart tested on mac w/ boot2docker, no problem



karthik commented 10 years ago

Worked after upgrading docker.

sckott commented 10 years ago

@kevinushey I have a draft function do_packrat() (see https://github.com/sckott/analogsea/blob/master/R/do_packrat.R). It sort of works, but it seems the unbundled packrat project isn't available in an RStudio Server instance set up on the same machine. packrat::unbundle() ends with

The project has been unbundled and restored at: - "/root/dopackplay"

This is available when I log in via the shell, but not in the RStudio Server instance on the same machine. Should I be doing something different so the packrat project is available in RStudio Server? I think I'm missing something obvious here.

kevinushey commented 10 years ago

Could there be something like chroot going on (so that the RStudio Server instance doesn't see parent directories above some level)? Or is the root directory seen in RStudio Server the same as in your shell?

sckott commented 10 years ago

Hmm, when I try to get to that /root dir in the server instance there seems to be nothing there, even though I know it's there. So yeah, perhaps there is something with chroot; I'll look into it.
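One thing worth ruling out first (an assumption, not something confirmed in this thread): RStudio Server sessions run as an ordinary non-root user, and /root is normally readable only by root, so a project unbundled there would be invisible to the session even without any chroot. A hypothetical workaround, with "rstudio" as a placeholder username and a made-up bundle path:

# unbundle somewhere the RStudio Server user can actually read,
# e.g. that user's home directory
packrat::unbundle("/root/dopackplay.tar.gz", where = "/home/rstudio")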

behrica commented 9 years ago

As mentioned in my comments on #48, you might want to investigate using CoreOS to install the RStudio image. It's lighter/faster and lets you pass user_data at creation time, which allows docker images to be started at boot time.

But this is then the "one droplet, one docker image" scenario: you would not add RStudio to an existing droplet; instead, at droplet creation time / first boot, the docker image is started automatically.

hadley commented 9 years ago

@behrica that seems like a substantially more restrictive workflow - I think it's useful in some scenarios, but being able to run multiple containers on a droplet seems pretty important.

behrica commented 9 years ago

I tried the existing code and it works well. I can see great potential to simplify the use of docker for non-technical people this way.

I still think that, for the aim of "every researcher using R should be able to use docker/RStudio on DO", the currently implemented approach has one big drawback: it needs working ssh.

For a lot of people this is a no-go. If you are behind a restrictive firewall (like me at work), you need to be very creative to get ssh working, or request an exception from the admins for each new IP.

One reason I was rather excited to use the DigitalOcean API + CoreOS + user_data was that it does not need ssh at all. So it works for people behind restrictive firewalls, as it is all https based.

I will experiment with the existing DigitalOcean API support here, see what I can do without using the docklet_xxx methods, and let you know.

behrica commented 9 years ago

I made a little demo here https://github.com/behrica/analogsea/tree/demoCoreOS

which uses the droplet_new function with a CoreOS image and passes user_data at creation time. This configures CoreOS to start the docker image at boot time.

This does not need any ssh access (an ssh key needs to be given just to make DigitalOcean happy; the API refuses to create CoreOS droplets without one).

This should work similarly with Amazon EC2, as it supports user_data as well.

This approach removes the requirement of a working ssh connection.

Another advantage of CoreOS seems to be the reduced droplet creation time on DigitalOcean (at least for me: 10 seconds instead of around 50).
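A schematic version of the demo (the image slug, the cloud-config contents, and the droplet_new arguments are all illustrative; see the demo branch above for the real code):

library(analogsea)

cloud_config <- "#cloud-config
coreos:
  units:
    - name: rstudio.service
      command: start
      content: |
        [Unit]
        Description=RStudio Server container
        After=docker.service
        [Service]
        ExecStart=/usr/bin/docker run -p 8787:8787 -e USER=rstudio -e PASSWORD=secret cboettig/rstudio
"

d <- droplet_new(
  image     = "coreos-stable",   # a CoreOS image slug
  ssh_keys  = "my-key",          # required by the API, but never used for login
  user_data = cloud_config       # starts the container at first boot
)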

hadley commented 9 years ago

That's a compelling use case. But how do you communicate with the server once it's running, without ssh? I.e., given a droplet running CoreOS (a corelet?), how can you send it an R script to execute? How can you create another docker container inside the main one?

behrica commented 9 years ago

I would have the R code in the image...

So I send somebody R code + data + the execution environment to reproduce my analysis, all of this bundled in a docker image ("sending" meaning sharing via a docker registry).

My collaborator would do all further work directly in RStudio in the browser, where they could upload other files or data if needed.

I see this clearly in the context of reproducible research. I have finished a study and want to give others the possibility to "look at it". Instead of sending them a zip file with R code or pointing them to GitHub, I send them the IP address of a DigitalOcean droplet containing code, data, and the runtime, ready to use (or they create it themselves; it depends on who should pay).

Regarding the communication with the docker container, only two ways are needed:

The one single RStudio container in the droplet has its lifecycle bound to the droplet. It's a "throw-away server": you start it with the docker image, use it, throw it away.

The advantage of docker here is that I could send the SAME docker image to another colleague who has docker on their PC, so they can use it there without paying for the cloud server.

I think the metaphor here is that I want to send all my "reviewers" a digital copy of my PC (including code, data, and execution environment), and they just need a power plug to use it. Using docker instead of a VM image is just for storage efficiency: since docker images can be based on a few large base images, the individual images "for each study" can be small.

What remains to be seen is whether we want the collaborator themselves to be able to create, from inside RStudio, a new version of the docker image they are working on. I saw you discussed this here as well. This might not be possible without ssh.

sckott commented 9 years ago

Closing this: it's a very general issue with a variety of discussions. Do open more specific issues if needed.