pachadotdev / analogsea

Digital Ocean R client
https://pacha.dev/analogsea/
Apache License 2.0

Function for batch execution on a droplet? #25

Closed cboettig closed 10 years ago

cboettig commented 10 years ago

It would be cool if a user could send a slow or compute-intensive function call off to be executed on a droplet and have the answer returned to the local R console.

Ideally this function would spawn the droplet at the beginning (installing R, etc.) and destroy it at the end. This would let a user send computationally intensive bits to the cloud mid-script and know that the cloud instance wouldn't run any longer than necessary.

This would be similar in spirit to the way the segue R package works: http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
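
Roughly, the imagined usage might look something like this (the function name and arguments here are hypothetical, just to illustrate the idea):

    # hypothetical interface, not an existing analogsea function
    result <- droplet_eval({
      replicate(500, mean(rnorm(1e6)))   # slow, CPU-bound work
    }, size = "8gb")
    # a droplet is created, R is installed, the code runs there,
    # the value comes back, and the droplet is destroyed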

sckott commented 10 years ago

@cboettig Nice, I like this idea. I've often thought it would be nice to send off jobs piecemeal, and quickly without having to go through lots of steps.

All those steps seem doable. More later

sckott commented 10 years ago

working on this...

sckott commented 10 years ago

@cboettig Worked more on this.

Not sure of the best way to gather the output of code sent over to the droplet. Right now I'm calling save.image() to save everything in the workspace to a .RData file, pulling that back to the local machine, and then load()ing it.
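
For concreteness, a minimal sketch of that round trip, assuming passwordless ssh/scp access to the droplet and driving it with system() (the helper name and paths are made up):

    # sketch only: run code on the droplet, save.image() the workspace,
    # pull the .RData back, and load() it into the local session
    run_on_droplet <- function(ip, code_lines) {
      script <- tempfile(fileext = ".R")
      writeLines(c(code_lines, 'save.image("/tmp/out.RData")'), script)
      system(sprintf("scp %s root@%s:/tmp/job.R", script, ip))
      system(sprintf("ssh root@%s 'Rscript /tmp/job.R'", ip))
      system(sprintf("scp root@%s:/tmp/out.RData .", ip))
      load("out.RData", envir = globalenv())
    }
    # e.g. run_on_droplet("203.0.113.5", "x <- replicate(100, mean(rnorm(1e6)))")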

cboettig commented 10 years ago

You might take a look at how knitr is doing this wrt caching?



hadley commented 10 years ago

For good performance, I think you'll need to treat creating a droplet and executing code on it as two separate steps. (Otherwise it's going to be really hard to get a performance improvement, because you'll be swamped by the cost of spinning up a new droplet.)

In general, it will be really hard to get performance gains out of this - it will be helpful for CPU-bound stuff (like simulations), but for anything with a lot of data the cost will be dominated by data transfer.

sckott commented 10 years ago

@hadley Good points. Yeah, the ideal situation seems like one with no input data and simulation-intensive code that doesn't output a lot of data either. Two steps is probably a better approach than the current all-in-one step.

hadley commented 10 years ago

It might also be worthwhile to set up something like doRedis as a general framework for sending code to the droplet.
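
For reference, the doRedis pattern would look roughly like this - assuming a redis server that both the local session and the droplet can reach (the host name below is a placeholder):

    # local session: register the queue and submit jobs with foreach()
    library(doRedis)
    library(foreach)
    registerDoRedis("jobs", host = "redis.example.com")
    res <- foreach(i = 1:100, .combine = c) %dopar% {
      mean(rnorm(1e6, mean = i))   # CPU-bound toy job
    }
    removeQueue("jobs")

    # on the droplet, one or more workers pull from the same queue:
    # doRedis::redisWorker("jobs", host = "redis.example.com")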

sckott commented 10 years ago

Good idea. Does redis seem especially important at larger file sizes, or do you imagine it being useful at any scale?

sckott commented 10 years ago

@cboettig right, good idea. I was trying to keep it as simple as possible so that nothing has to be installed other than what the user is running.

hadley commented 10 years ago

@sckott I was thinking it would provide a general mechanism for sending code + context. The other advantage of using redis is that it would be extremely simple to have a cluster of multiple machines, where the code gets sent to the first machine that's available.

hadley commented 10 years ago

@sckott BTW unless there are strong reasons otherwise, the first argument to the majority of analogsea functions should be a droplet, to make it easier to chain things together.
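
Something along these lines, i.e. droplet-first functions that pipe together (function names are illustrative, not a committed API):

    library(magrittr)
    droplet_create() %>%
      droplet_upload("sim.R", "/tmp/sim.R") %>%
      droplet_ssh("Rscript /tmp/sim.R") %>%
      droplet_delete()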

sckott commented 10 years ago

@hadley Right, I'll make the first arg a droplet.

Yeah, I'm on board with the redis idea.

hadley commented 10 years ago

After some discussion with @wch, this is how I think it should work:

  1. Create temporary directory on droplet, and scp in .R file
  2. Run docker run -v ... --rm ropensci/r Rscript tmp/file.R - i.e. mount the temp directory into the container, and remove the container when done
  3. scp .Rout and image back.
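
A hedged sketch of those three steps driven from R with system() - the droplet address, paths, and output file name are placeholders, and the image is the ropensci/r one mentioned above:

    ip <- "203.0.113.5"   # placeholder droplet address
    # 1. temp dir on the droplet, plus the script
    system(sprintf("ssh root@%s 'mkdir -p /tmp/job'", ip))
    system(sprintf("scp file.R root@%s:/tmp/job/", ip))
    # 2. run it in a throwaway container with the temp dir mounted
    system(sprintf(
      "ssh root@%s 'docker run --rm -v /tmp/job:/job ropensci/r Rscript /job/file.R'",
      ip))
    # 3. copy the results (e.g. a saved .RData image) back
    system(sprintf("scp root@%s:/tmp/job/out.RData .", ip))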
sckott commented 10 years ago

@hadley That sounds like a good approach. I wonder, is there any reason the user would want to optionally not delete the output though?

wch commented 10 years ago

@hadley What about cases where the output is a file or set of files? Is that the image you were referring to?

cboettig commented 10 years ago

Sounds great. I think the command would be:

docker run --rm -it -v /tmp:/home cboettig/ropensci Rscript ...

That will bind-mount the droplet's /tmp onto the container's /home directory, and the outputs will be available in /tmp after the container exits and removes itself (--rm). (A different mount target like /host could be added to the image, since overwriting /home is generally not advisable, but /home happens to be empty at the moment when that image is run interactively.)

The script could then decide what to do with the contents of /tmp, e.g. save/export them as output or just delete them along with /tmp.

hadley commented 10 years ago

@cboettig are you sure you need -i?

cboettig commented 10 years ago

@hadley good point, you don't need the -i.

Note that the argument to Rscript obviously needs to use the container path (e.g. /home/file.R in this case). Also note this runs as root (the cboettig/ropensci image, which is the same as eddelbuettel/ubuntu-ropensci, doesn't specify a user); perhaps that's not desirable.

hadley commented 10 years ago

@cboettig given that the container is thrown away after use, and the only way it can talk to the outside world is through the bound /tmp/, what are the risks of using root?

cboettig commented 10 years ago

Just thinking that files created in /tmp are owned by root instead of the user. So if we're logged into the droplet as a non-root user, for instance, the rm -r /tmp command would fail on permissions.

Probably not an issue for the DigitalOcean case. I now run R in a container most of the time, so I'm careful to set the user so that running something like devtools::document() doesn't change the ownership of my man pages.
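
If file ownership does become a problem, one option (a sketch; the uid is an assumption and this isn't something analogsea does) is to pass a non-root user to docker's --user flag when running the script:

    system(paste(
      "ssh root@droplet-ip",   # placeholder address
      "'docker run --rm -v /tmp:/home --user 1000:1000",
      "cboettig/ropensci Rscript /home/file.R'"
    ))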

Unless the container is 'privileged', I don't think there's much actual harm it can do with root access (besides, we're talking about a droplet where the user already has root anyway), but I probably shouldn't be giving security advice.


hadley commented 10 years ago

@cboettig it seems ok to me - I think the goal here is to make it easy to spin up another computer for personal work, not to provide a platform for generally running R code (although that does seem like a fairly natural and easy next step)

wch commented 10 years ago

I think it's better not to run as root - that makes it too easy to accidentally overwrite or delete system files, which would (at least) mess up the container. Also, I believe that if you expose your host filesystem to a docker container, the container can modify those files as root.

The official Docker line is to not run as root: https://docs.docker.com/articles/security/

And from http://fabiokung.com/2014/06/11/my-dockercon-2014-talk/:

But, escaping containers is not the only problem with root access. In the Linux Kernel, there are still many parts that haven’t been containerized. Some places in the kernel will hold global locks, or expose global resources. Some examples are the /sys and /proc filesystems. Docker does a great job preventing known problems, but if you are not careful:

    # don't do this in a machine you care about, the host will halt
    docker run --privileged ubuntu sh -c "echo b > /proc/sysrq-trigger"

If you are not careful protecting /sys and /proc (and again, Docker by default does the right thing AFAIK), any container can bring a whole host down, affecting other tenants running on the same box. Even when container managers do everything they can to protect the host, it might not be enough. Root users inside containers can still call syscalls or operations usually only available to root, that will cause the kernel to hold global locks and impact the host. I wouldn’t be surprised if we keep finding more of such parts of the kernel.

cboettig commented 10 years ago

@wch As I mention in https://github.com/eddelbuettel/rocker/issues/12, I'm not sure how best to go about this. Any suggestions?

I do note that users logging in via RStudio Server aren't root (even though the container daemon is still root), and we're not launching with --privileged as I mentioned earlier, so I think this primarily comes up when using -t mode.

hadley commented 10 years ago

@cboettig why not hoist the set-user stuff out of the rstudio dockerfile and up to the r-base file? Then you'd have the option of creating a non-root user, if desired.

cboettig commented 10 years ago

The user-setup stuff in the dockerfile lives in a bash script that runs only when the container is run, and then only if it is run without specifying a command, i.e. when the default CMD directive kicks in (supervisord, which runs both that shell script and RStudio Server at runtime). If a user specifies a command like Rscript - e.g. in the case we're discussing here, docker run -t cboettig/rstudio Rscript, or /bin/bash, or whatever - that script isn't run at all. Most use cases for the r-base image will involve that kind of interactive use, since it's not running a server in the background, so it didn't make sense to give it a default CMD directive.

(Meanwhile, copying the commands from the bash script and pasting them into the r-base Dockerfile would mean that they ran when the image was built, and so couldn't use the runtime values of $USER, $PASSWORD, etc.) Hope that makes sense.


hadley commented 10 years ago

Oh hmmmm. What about creating a default user (with a default password) in the base R dockerfile? Then other containers built on top of that could run code as that user. Otherwise, surely it's a problem for docker to solve, not us?

Certainly it seems like running with a non-root user is going to be a big hassle for little gain (given that we're not exposing the containers to the internet as a whole, just to users authenticated to the droplet).

cboettig commented 10 years ago

@hadley We could do that (either adding the user to sudoers, or switching to USER root at the top of each dockerfile and back again at the bottom), but it becomes a hassle for us and also for anyone who builds off the dockerfile and forgets to do USER root. Like you say, it feels like this is really an issue for docker to deal with. (The only internet-facing part is rstudio, where this is all nicely taken care of with non-root users and logins.)