@cboettig Nice, I like this idea. I've often thought it would be nice to send off jobs piecemeal, and quickly, without having to go through lots of steps.
All those steps seem doable. More later.
working on this...
@cboettig Worked more on this.
Not sure the best way to gather output of code sent over to the droplet. Right now I'm doing save.image() to just save everything in the workspace to a .RData file, then pull that back to the local machine, then load().
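Roughly, the flow is something like this (droplet_upload(), droplet_ssh(), and droplet_download() are placeholder names for whatever the helpers end up being, not settled API):

# Sketch of the current save.image()-based round trip.
# Assume d is a droplet object created earlier.
job <- tempfile(fileext = ".R")
writeLines(c(
  "x <- replicate(1000, mean(rnorm(1e4)))",
  "save.image('/tmp/out.RData')"
), job)
droplet_upload(d, job, "/tmp/job.R")        # push the script to the droplet
droplet_ssh(d, "Rscript /tmp/job.R")        # run it remotely
droplet_download(d, "/tmp/out.RData", "out.RData")  # pull the workspace back
load("out.RData")                           # x is now in the local workspace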
You might take a look at how knitr is doing this wrt caching?
For good performance, I think you'll need to think about creating a droplet and executing code on it as two separate steps. (Otherwise it's going to be really hard to get a performance improvement because you'll be swamped by the cost of spinning up a new droplet.)
In general, it will be really hard to get performance gains out of this - it will be helpful for cpu-bound stuff (like simulations), but for anything with a lot of data the cost will be dominated by data transfer.
@hadley Good points. Yeah, the ideal situation seems like one with no data and simulation-intensive code that doesn't output a lot of data. Two steps is probably a better approach than the current all-in-one step.
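A rough sketch of the two-step split, with made-up function names (the point is just that the droplet is created once and reused):

d <- droplet_create()                 # pay the spin-up cost once
res1 <- droplet_execute(d, {
  replicate(1e4, mean(rnorm(1e3)))    # cpu-bound, little data in or out
})
res2 <- droplet_execute(d, {
  replicate(1e4, sd(runif(1e3)))      # reuse the same droplet for another job
})
droplet_delete(d)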
It might also be worthwhile to set up something like doRedis as a general framework for sending code to the droplet.
Good idea. Does redis seem especially important at larger file sizes, or do you imagine it being useful at any scale?
@cboettig right, good idea. I was trying to keep it as simple as possible so that nothing has to be installed other than what the user is running.
@sckott I was thinking it would provide a general mechanism for sending code + context. The other advantage of using redis is that it would be extremely simple to have a cluster of multiple machines, where the code gets sent to the first machine that's available.
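For example (queue name and host are made up; the droplet side just needs a Redis server plus the doRedis package for each worker):

# Local machine: register the Redis queue as a foreach backend.
library(doRedis)
registerDoRedis("jobs", host = "droplet.ip.address")
res <- foreach(i = 1:100, .combine = c) %dopar% mean(rnorm(1e5))
removeQueue("jobs")

# On each droplet: start a worker that pulls tasks from the same queue.
# redisWorker("jobs", host = "droplet.ip.address")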
@sckott BTW unless there are strong reasons otherwise, the first argument to the majority of analogsea functions should be a droplet, to make it easier to chain things together.
@hadley Right, I'll make the first arg a droplet.
Yeah, I'm on board with the redis idea.
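For instance, with the droplet as the first argument, things chain naturally (function names here are hypothetical, just to illustrate the signature):

library(magrittr)
# Functions that take and return a droplet can be piped end to end.
d <- droplet_create()
d %>%
  droplet_ssh("apt-get update") %>%
  droplet_ssh("apt-get install -y r-base")
res <- droplet_execute(d, { replicate(1e4, mean(rnorm(1e3))) })
droplet_delete(d)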
After some discussion with @wch, this is how I think it should work:
docker run ropensci/r Rscript -e tmp/file.R -v ... --rm
- i.e. mount the temp directory into the container, and delete it when done.

@hadley That sounds like a good approach. I wonder, is there any reason the user would want to optionally not delete the output, though?
@hadley What about cases where the output is a file or set of files? Is that the image you were referring to?
Sounds great. I think the command would be:
docker run --rm -it -v tmp:/home cboettig/ropensci Rscript ...
That will link the contents of tmp to the /home directory, and the outputs will be available in tmp after the container exits and removes itself (--rm). (A different target like /host could be added to the image, since overwriting /home is generally not advisable, but it's empty at the moment when that image is run interactively.) The script could then decide what to do with the stuff in tmp, e.g. save/export it as output, or just delete tmp.
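From the R side, that whole round trip could be driven roughly like this (paths and the direct use of system() are just for illustration):

# Write the script into a host directory, run it in a throwaway container
# with that directory mounted at /home, then read the results back.
job_dir <- tempfile("job")
dir.create(job_dir)
writeLines(
  "saveRDS(replicate(1e3, mean(rnorm(1e4))), '/home/out.rds')",
  file.path(job_dir, "job.R")
)
system(sprintf(
  "docker run --rm -v %s:/home cboettig/ropensci Rscript /home/job.R",
  job_dir
))
res <- readRDS(file.path(job_dir, "out.rds"))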
@cboettig are you sure you need -i?
@hadley good point, you don't need the -i.
Note that the argument to Rscript obviously needs to use the container path (e.g. /home/file.R in this case). Also note this runs as root (the cboettig/ropensci image, which is the same as eddelbuettel/ubuntu-ropensci, doesn't specify a user); perhaps that's not desirable.
@cboettig given that the container is thrown away after use, and the only way it can talk to the outside world is through the bound /tmp/, what are the risks of using root?
Just thinking that files created in tmp are owned by root instead of the user. So if we're logged into the droplet as a non-root user, for instance, the rm -r /tmp command would fail on permissions.
Probably not an issue for the digitalocean case. I now run R as a container most of the time, so am careful to set the user so that running something like devtools::document() doesn't change ownership of my man pages.
Unless the container is 'privileged' I don't think there's much actual harm it can do with root access (besides we're talking a droplet where the user already has root anyway), but I probably shouldn't be giving security advice.
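One way around the ownership issue, assuming nothing in the image requires a specific user, is to pass the host UID/GID at run time:

# Run the container as the invoking user so files written to the mounted
# directory aren't owned by root. Assumes the host has the id utility;
# job_dir is the absolute host path from the sketch above.
uid <- system("id -u", intern = TRUE)
gid <- system("id -g", intern = TRUE)
system(sprintf(
  "docker run --rm -u %s:%s -v %s:/home cboettig/ropensci Rscript /home/job.R",
  uid, gid, job_dir
))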
@cboettig it seems ok to me - I think the goal here is to make it easy to spin up another computer for personal work, not to provide a platform for generally running R code (although that does seem like a fairly natural and easy next step)
I think it's better to not run as root - that makes it too easy to accidentally overwrite/delete system files which would (at least) mess up the container. Also, I believe that if you expose your host filesystem to a docker container, the container can modify those files as root.
The official Docker line is to not run as root: https://docs.docker.com/articles/security/
And from http://fabiokung.com/2014/06/11/my-dockercon-2014-talk/:
But, escaping containers is not the only problem with root access. In the Linux Kernel, there are still many parts that haven’t been containerized. Some places in the kernel will hold global locks, or expose global resources. Some examples are the /sys and /proc filesystems. Docker does a great job preventing known problems, but if you are not careful:
# don't do this in a machine you care about, the host will halt
docker run --privileged ubuntu sh -c "echo b > /proc/sysrq-trigger"
If you are not careful protecting /sys and /proc (and again, Docker by default does the right thing AFAIK), any container can bring a whole host down, affecting other tenants running on the same box. Even when container managers do everything they can to protect the host, it might not be enough. Root users inside containers can still call syscalls or operations usually only available to root, that will cause the kernel to hold global locks and impact the host. I wouldn’t be surprised if we keep finding more of such parts of the kernel.
@wch As I mention in https://github.com/eddelbuettel/rocker/issues/12 I'm not sure how best to go about this. Any suggestions?
I do note that users logging in via RStudio server aren't root (even though the container daemon is still root), and we're not launching with --privileged, as I mentioned earlier, so I think this primarily comes up when using -t mode.
@cboettig why not hoist the set-user stuff out of the rstudio dockerfile and up to the r-base file? Then you have the option of creating a non-root user, if desired.
The user setup stuff in the dockerfile is in a bash script that runs only when the container is run, and then only if it is run without specifying a command at runtime, falling back to the CMD directive (which uses supervisord to run both a shell script and RStudio server at runtime). If a user specifies a command like Rscript, e.g. in the case we're discussing here, docker run -t cboettig/rstudio Rscript, or /bin/bash, or whatever, that script isn't run at all. Most use cases for the r-base image will involve that kind of interactive use, since it's not running a server in the background, so it didn't make sense to have a default CMD directive.

(Meanwhile, copying the commands from the bash script and pasting them into the r-base Dockerfile would mean that they were run when the image was built, and so couldn't use the runtime values of $USER, $PASSWORD, etc.) Hope that makes sense.
Oh hmmmm. What about creating a default user (with default password) in the base R dockerfile. Then other containers built on top of that could run code with that user. Otherwise, surely it's a problem for docker to solve, not us?
Certainly it seems like running with a non-root user is going to be a big hassle, and for little gain (given that we're not exposing the containers to the internet as a whole, just authenticated users to the droplet)
@hadley We could do that (either adding the user to sudoers, or switching back to USER root at the top of each dockerfile and back again at the bottom), but it becomes a hassle for us and also for anyone who builds off the dockerfile and forgets to do USER root. Like you say, it feels like this is really an issue for docker to deal with. (The only internet-facing part is rstudio, where this is all nicely taken care of with non-root users and logins.)
It would be cool if a user could send a slow or more-intensive function call off to be executed on the droplet and have the answer returned to the local R console. Such a function would take a block of code wrapped in { } (like a test_that call). Ideally this function would spawn the droplet at the beginning (installing R etc.) and destroy it at the end too. This would allow a user to send computationally intensive bits to the cloud mid-script and know that the cloud instance wouldn't run any longer than necessary.
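A rough sketch of what such a function might look like (every helper here, droplet_create, droplet_install_r, droplet_execute, droplet_delete, is a hypothetical name, not existing API):

cloud_eval <- function(code) {
  code <- substitute(code)                 # capture the { } block unevaluated
  d <- droplet_create()                    # spawn the droplet
  on.exit(droplet_delete(d), add = TRUE)   # destroy it even if the code errors
  droplet_install_r(d)                     # install R etc.
  droplet_execute(d, code)                 # run the expression remotely, return the value
}

res <- cloud_eval({
  replicate(1e4, mean(rnorm(1e3)))
})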
This would be similar in spirit to the way the segue R package works: http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/