rocker-org / rocker

R configurations for Docker
https://rocker-project.org
GNU General Public License v2.0

Wiki entry on using Rocker in teaching courses #133

Open wahani opened 9 years ago

wahani commented 9 years ago

I would be interested in using the rocker project as a basis for teaching R courses. I know that some are already using it for this purpose and I think it is a great application for the project. However, I do not have the experience to solve this on my own, so a Wiki page explaining the steps would be really helpful. The things I am struggling with are listed below. Most likely they will show that I am new to Docker, so maybe some of the issues can be addressed by providing a link to documentation.

I guess a good starting point is the rstudio or hadleyverse container and it should be clear how a container can be tailored to the specific needs of the class.

What is the actual scenario, do you have one container with rstudio-server and multiple users, or should you use one container for each student?

How do we update a running container, such that I can add new material for exercises on all containers or user profiles?

Should the teacher make backups of the containers or file system during class? Probably we can back up a running container? Is this necessary?

What is the scenario for hosting the container? Do you use the local infrastructure of your university or rely on some cloud service? Is there a recommendation which service to use (for teaching)?

eddelbuettel commented 9 years ago

You may find some of the presentations by @mine-cetinkaya-rundel useful: she is using Docker to teach stats.

mine-cetinkaya-rundel commented 9 years ago

See slide 19 at https://github.com/mine-cetinkaya-rundel/useR-2015/blob/master/r_studio_docker.pdf for a sketch of the setup I'm using. Note that this approach does not use rocker though.

Specific answers below, some of these are from me (I teach a large intro stat course) and some are from @mccahill (Duke OIT person who set up the Docker implementation).

What is the actual scenario, do you have one container with rstudio-server and multiple users, or should you use one container for each student?

One container for each student. This sequesters the students from each other as much as possible, and it means the container for a specific student can be restarted without affecting anyone else. In a couple of cases students found ways to mess up R/RStudio to the point that they could not log in, and the problem could be fixed with a container restart for the single affected user without touching anyone else.

How do we update a running container, such that I can add new material for exercises on all containers or user profiles?

So that students' home directories persist through container restarts, we map the student home directories in the Docker containers to external volumes when we run Docker. This means that we can treat the Docker container filesystem as ephemeral and restart containers without losing users' work. Rather than updating a running container, we build a new version of the container image and then restart the users' containers; this is how we would add new R libraries, apply patches, etc.

So user containers are started from https://github.com/mccahill/docker-rstudio something like this:

docker run -d -e USERPASS=badpassword -v /external/directory/for/user:/home/guest -p 0.0.0.0:8787:8787 -i -t r-studio

It's not a good idea to patch running containers. The containers are ephemeral. When we want to change something, we build a new version of the container image and restart the containers from it.
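As a rough illustration of the rebuild-and-restart approach described above, here is a minimal sketch that only prints the Docker commands for replacing one student's container. The image name default, the `HOME_ROOT` directory, the `rstudio-<user>` container naming scheme, and the `restart_commands` helper are all assumptions made for this example; they are not taken from the docker-rstudio repository.

```shell
#!/bin/sh
# Dry-run helper: print the commands that would replace one student's
# container with a fresh one started from the current image.
# IMAGE, HOME_ROOT and the container name pattern are hypothetical.
IMAGE="${IMAGE:-r-studio}"
HOME_ROOT="${HOME_ROOT:-/external/homes}"

restart_commands() {
    user="$1"   # instance name, e.g. user001
    port="$2"   # host port for this student, e.g. 30001
    pass="$3"   # password passed to the container via USERPASS
    # Remove the old container; the home directory lives on the host, so it survives.
    echo "docker rm -f rstudio-${user}"
    # Start a new container with the same home directory mounted.
    echo "docker run -d --name rstudio-${user} -e USERPASS=${pass} -v ${HOME_ROOT}/${user}:/home/guest -p 0.0.0.0:${port}:8787 ${IMAGE}"
}

# Rebuild the image first (for example: docker build -t r-studio .),
# then review the printed commands before running them.
restart_commands user001 30001 changeme
```

Because the function only prints, the output can be inspected first and piped to `sh` once it looks right. Looping over all students is then a matter of calling it once per user and port.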

Should the teacher make backups of the containers or file system during class? Probably we can back up a running container? Is this necessary?

You should back up the external volumes that hold the user home directories. Put user home directories in a volume external to Docker so you can update the Docker container and restart it without worrying about losing user data that should persist.

Trying to back up a running container is not the approach we took. Instead, we map the parts of the filesystem that we want to persist to external volumes, and treat the rest of the container as disposable.

We back up the external-to-the-container filesystem that holds the users' home directories that are volume-mapped into the container - this backup is happening outside of and independently of the Docker container.

We don't bother backing up the individual instances of the containers themselves because they are recreated whenever we restart the container.
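A host-side backup along these lines could look like the sketch below. It assumes that all student home directories sit under one host directory and that a timestamped tar archive is an acceptable backup format; the `backup_homes` helper and the paths in the usage comment are hypothetical, and the thread does not say which backup tool is actually used.

```shell
#!/bin/sh
# Archive every student home directory found under a host directory.
# Runs on the Docker host, independently of any container.
backup_homes() {
    src="$1"    # host directory holding one subdirectory per student
    dest="$2"   # directory that receives the archive (keep it outside src)
    stamp="$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || return 1
    archive="${dest}/homes-${stamp}.tar.gz"
    # -C switches into src so the archive contains relative paths only.
    tar -czf "$archive" -C "$src" . || return 1
    # Print the archive path so callers can log or copy it.
    echo "$archive"
}

# Example call (hypothetical paths):
#   backup_homes /external/homes /backups/rstudio
```

Since the containers never hold data that needs to persist, nothing here touches Docker at all.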

What is the scenario for hosting the container? Do you use the local infrastructure of your university or rely on some cloud service? Is there a recommendation which service to use (for teaching)?

We started out using the local infrastructure, but I believe some of this now runs on Google compute.

Basics of the implementation are at https://github.com/mccahill/docker-rstudio, but there will be some updates to this repo soon.

eddelbuettel commented 9 years ago

Just ... wow. Thanks so much for this, Mine.

wahani commented 9 years ago

Thanks so much for your answers! The docker-rstudio repo looks like a promising starting point!

eddelbuettel commented 9 years ago

For completeness, we (here at Rocker) also have a container with RStudio and the Wiki has documentation on its use.

Either one should help you. Let us know if you have suggestions for improving documentation or code.

wahani commented 9 years ago

Thanks, I certainly will. It will take some time, but maybe I will just report back my findings. Maybe the wiki is the appropriate place for that.

eddelbuettel commented 9 years ago

We could just continue here in this thread until we have some consensus on what should go to the wiki.

mccahill commented 9 years ago

I updated the readme on the https://github.com/mccahill/docker-rstudio repository to describe how we run RStudio instances for several hundred users each semester.

The basic idea is to authenticate users at a web site where we maintain a mapping of their Duke netID to an RStudio instance (i.e. netID 'jane' maps to RStudio user001 and user001's RStudio instance runs at port 30001, netID 'joe' maps to user002 and port 30002, etc.). This makes it easy to migrate users to different infrastructure by updating the user mapping to point them to different servers/ports.
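To make the mapping idea concrete, here is a small sketch of a lookup from netID to instance URL. The plain-text file format (one `netid instance port` triple per line), the `instance_url` helper, and the host name in the usage comment are all assumptions for illustration; the real site presumably keeps the mapping in its own data store.

```shell
#!/bin/sh
# Look up a netID in a mapping file and print the URL of that user's instance.
# Assumed file format: one "netid instance port" triple per line, e.g.
#   jane user001 30001
#   joe  user002 30002
instance_url() {
    netid="$1"      # the login name to look up
    map_file="$2"   # path to the mapping file
    host="$3"       # server currently hosting the instances
    awk -v id="$netid" -v host="$host" '
        $1 == id { printf "http://%s:%s/\n", host, $3; found = 1; exit }
        END { if (!found) exit 1 }   # non-zero status for unknown netIDs
    ' "$map_file"
}

# Example call (hypothetical host and file):
#   instance_url jane /etc/rstudio-map.txt rstudio.example.edu
```

Moving a user to another server or port then only requires editing that user's line in the mapping, which is the migration benefit described above.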

The site where users log in with their Duke netID sends them to their personal Docker instance by constructing a URL that gets them into the RStudio instance. This is slightly tricky because RStudio's login web page has some JavaScript that does a little dance to get a public key from the server and uses that public key to encrypt the user's login info, which is then sent to the server to authenticate them and start their session. If you are going to set this up, let me know and I can go into more detail about how we handled this.

One other bit of complexity came from wanting all users' sessions to run over HTTPS rather than HTTP. The free RStudio server doesn't support this directly, so we are running an nginx server to handle the SSL connections, and using docker-gen to dynamically update the nginx config file as individual Docker RStudio instances are started/stopped. The docker-gen template we are using is documented in a fork of docker-gen here: https://github.com/mccahill/docker-gen/tree/duke
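For readers who want a feel for what the generated nginx configuration might contain, the sketch below prints one `location` block that proxies a per-user path to that user's RStudio port. This is a simplified, static stand-in and not the docker-gen template from the fork: the path-per-user URL layout and the `nginx_location` helper are assumptions. The block would sit inside a `server` section that listens on 443 with SSL enabled and its own certificate settings.

```shell
#!/bin/sh
# Print an nginx location block that forwards /<user>/ to the RStudio
# instance listening on the given local port.
# Hypothetical helper; the setup in this thread generates its config with docker-gen.
nginx_location() {
    user="$1"   # instance name, e.g. user001
    port="$2"   # host port of that instance, e.g. 30001
    # Unquoted heredoc: ${user} and ${port} expand, \$ stays a literal $ for nginx.
    cat <<EOF
location /${user}/ {
    rewrite ^/${user}/(.*)\$ /\$1 break;
    proxy_pass http://127.0.0.1:${port};
    proxy_redirect http://127.0.0.1:${port}/ \$scheme://\$host/${user}/;
    proxy_http_version 1.1;
    proxy_set_header Upgrade \$http_upgrade;
    proxy_set_header Connection "upgrade";
}
EOF
}

nginx_location user001 30001
```

With nginx terminating SSL in front of the containers, the RStudio instances themselves can keep speaking plain HTTP on their local ports.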