EC4Docker is a simple Elastic Cluster whose nodes are containers. Features of the cluster: a front-end that can be accessed by SSH, and internal working nodes that are powered on or off according to the needs (if a node is not used for a while, it is powered off, and it is powered on again when it is needed).
EC4Docker may seem of limited use because it is currently deployed on a single Docker host, but consider its integration with Docker Swarm and you'll have an Elastic Cluster that is deployed over a multi-node infrastructure.
First, you need to choose the cluster manager middleware. Torque and SLURM are currently available, but you can create your own Dockerfiles according to your specific middleware.
Once selected, you need to build the front-end and working node base images by issuing the following commands:
docker build -f frontend/Dockerfile.clues -t ec4docker:frontend ./frontend/
docker build -f wn/Dockerfile -t ec4docker:wn wn/
Then you need to create the images that correspond to the middleware:
For the case of Torque, you can use the following commands:
docker build -f frontend/Dockerfile.torque -t ec4dtorque:frontend ./frontend/
docker build -f wn/Dockerfile.torque -t ec4dtorque:wn wn/
For the case of SLURM, you can use the following commands:
docker build -f frontend/Dockerfile.slurm -t ec4dslurm:frontend ./frontend/
docker build -f wn/Dockerfile.slurm -t ec4dslurm:wn wn/
The images will be built and stored in your local Docker image cache.
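You can check that the images are available locally with the standard Docker CLI (the grep pattern simply matches the image names used above):
$ docker images | grep ec4d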
Alternatively, you can build the non-elastic version by not installing CLUES in the front-end. To do so, you can create the base images by issuing the following commands:
docker build -f frontend/Dockerfile.static -t ec4docker:frontend ./frontend/
docker build -f wn/Dockerfile -t ec4docker:wn wn/
In this case you need to power the nodes on or off by hand, using the scripts provided in the folder /opt/ec4docker (see the example near the end of this section).
NOTE: you are advised to modify the Dockerfiles in order to include your libraries, applications, etc. to customize your cluster. Another option is to build the provided Dockerfiles and then create your own images that start from the created ones (you can check the FROM clause in the Dockerfile).
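As an illustration, a minimal sketch of a customized working node Dockerfile that starts from the Torque image built above (assuming the base images are Ubuntu-based; the package installed is just an example, not required by EC4Docker):

FROM ec4dtorque:wn
# Add your own libraries and applications on top of the base working node image
RUN apt-get update && apt-get install -y --no-install-recommends python3 && rm -rf /var/lib/apt/lists/*

You would build it with a command analogous to the previous ones, e.g. docker build -f Dockerfile.custom -t mycluster:wn .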
You should create a config file (ec4docker.config) to set the name of your cluster (this name will be used for the front-end node in Docker), the base name for the working nodes (they will be named basename1, basename2, etc.) and the maximum number of computing nodes. You must also set the names of the Docker images according to the previous step.
Two examples are provided:
The file ec4docker-torque.config for the case of Torque:
EC4DOCK_SERVERNAME=ec4docker
EC4DOCK_MAXNODES=4
EC4DOCK_FRONTEND_IMAGENAME=ec4dtorque:frontend
EC4DOCK_WN_IMAGENAME=ec4dtorque:wn
EC4DOCK_NODEBASENAME=ec4dockernode
And the file ec4docker-slurm.config for the case of SLURM:
EC4DOCK_SERVERNAME=ec4docker
EC4DOCK_MAXNODES=4
EC4DOCK_FRONTEND_IMAGENAME=ec4dslurm:frontend
EC4DOCK_WN_IMAGENAME=ec4dslurm:wn
EC4DOCK_NODEBASENAME=ec4dockernode
NOTE: With these files the cluster will be named ec4docker and the maximum number of working nodes is set to 4. You are advised to change the name of your front-end and the number of working nodes that will be available.
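For instance, a customized Torque configuration could look like the following (the cluster name, node base name and node count are illustrative):

EC4DOCK_SERVERNAME=mycluster
EC4DOCK_MAXNODES=8
EC4DOCK_FRONTEND_IMAGENAME=ec4dtorque:frontend
EC4DOCK_WN_IMAGENAME=ec4dtorque:wn
EC4DOCK_NODEBASENAME=myclusternode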
You can use the script setup-cluster to create the front-end of the cluster from the corresponding Docker image. If the cluster already exists, this script will ask you whether to kill it.
IMPORTANT: In order to be able to use the NFS shared filesystem, you MUST enable the nfsd module in the kernel of the Docker hosts that run the containers:
$ modprobe nfsd
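You can verify that the module has been loaded with standard Linux tooling:
$ lsmod | grep nfsd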
In order to create your cluster, as defined in the ec4docker-torque.config file, you can issue the following command:
$ ./ec4docker -ct -f ec4docker-torque.config
NOTE: The settings of the cluster are those set in the ec4docker-torque.config file. Take note of those settings because you will need them in order to access the cluster, in particular the name of the cluster, which is set in EC4DOCK_SERVERNAME.
WARNING: The cluster is created using a Docker-alongside-Docker approach. That means that the front-end will issue docker calls to create and destroy the Docker containers that will serve as working nodes of the cluster, but these containers will be created on the Docker host that started the front-end. In order to use this approach, the Docker communication socket and the docker binary from the host are shared with the container.
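After creating the cluster you can check, from the host, that the front-end container is up, using the standard Docker CLI (the container name is the value of EC4DOCK_SERVERNAME):
$ docker ps -f name=ec4docker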
Once the front-end has been created you can enter the front-end container and su to the ubuntu user (which is the only user created in the cluster). An example command line is provided next (the name of the container depends on your configuration, i.e. the ec4docker.config file):
$ docker exec -it ec4docker /bin/bash
root@ec4docker:/$ su - ubuntu
Alternatively, you can SSH into the front-end. The SSH port is exposed when the front-end is created, so you can find out the port on which the front-end listens by using the docker port command:
$ docker port ec4docker
22/tcp -> 0.0.0.0:32770
In this example, you can SSH to ubuntu@localhost at port 32770 with a command like the following (the default password is "ubuntu", and it is set in the Dockerfile):
$ ssh -p 32770 ubuntu@localhost
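If you prefer not to look up the port by hand, a possible one-liner using the same docker port command is the following (head -n1 guards against Docker versions that also print an IPv6 mapping):
$ ssh -p $(docker port ec4docker 22 | head -n1 | cut -d: -f2) ubuntu@localhost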
Now you can submit jobs to the queue, and CLUES will intercept the call and power on working nodes in the cluster as needed.
An example follows:
$ echo "hostname && sleep 10" | qsub
1.ec4docker
$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1.ec4docker               STDIN            ubuntu                 0 R batch
$ ls -l
total 4
-rw------- 1 ubuntu ubuntu 0 Feb 12 11:15 STDIN.e1
-rw------- 1 ubuntu ubuntu 13 Feb 12 11:15 STDIN.o1
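If you built the SLURM flavour of the cluster instead, an analogous test can be run with the standard SLURM client commands (job IDs and output file names will differ):
$ sbatch --wrap "hostname && sleep 10"
$ squeue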
NOTE: For the non-elastic version, you can power on some nodes by hand from inside the front-end, by issuing commands like the following:
$ /opt/ec4docker/poweron ec4dockernode1
$ /opt/ec4docker/poweron ec4dockernode2
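Assuming that a matching poweroff script is provided alongside poweron in /opt/ec4docker (this is an assumption; check the contents of that folder), powering a node off again would look like:
$ /opt/ec4docker/poweroff ec4dockernode1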
If any of the Docker containers fails (for any reason), please check the output of the command docker logs <container>.
Some common issues are: