This is a VCC built to run a Torque batch scheduling cluster.
It also includes C and Fortran compilers, MPICH 3.2, the MAUI scheduler and pdsh as a bonus.
The default SSH port is changed to 2222 to avoid conflicting with any other SSH instance.
If you want to build the image from scratch, just use the regular Docker process. This is not usually necessary for testing unless you want to customise the build.
This image is based on `vcc-base-centos` and is therefore built on a CentOS base. It should not be too difficult to port to another distribution, since the Dockerfile builds all components from source.
docker build -t hpchud/vcc-torque .
The VCC tool is shipped inside each image and makes the process of starting the containers easier. A description of all available options can be displayed as follows:
docker run --rm -it hpchud/vcc-torque --help
Information about the image can be obtained by running
docker run --rm -it hpchud/vcc-torque --info
You need to have a discovery service running on one of your nodes
docker run -d -p 2379:2379 --restart=always hpchud/vcc-discovery
Start a head node first
docker run -d --net=host --privileged -v /cluster:/cluster \
hpchud/vcc-torque \
--cluster=test \
--storage-host=STORAGE_HOST_IP \
--storage-port=2379 \
--service=headnode
The `/cluster` folder will be shared to the worker nodes, so it's a good idea to persist it on the head node via a Docker volume (the `-v` argument).
An ID for this container will be printed to the screen.
Then, on another host, start a worker node
docker run -d --net=host --privileged \
hpchud/vcc-torque \
--cluster=test \
--storage-host=STORAGE_HOST_IP \
--storage-port=2379 \
--service=workernode
And that's it! You can now enter the head node container using SSH.
ssh -i batchuser.id_rsa batchuser@headnode.ip -p 2222
or just use Docker to execute a shell:
docker exec -it HEADNODE_CID /bin/bash
Try running `pbsnodes` to see the cluster, and SSH from the head node to the worker node using its name!
[batchuser@dbd37de43e80 /]$ pbsnodes
vnode_a4aa99e7aeb6
     state = free
     power_state = Running
     np = 8
     ntype = cluster
...
[batchuser@dbd37de43e80 /]$ ssh vnode_a4aa99e7aeb6
Warning: Permanently added '[vnode_a4aa99e7aeb6]:2222,[10.10.10.3]:2222' (ECDSA) to the list of known hosts.
[batchuser@a4aa99e7aeb6 ~]$
To add more worker nodes, simply repeat the second `docker run`. To start up the cluster on a single machine, just omit the `--net=host` option (in this case you can use `docker exec` to log in to the head node).
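As a sketch, a complete single-machine test cluster might be brought up like this (`STORAGE_HOST_IP` stands for your discovery service host, as above):

```shell
# head node without --net=host; capture the container ID for docker exec
HEAD=$(docker run -d --privileged -v /cluster:/cluster \
    hpchud/vcc-torque \
    --cluster=test \
    --storage-host=STORAGE_HOST_IP \
    --storage-port=2379 \
    --service=headnode)

# a worker node on the same machine
docker run -d --privileged \
    hpchud/vcc-torque \
    --cluster=test \
    --storage-host=STORAGE_HOST_IP \
    --storage-port=2379 \
    --service=workernode

# log in via docker exec, since the SSH port is not reachable from outside
docker exec -it "$HEAD" /bin/bash
```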
To use this container for real, you should replace the keys.
This image is a multi-role image: the same Docker image provides both the `headnode` and `workernode` roles in the context of a Torque cluster. If a role is not specified, it defaults to `workernode`. The default role is configured in `init.yml`.
The services defined in `services.yml` will be launched for both roles. The following services are always required and should not be removed.
This image is configured so that an SSH daemon will be started for both roles.
Each role also has an associated `services-*.yml` file, containing services that are processed only for that role.
These service files are YAML documents. A service block looks like this:
pbs_server:
    type: daemon
    exec: /usr/sbin/pbs_server -D
    restart_limit: -1
    requires: trqauthd
In this example, the service is the PBS server daemon. A `type` of `daemon` will cause the service manager to write a pid file and log files to the appropriate locations, usually `/run` and `/var/log` respectively.
`restart_limit: -1` instructs the service manager to restart this service every time it is killed. This is required because the service must be restarted whenever the number of hosts in the cluster changes.
Finally, and most importantly, the PBS server states that it requires the `trqauthd` service to be started before it can start. This pattern can be seen in the other service blocks to ensure execution occurs in the correct order.
If a service block does not define any `requires`, it will be triggered to start as soon as the service manager runs.
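For example, a dependency-free service block might look like the following (a hypothetical `sshd` entry for illustration; the actual blocks in `services.yml` may differ):

```yaml
sshd:
    type: daemon
    exec: /usr/sbin/sshd -D -p 2222
    restart_limit: -1
    # no "requires" key, so this starts as soon as the service manager runs
```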
The `pbsnodes.sh` cluster hook is executed every time a host (or running container) is added to or removed from this VCC instance. It regenerates the Torque server's node file and then instructs the daemon to reload.
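A minimal sketch of what such a cluster hook might do, under stated assumptions: the member list, `np` count, and node file path below are illustrative, not taken from the repository (the real hook works against Torque's `server_priv/nodes` file).

```shell
#!/bin/sh
# Hypothetical sketch of a cluster hook like pbsnodes.sh.
NODEFILE=./nodes                                  # real path: Torque's server_priv/nodes
members="vnode_a4aa99e7aeb6 vnode_b1c2d3e4f5a6"   # supplied by the VCC at runtime
: > "$NODEFILE"                                   # truncate the node file
for host in $members; do
    echo "$host np=8" >> "$NODEFILE"              # one line per cluster member
done
# The real hook would then tell pbs_server to reload its node list,
# e.g. by restarting the daemon via the service manager.
```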
The `headnode.sh` service hook is run when the provider of a service within the VCC changes, in this case the `headnode` cluster service. It configures the Torque MOM execution node client.
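A sketch of what such a service hook might do, assuming the VCC passes the new provider's address in an environment variable (the variable name and file path here are illustrative, not taken from the repository):

```shell
#!/bin/sh
# Hypothetical sketch of a service hook like headnode.sh: point the Torque MOM
# at the current headnode provider. HEADNODE_ADDR and MOMCONFIG are assumed names;
# the real MOM config lives under Torque's mom_priv/config.
MOMCONFIG=${MOMCONFIG:-./mom_config}
HEADNODE_ADDR=${HEADNODE_ADDR:-10.10.10.2}
echo "\$pbsserver $HEADNODE_ADDR" > "$MOMCONFIG"
# The real hook would then restart pbs_mom so it reconnects to the new server.
```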
A user called `batchuser` is created in the Dockerfile.
For the `batchuser` account, an SSH key is added to the image and to the `authorized_keys` file. The SSH service is configured to run on port `2222` to avoid overlapping with any SSH instance on the host system.
SSH into it using the private key in this repository, but be sure to generate new keys if you use this image in real life!
ssh -i batchuser.id_rsa batchuser@headnode.ip -p 2222
No shared filesystem is configured between the containers using this image, as the container must be run in privileged mode in order to perform a mount.
A shared filesystem may not be required at all if you customise the container image to include all the files you need.
Alternatively, if your underlying hosts are set up with a shared filesystem mounted, you can pass this through to the container as a volume.
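For example, if your hosts already mount a shared filesystem at `/shared` (an assumed path), it could be passed through when starting each node:

```shell
docker run -d --net=host --privileged \
    -v /shared:/shared \
    hpchud/vcc-torque \
    --cluster=test \
    --storage-host=STORAGE_HOST_IP \
    --storage-port=2379 \
    --service=workernode
```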
In order to customise the cluster image, for example to change the SSH port or add/remove users and packages, there are two options: clone this repository and make your changes, or create a new Dockerfile that extends this image.
Which option you choose depends on what you would like to do. If you want to contribute a change, please fork the repository and create a pull request!
If you want to add additional packages, the best way might be to create a new Dockerfile that is based on this image.
However, if for example you want to remove the `batchuser` account, your only choice is to make a copy of this repository and change it.
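For instance, a minimal derived Dockerfile might look like this (the added packages are illustrative only):

```dockerfile
FROM hpchud/vcc-torque

# extra tools available to cluster jobs; adjust to taste
RUN yum install -y nano htop && yum clean all
```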