
Snakepit

Snakepit is a machine learning job scheduler.

The Snakepit service has not yet gone through an in-depth security audit. Therefore, you should not offer unknown/random users access to your service.

Getting Started

The following instructions are intended for administrative Snakepit users who want to configure and run their own Snakepit cluster.

If you are a Snakepit end-user and just want to know how to run jobs on an existing Snakepit cluster, you should follow the snakepit-client user guide.

Big picture

Overview - three jobs on a Snakepit cluster

Prerequisites

Configuring LXD

Before Snakepit can be installed, LXD has to be configured on all involved machines (if not already done). On each machine of your cluster, run:

$ sudo lxd init

During the following questionnaire you'll be asked if you want to create a new storage pool. It is highly recommended to create a copy-on-write pool based on zfs or btrfs. Each machine's storage pool should have at least 10 GB of space. The following question should be answered with yes:

Would you like LXD to be available over the network (yes/no) [default=no]? yes

You'll be asked to set a password which will be required later during Snakepit's setup.
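
If you skipped setting the password during lxd init, it can also be set afterwards (the password value below is just a placeholder):

$ lxc config set core.trust_password "choose-a-secret"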

After Snakepit is configured and/or the machine has been added, you should unset it again:

$ lxc config unset core.trust_password

Installing

All the following steps are only to be done on the head node. First you have to create a Snakepit user:

$ sudo adduser snakepit
[...]

Next, clone the Snakepit project. From within Snakepit's project root, you can now call:

/path/to/snakepit/clone$ sudo bin/prepare-directories.sh snakepit /snakepit

This will create the required data directory structure in /snakepit owned by user snakepit. This directory is from now on called "data-root". You could also pick a different path.

Now it's time to prepare the snakepit service:

/path/to/snakepit/clone$ sudo bin/prepare-service.sh /snakepit /path/to/snakepit/clone

This will create the snakepit LXD container, bind the data-root to its internal directory /data, and bind /path/to/snakepit/clone to its internal directory /code. If you omit /path/to/snakepit/clone, the script will clone the project a second time inside the container into /code. The script also automatically maps the outer directory owner of the data-root (in our case, user snakepit) to the container's inner root user.
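
To inspect what the script has set up, you can look at the container's configuration and attached devices (a sketch; the exact device names depend on the script version):

$ sudo lxc config show snakepit            # container settings, including the UID/GID mapping
$ sudo lxc config device show snakepit     # attached devices, e.g. the /data and /code bind mounts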

If you get a line with "Problem accessing https://...:8443", you have to figure out the URL of the local LXD service and run the provided command. The bin/prepare-service.sh script looks for the lxdbr0 bridge network adapter (a default one in LXD). If it does not exist, the script will create it and attach it to the snakepit service container as eth0. The following commands can help you figure out the service address:
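
(A sketch, assuming the default lxdbr0 bridge; adjust names to your setup.)

$ lxc config get core.https_address    # address/port the local LXD API is listening on
$ ip -4 addr show lxdbr0               # host address on the LXD bridge, reachable from the container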

Next step is to create the worker and daemon LXD container images:

/path/to/snakepit/clone$ sudo bin/prepare-images.sh

This is a highly automated process and should not require any interaction.

After this you have the chance to install any required software into the worker image:

/path/to/snakepit/clone$ sudo lxc exec snakepit-worker -- bash
root@snakepit-worker:/root# apt install some-requirement
[...]
root@snakepit-worker:/root# exit

Before the images can be used, you have to publish them:

/path/to/snakepit/clone$ sudo bin/publish-images.sh

Configuring NFS

NFS is used for job data access. sshFS was used previously, but new workloads benefit from the faster disk access NFS allows.

The steps below assume the following internal networking layout; adjust accordingly if yours differs.

head node is at 192.168.1.1
worker nodes are at 192.168.2.1, 192.168.3.1, etc

Configure NFS on the head node

On the head node, install the nfs-server package.

$ sudo apt install nfs-kernel-server

As root, add the following line to the /etc/exports file.

/snakepit       192.168.0.0/16(rw,no_root_squash,no_subtree_check)

Then restart with systemctl restart nfs-server. Verify exports are working with exportfs.
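
For example (output abbreviated; the exact option list shown by exportfs may differ):

$ sudo systemctl restart nfs-server
$ sudo exportfs -v
/snakepit       192.168.0.0/16(rw,wdelay,no_root_squash,no_subtree_check,...)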

Configure NFS on the worker nodes

The steps below need to be done on each worker node.

Install the nfs client package.

$ sudo apt install nfs-common

Determine the UID and GID of the snakepit user on the head node.

# on the head node

# from the system
$ id snakepit
uid=1777(snakepit) gid=1777(snakepit) groups=1777(snakepit),27(sudo),110(lxd)

# from snakepit config
$ lxc exec snakepit -- cat /etc/snakepit/snakepit.conf | grep mountUid
mountUid: "1777"

Create a snakepit user with the same UID and GID as on the head node.

NFS won't work if the UID is not the same.

$ sudo addgroup --gid 1777 snakepit
$ sudo adduser --uid 1777 --gid 1777 --disabled-password --gecos '' snakepit

Create the mount point.

$ sudo mkdir /mnt/snakepit

Edit /etc/fstab as root. Add the following line.

192.168.1.1:/snakepit   /mnt/snakepit   nfs   nosuid,hard,tcp,bg,noatime 0 0

Mount and verify that it's working.

$ sudo mount /mnt/snakepit
$ ls -la /mnt/snakepit
# there should be files owned by snakepit:snakepit

Access to Snakepit service

The snakepit service itself only provides unencrypted HTTP access. Therefore it is highly recommended to run snakepit behind a front-end web server with HTTPS configured. The front-end server has to forward requests to port 80 of the eth0 address of the snakepit service container (sudo lxc exec snakepit -- ip addr). You can check connectivity with:

$ curl http://<snakepit-service-address>/hello
Here I am
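
As an illustration, a minimal reverse-proxy configuration could look like the following. This assumes nginx; the hostname, certificate paths and the internal address are placeholders, not values defined by Snakepit.

server {
    listen 443 ssl;
    server_name snakepit.example.org;                      # placeholder public hostname

    ssl_certificate     /etc/ssl/snakepit/fullchain.pem;   # placeholder certificate paths
    ssl_certificate_key /etc/ssl/snakepit/privkey.pem;

    location / {
        # forward to port 80 on the eth0 address of the snakepit container
        proxy_pass http://<snakepit-service-address>;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}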

For clients to be able to connect to the service, they need access to a so-called .pitconnect.txt file. Its first line has to be the (outer) service URL without a trailing slash. If you use a self-signed HTTPS certificate for your front-end server, you can add the certificate content below that first line in the .pitconnect.txt file. The .pitconnect.txt file is considered public; in the case of a self-signed certificate, it should be distributed to users through a separate channel (like email). The snakepit client will only accept valid certificates or the one provided through the .pitconnect.txt file.
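
A .pitconnect.txt file for a front-end with a self-signed certificate could therefore look like this (hypothetical URL, certificate truncated):

https://snakepit.example.org
-----BEGIN CERTIFICATE-----
MIID...
-----END CERTIFICATE-----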

First time use

For the following steps you have to first install the snakepit client.

Within a directory that contains the .pitconnect.txt file (from the last step), you can now test your configuration end-to-end:

$ pit status
No user info found. Seems like a new user or first time login from this machine.
Please enter an existing or new username: tilman
Found no user of that name.
Do you want to register this usename (yes|no)? yes
Full name: Tilman Kamp
E-Mail address: ...
New password: ************
Reinput a same one to confirm it: ************
   JOB   S SINCE        UC% UM% USER       TITLE                RESOURCE 

As you are the first user, Snakepit automatically granted you admin rights:

$ pit show me
Username:         tilman
Full name:        Tilman Kamp
E-Mail address:   ...
Is administrator: yes

Adding nodes

Before one can run jobs on a worker node, the node has to be added to the snakepit service:

$ pit add node:n0 endpoint=https://...:8443
LXD endpoint password: **********

Here we gave the node the short name "n0" and its LXD API URL as the endpoint. The password is the one that was specified during the LXD configuration of the node. Once the node has been added successfully, this password should be unset again (see the LXD configuration section).

After the node has been added, you should take a look at the node's GPUs (also called resources):

$ pit show node:n0
Node name: n0
State:     ONLINE
Resources: 
  0: "GeForce GTX 1070" (cuda 0)
  1: "GeForce GTX 1070" (cuda 1)

Time to define a model name alias:

$ pit add alias:gtx1070 name="GeForce GTX 1070"
$ pit show node:n0
Node name: n0
State:     ONLINE
Resources: 
  0: "GeForce GTX 1070" aka "gtx1070" (cuda 0)
  1: "GeForce GTX 1070" aka "gtx1070" (cuda 1)

Time to run a first test job:

$ pit run "First light" [2:gtx1070] -d 'cat /proc/driver/nvidia/gpus/**/*' -l
Job number: 190
Remote:     origin <https://github.com/...>
Hash:       ...
Diff LoC:   0
Resources:  "[2:gtx1070]"

[2018-12-14 17:04:58] [daemon] Pit daemon started
[2018-12-14 17:05:01] [worker 0] Worker 0 started
[2018-12-14 17:05:01] [worker 0] Model:          GeForce GTX 1070
[2018-12-14 17:05:01] [worker 0] IRQ:            139
[2018-12-14 17:05:01] [worker 0] GPU UUID:   ...
[2018-12-14 17:05:01] [worker 0] Video BIOS:     86.04.26.00.80
[2018-12-14 17:05:01] [worker 0] Bus Type:   PCIe
[2018-12-14 17:05:01] [worker 0] DMA Size:   47 bits
[2018-12-14 17:05:01] [worker 0] DMA Mask:   0x7fffffffffff
[2018-12-14 17:05:01] [worker 0] Bus Location:   0000:01:00.0
[2018-12-14 17:05:01] [worker 0] Device Minor:   0
[2018-12-14 17:05:01] [worker 0] Blacklisted:    No
[2018-12-14 17:05:01] [worker 0] Binary: ""
[2018-12-14 17:05:01] [worker 0] Model:          GeForce GTX 1070
[2018-12-14 17:05:01] [worker 0] IRQ:            142
[2018-12-14 17:05:01] [worker 0] GPU UUID:   ...
[2018-12-14 17:05:01] [worker 0] Video BIOS:     86.04.26.00.80
[2018-12-14 17:05:01] [worker 0] Bus Type:   PCIe
[2018-12-14 17:05:01] [worker 0] DMA Size:   47 bits
[2018-12-14 17:05:01] [worker 0] DMA Mask:   0x7fffffffffff
[2018-12-14 17:05:01] [worker 0] Bus Location:   0000:02:00.0
[2018-12-14 17:05:01] [worker 0] Device Minor:   1
[2018-12-14 17:05:01] [worker 0] Blacklisted:    No
[2018-12-14 17:05:01] [worker 0] Binary: ""
[2018-12-14 17:05:01] [worker 0] Worker 0 ended with exit code 0
[2018-12-14 17:05:01] [daemon] Worker 0 requested stop. Stopping pit...

Et voilà - you have your first Snakepit cluster. For a further understanding of jobs and their runtime environment, refer to the snakepit-client user guide.

Configuration

The configuration of the snakepit service is read from a YAML file at /etc/snakepit/snakepit.conf inside the snakepit container. You can edit it through vim:

$ sudo lxc exec snakepit -- vim /etc/snakepit/snakepit.conf
$ sudo lxc exec snakepit -- systemctl restart snakepit

Possible configuration values are:
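
The full set of options is not reproduced here. As an illustration, a fragment of the file containing the two values referenced elsewhere in this document (mountUid and logLevel) could look like this:

# /etc/snakepit/snakepit.conf (fragment)
mountUid: "1777"     # UID the data-root is mapped to (see the NFS section)
logLevel: 0          # log level used in the Troubleshooting section below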

Managing data

There are four different data domains in Snakepit. All of them are represented by certain sub-directories within the data-root directory. Jobs have the same read/write rights as their owning users.

<data-root>/cache/ contains all cached git clones.

<data-root>/db.json is the database of the snakepit service.

Troubleshooting

The snakepit service is running as a regular systemd service (named "snakepit") inside the snakepit container. So you can control it through systemctl and monitor it through journalctl.
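
For example:

$ sudo lxc exec snakepit -- systemctl status snakepit
$ sudo lxc exec snakepit -- journalctl -u snakepit -f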

In case of a tough problem you can also stop the systemd service and run snakepit manually:

$ sudo lxc exec snakepit -- bash
root@snakepit:~# systemctl stop snakepit
root@snakepit:~# cd /code
root@snakepit:/code# npm start

> snakepit@0.0.1 start /code
> node src/service.js

get https://...:8443/1.0 
state head 1
state n0 1
get https://...:8443/1.0/containers 
pitReport []
'Snakepit service running on 0.0.0.0:80'
[...]

With configuration logLevel: 0 this should give you a good start for figuring out what's going on.

To get a better understanding of what a running job/pit looks like from LXD's perspective, you can list the running containers:

$ sudo lxc list
+---------------+---------+-----------------------+--------------------------+------------+-----------+
|     NAME      |  STATE  |         IPV4          |           IPV6           |    TYPE    | SNAPSHOTS |
+---------------+---------+-----------------------+--------------------------+------------+-----------+
| snakepit      | RUNNING | 192.168.... (eth0)    | fd42:...          (eth0) | PERSISTENT |           |
+---------------+---------+-----------------------+--------------------------+------------+-----------+
| sp-head-191-d | RUNNING | 10.125.... (eth0)     | fd42:...          (eth0) | PERSISTENT |           |
+---------------+---------+-----------------------+--------------------------+------------+-----------+
| sp-n0-191-0   | RUNNING | 10.125.... (eth0)     | fd42:...          (eth0) | PERSISTENT |           |
+---------------+---------+-----------------------+--------------------------+------------+-----------+

As you can see, a Snakepit container name (with the exception of Snakepit's service container) consists of the following parts (in the given order): the sp prefix, the name of the node it runs on, the job number, and either d for the job's daemon or the index of the worker.

Help

  1. IRC - You can contact us on the #machinelearning channel on Mozilla IRC; people there can try to answer/help

  2. Issues - If you think you ran into a serious problem, feel free to open an issue in our repo.