gvlproject / gvl.ansible.playbook

Playbook for building the Genomics Virtual Laboratory

Integration of external servers into GVL's SLURM cluster #84

Open · thomcuddihy opened this issue 7 years ago

nuwang commented 7 years ago

@thomcuddihy, can you elaborate on this? Are you thinking of adding pre-existing slurm nodes to a GVL cluster, or registering some (blank) external servers with the GVL so that the SLURM daemon will subsequently be set up on them and integrated with the cluster, or some other scenario, such as directly adding the endpoint of a running SLURM cluster?

thomcuddihy commented 7 years ago

@nuwang sorry, got lost in my inbox.

The Beatson group currently have a few physical servers that they are migrating away from (to GVL/NeCTAR). It seems a shame for that computing power to go to waste, and needlessly complicated to set them up as a separate cluster, so I am looking at redeploying them (via Ansible) as blank compute nodes in the GVL SLURM cluster.

The transient FS shared via NFS should be mountable on them, and I am currently exploring other storage solutions as well.
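For reference, mounting that NFS export on one of the external nodes would look something like the following; the master hostname and export path are placeholders, not the actual GVL layout:

```sh
# Sketch only: mount the GVL master's NFS-exported transient FS on an external node.
# Replace the hostname and export path with the values from the actual cluster.
sudo mkdir -p /mnt/transient_nfs
sudo mount -t nfs -o rw,hard gvl-master.example.org:/mnt/transient_nfs /mnt/transient_nfs
```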

nuwang commented 7 years ago

@thomcuddihy Ok, that sounds great. It'd be really good to hear how it goes with the NFS performance etc.

Looking forward, the next version of CloudMan has been designed to do exactly this. It's layered in such a way that the bottommost layer manages cloud nodes via cloudbridge, while the layer above only deals with IPs. Therefore, you can register arbitrary host IPs with the upper layer and they will work as part of the same Kubernetes cluster. Like winter though, it's still coming :-)

thomcuddihy commented 7 years ago

https://docs.google.com/presentation/d/e/2PACX-1vRmLR__e-KzJYwUO_MGbjx9kXGc6hOSHIbSpGxJajDjxs2-ZgrZqSKHU2DFXIenyzCRhOpZmWh1DcWF/pub?start=false&loop=false&delayms=60000

thomcuddihy commented 7 years ago

NFS as served out by the GVL master node is acceptable (80 mb/s write on the external nodes). Currently building a minimal Ubuntu 14.04/SLURM LXC container to use for the virtual compute nodes.
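For reference, a minimal Ubuntu 14.04 (trusty) container can be created with the stock LXC download template along these lines; the container name is arbitrary, and SLURM still has to be installed and configured inside it:

```sh
# Create and start a minimal Ubuntu 14.04 container to act as a virtual compute node.
# The container name is a placeholder; SLURM must still be installed inside it.
sudo lxc-create -n slurm-node-01 -t download -- -d ubuntu -r trusty -a amd64
sudo lxc-start -n slurm-node-01
```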

thomcuddihy commented 7 years ago

Don't extend CloudMan itself. It has the ability to use a template to build the slurm.conf; extend that instead.

CONF_TEMPLATE_OVERRIDE_PATH = "/opt/cloudman/config/conftemplates/"
/opt/cloudman/config/conftemplates/slurm.conf
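Roughly, that means dropping a customised slurm.conf template into the override directory so CloudMan picks it up instead of its built-in default. A minimal sketch (the source filename here is a placeholder):

```sh
# Sketch: install a custom slurm.conf template where CloudMan's
# CONF_TEMPLATE_OVERRIDE_PATH points, so it is used instead of the default.
# "my_slurm.conf.template" is a placeholder for whatever custom template gets built.
sudo mkdir -p /opt/cloudman/config/conftemplates/
sudo cp my_slurm.conf.template /opt/cloudman/config/conftemplates/slurm.conf
```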

thomcuddihy commented 7 years ago

see /mnt/cm/cm/conftemplates/conf_manager.py

nuwang commented 7 years ago

The default template is in /mnt/cm/cm/conftemplates/slurm.conf.default. This has substitution variables that CloudMan replaces with runtime values (e.g. $master_hostname, $total_memory, $worker_nodes). These variables are not mandatory, however, and you could replace the entire $worker_nodes section with a custom list of worker nodes.
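To illustrate the mechanism (a simplified sketch, not the actual contents of slurm.conf.default), the template looks conceptually like this, with CloudMan substituting the $-variables at runtime:

```
# Simplified sketch of a templated slurm.conf -- not the real slurm.conf.default.
# CloudMan replaces the $-variables with runtime values.
ControlMachine=$master_hostname
NodeName=$master_hostname RealMemory=$total_memory State=UNKNOWN
$worker_nodes
PartitionName=main Nodes=ALL Default=YES State=UP
```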

nuwang commented 7 years ago

So leaving $worker_nodes as-is and then appending the new custom worker nodes below it looks like the easiest way to integrate them, while leaving the existing CloudMan worker management and scaling functionality intact.
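Concretely, a custom template could keep the $worker_nodes placeholder and simply list the external machines after it; the hostnames, CPU/memory figures, and partition name below are placeholders:

```
# CloudMan-managed workers are still substituted in here...
$worker_nodes
# ...and the external (non-CloudMan) nodes are appended below.
# Hostnames, CPUs, RealMemory and the partition name are placeholders.
NodeName=external-node-[01-04] CPUs=16 RealMemory=64000 State=UNKNOWN
PartitionName=external Nodes=external-node-[01-04] Default=NO State=UP
```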

thomcuddihy commented 7 years ago

Yeah, I was just having a look, and inserting those worker declarations using a template will be ideal. I can do some tricky stuff with the partition declarations as well. Regarding the scaling functionality: how does that get triggered? Is it based on the load from SLURM, or from the CM monitoring of the worker nodes?

nuwang commented 7 years ago

Based on the load from SLURM. Mostly, it looks at job turnover to scale up and idle nodes to scale down; see cloudman/cm/services/autoscale.py.
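Just to make the policy concrete (an illustrative sketch, not the actual autoscale.py code; the function name and thresholds are invented):

```python
# Illustrative sketch only -- not CloudMan's actual autoscaling implementation.
# Policy described above: scale up on queued work, scale down on idle workers.
def autoscale_step(queued_jobs, idle_workers, total_workers,
                   min_workers=0, max_workers=10):
    """Return +1 to add a worker, -1 to remove one, or 0 to do nothing."""
    if queued_jobs > 0 and total_workers < max_workers:
        return +1   # jobs are waiting: add a worker node
    if queued_jobs == 0 and idle_workers > 0 and total_workers > min_workers:
        return -1   # no pending work and spare capacity: retire an idle node
    return 0
```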