@nuwang sorry, got lost in my inbox.
The Beatson group currently have a few physical servers that they are migrating away from (to GVL/NeCTAR). It seems like a shame for such computing power to go to waste, and needlessly complicated to set them up as a separate cluster. Instead I am looking at redeploying them (via Ansible) as blank compute nodes in the GVL SLURM cluster.
The transient FS shared via NFS should be able to be mounted on them, and I am currently exploring other storage solutions as well.
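On the external nodes that should just be a standard NFS client mount, something along these lines (the master hostname and export path here are placeholders for illustration, not the actual values from my setup):

```
# /etc/fstab on an external compute node -- hostname and export path are placeholders
gvl-master.example.org:/mnt/transient_nfs  /mnt/transient_nfs  nfs  rw,hard  0  0
```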
@thomcuddihy Ok, that sounds great. It'd be really good to hear how it goes with the NFS performance etc.
Looking forward, the next version of CloudMan has been designed to do exactly this. It's layered in such a way that the bottommost layer manages cloud nodes via cloudbridge, but the layer above only deals with IPs. Therefore, you can register arbitrary host IPs with the layer above, and they will work as part of the same Kubernetes cluster. Like winter though, it's still coming :-)
NFS as served out by GVL master node is acceptable (80mb/s write on external nodes). Currently building a minimal Ubuntu 14.04/SLURM LXC container to use for the virtual compute nodes.
Don't extend CloudMan itself. It already has the ability to build slurm.conf from a template; extend that instead. The override directory is CONF_TEMPLATE_OVERRIDE_PATH = "/opt/cloudman/config/conftemplates/", so place your custom template at /opt/cloudman/config/conftemplates/slurm.conf.
See /mnt/cm/cm/conftemplates/conf_manager.py
The default template is in /mnt/cm/cm/conftemplates/slurm.conf.default. This has substitution variables that CloudMan replaces with runtime values (e.g. $master_hostname, $total_memory, $worker_nodes). These variables are not mandatory, however, and you could replace the entire $worker_nodes section with a custom list of worker nodes.
So leaving $worker_nodes as is, and then appending new worker nodes below that, looks like the easiest way to integrate new custom workers while leaving the existing CloudMan worker management and scaling functionality intact.
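As a rough sketch of what that override template could look like (the node names, CPU/memory figures and partition layout below are made up for illustration; only the $master_hostname and $worker_nodes variables come from the default template):

```
# Custom template at /opt/cloudman/config/conftemplates/slurm.conf
# CloudMan still substitutes its own variables at runtime.
ControlMachine=$master_hostname

# CloudMan-managed (autoscaled) workers -- left for CloudMan to fill in
$worker_nodes

# Static external workers appended below (example names/specs, adjust to the real hardware)
NodeName=beatson-node[01-04] CPUs=32 RealMemory=128000 State=UNKNOWN

# Example partition combining cloud and external nodes
PartitionName=main Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```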
Yeah, I was just having a look, and inserting those worker declarations using a template will be ideal. I can do some tricky stuff with the partition declarations as well. Regarding the scaling functionality: how does that get triggered? Is it based on the load from SLURM, or from the CM monitoring of the worker nodes?
Based on the load from SLURM. Mostly, it looks at job turnover to scale up and idle nodes to scale down; see cloudman/cm/services/autoscale.py
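Purely as a conceptual sketch of that policy (not CloudMan's actual code -- the real logic lives in autoscale.py, and the names and thresholds below are invented):

```python
# Illustrative only: scale up on slow job turnover, scale down on idle nodes.
def autoscale_decision(queued_jobs, jobs_completed_recently, idle_nodes,
                       current_nodes, min_nodes, max_nodes):
    """Return a scaling action based on cluster load."""
    if queued_jobs > jobs_completed_recently and current_nodes < max_nodes:
        return "add_node"          # jobs backing up faster than they finish
    if idle_nodes > 0 and current_nodes > min_nodes:
        return "remove_idle_node"  # spare capacity sitting idle
    return "no_change"
```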
@thomcuddihy, can you elaborate on this? Are you thinking of adding pre-existing SLURM nodes to a GVL cluster, or registering some (blank) external servers with the GVL so that the SLURM daemon will subsequently be set up on them and integrated with the cluster, or some other scenario, such as directly adding the endpoint of a running SLURM cluster?