aculich opened this issue 9 years ago
How often would a user want to use SLURM on a cloud-based cluster? Presumably the user would generally have full and exclusive control over the cloud nodes and therefore wouldn't need a scheduler.
They won't be able to use their Savio SLURM scripts anyway, as the QoS and other details will differ, no?
That said, it would be nice if an MPI submission would just work on a cloud-based cluster based on a job previously run on Savio.
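That friction is concentrated in a handful of scheduler directives. Here is a sketch of the site-specific header of a SLURM+MPI submit script; all partition, QoS, and account names below are illustrative placeholders, not actual Savio or Azure values:

```shell
#!/bin/bash
# Illustrative submit script -- only the #SBATCH header usually needs
# editing when moving between clusters; the MPI launch line is portable.
#SBATCH --job-name=mpi-test
#SBATCH --partition=savio        # partition names are site-specific
#SBATCH --qos=savio_normal       # QoS policies are site-defined
#SBATCH --account=my_project     # accounting setup differs per cluster
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=20

# SLURM hands the allocation to the MPI launcher, so this line can stay
# the same on any SLURM+MPI cluster.
mpirun ./my_mpi_program
```

So "just working" for MPI submissions mostly means getting the header directives translated between sites.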
We may want to discuss how to prioritize use of AWS, Google, and Azure for BCE. I would put them in that order, I think.
On Tue, Jul 28, 2015 at 6:14 PM, Aaron Culich notifications@github.com wrote:
Enabling this in a Savio-compatible way http://research-it.berkeley.edu/services/high-performance-computing/system-overview would allow mobility of code from a local HPC cluster to an Azure cloud-based HPC cluster, without an end user having to change their code or submit scripts.
See azure-quickstart-templates#422 https://github.com/Azure/azure-quickstart-templates/issues/422#issuecomment-125790555 for further discussion.
Since Azure already provides a pre-configured Ubuntu instance running SLURM, my request to them is to simply add MPI to their image.
This addresses the use case of migrating from Savio (or a similarly configured HPC cluster) directly to a cloud environment without changing the code or submit scripts, since changing those or retraining people can take some time.
In the long run there is probably a better strategy than using SLURM, but I'm focused right now on having a very easy migration path for people. An example is our reading group topic this Thursday using Savio, Azure, and AWS.
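To see exactly which fields a migrated submit script has to match, one can list the partition and QoS names each cluster defines using the standard SLURM client tools (`sinfo`, `sacctmgr`). A minimal sketch; the fallback message is just for machines without SLURM installed:

```shell
#!/bin/sh
# List the site-defined names that a ported submit script must match.
show_site_config() {
    if command -v sinfo >/dev/null 2>&1; then
        echo "partitions:"
        sinfo --noheader -o "%P"                   # partition names
        echo "qos:"
        sacctmgr --noheader show qos format=Name   # QoS names
    else
        echo "slurm-not-installed"
    fi
}
show_site_config
```

Running this on both clusters gives a direct diff of what needs to change in a migrated script.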
As far as prioritizing platforms goes, I'm agnostic; I'm really just responding to the demand for each as I hear it.
After a brief, superficial look, this seems awesome: it would really make it easier for people to migrate Azure <-> Savio. I agree that reducing code changes in a research environment is really the key, imo. EC2 <-> Savio would then be ideal.
For some parallel and AWS functionality, I wrote some add-on scripts, so an open question is whether this SLURM+MPI functionality should be in core BCE or available as an add-on.
Hadn't noticed until now that you had updated the documentation on how to use BCE to include the scripts for the parallel computation tools. I'll give that a shot.
Azure now supports RDMA, though currently only with their SUSE Linux image; support for other distributions will be added in the future:
The current release of Azure Linux RDMA supports SUSE Linux Enterprise Server 12 (SLES12). We will continue to work with other Linux distributions and will have more to say about other supported distributions in the near future. A SLES 12 image with completely integrated RDMA drivers, specifically tuned for HPC workloads, is available now in the Azure marketplace.
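For what it's worth, one can check from inside a VM whether RDMA hardware is actually exposed to the guest. The sketch below uses the standard Linux RDMA tool and device names (`ibv_devinfo`, `/dev/infiniband`); whether the Azure SLES12 HPC image surfaces RDMA through this exact path is an assumption on my part:

```shell
#!/bin/sh
# Report whether RDMA hardware is visible to this VM.
detect_rdma() {
    if command -v ibv_devinfo >/dev/null 2>&1 && ibv_devinfo >/dev/null 2>&1; then
        echo "rdma-verbs"      # userspace verbs stack sees a device
    elif [ -d /dev/infiniband ]; then
        echo "devices-only"    # device nodes exist, ibverbs tools missing
    else
        echo "none"            # no RDMA hardware exposed to the guest
    fi
}
detect_rdma
```

On an ordinary (non-RDMA) instance this just reports that nothing is exposed, which is itself useful when debugging MPI performance.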