mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Beyond traditional HPC: containers and cloud computing #102

Closed: wlandau closed this issue 3 years ago

wlandau commented 6 years ago

Can clustermq use workers on AWS, Digital Ocean, arbitrary remote Docker containers, etc.? It seems straightforward, for example, to use the ssh scheduler to deploy to workers on the same AWS instance. But what about a single pool of workers spread over multiple instances?

I was at an R conference last week, and there seems to be uncertainty and debate about the long-term future of traditional HPC systems. cc @dpastoor

mschubert commented 6 years ago

Yes, that's definitely on the list.

In principle, you should already be able to use everything that you can connect to via SSH and has multicore set up. However, I have never tested anything like that.
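For a single remote machine, the SSH connector described in the docs should already apply; a minimal sketch, assuming a hypothetical host `user@remote-host` that has R, clustermq, and multicore support set up (the host name and log path are placeholders):

```r
library(clustermq)

options(
    clustermq.scheduler = "ssh",
    clustermq.ssh.host = "user@remote-host",  # placeholder; needs key-based SSH
    clustermq.ssh.log = "~/ssh_proxy.log"     # optional, useful for debugging
)

fx = function(x) x * 2
Q(fx, x = 1:3, n_jobs = 1)  # runs the calls on the remote machine
```

The open question in this thread is the next step: one worker pool spanning several such hosts, which the current SSH connector does not do.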

For multiple remote machines, this will require some changes in how clustermq works. These will likely happen, but not in the near future.

chapmandu2 commented 5 years ago

Have you looked at Docker and Kubernetes for parallel processing in the cloud? A Kubernetes cluster is a lot easier to set up on AWS or Azure than a conventional cluster, plus you get scaling thrown in. Interestingly, RStudio Server Pro has just added this feature. I'm looking at makeClusterFunctions in batchtools and makeClusterPSOCK in future, but I think Kubernetes might be better. Thanks for the great packages.
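For comparison, makeClusterPSOCK (re-exported by future) can already pool several machines over SSH; a sketch with hypothetical host names, assuming each node is reachable by key-based SSH and has R installed:

```r
library(future)

# Hypothetical hosts; replace with machines you can SSH into
cl <- makeClusterPSOCK(
    workers = c("node1.example.com", "node2.example.com"),
    user = "me"
)

plan(cluster, workers = cl)

# Futures now resolve on the remote pool
f <- future(Sys.info()[["nodename"]])
value(f)
```

This is roughly the multi-host behavior being requested for clustermq's SSH connector.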

pat-s commented 5 years ago

While we are currently building up an HPC, we still have several standalone machines. It would be great if we could use the SSH connector to distribute jobs across all of them.

This would work perfectly with drake and the jobs argument to make(), which could then distribute parallel targets across as many SSH machines as are available.
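The drake side of this already exists; a sketch of the documented clustermq backend, here with the multicore scheduler standing in for the hoped-for multi-host SSH mode (the plan itself is a toy example):

```r
library(drake)

# Placeholder backend; a multi-host SSH scheduler would slot in here
options(clustermq.scheduler = "multicore")

plan <- drake_plan(
    result = target(
        sqrt(x),
        transform = map(x = !!seq_len(4))
    )
)

# drake hands targets to clustermq workers, up to `jobs` at a time
make(plan, parallelism = "clustermq", jobs = 4)
```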

mschubert commented 5 years ago

Thank you for the hints re kubernetes, @chapmandu2.

@pat-s Is there a reason why you don't set up a scheduler on your HPC? That would not only support clustermq as it is, but also many other tools interfacing with them (that you may want down the line).

pat-s commented 5 years ago

As I said, we're already building an HPC with Warewulf and Slurm. Until then, we have several standalone servers that are used for production and cannot be turned off until there is a production-ready replacement 🙂 Our main goal is to combine all of them, but in the meantime the multiple-SSH approach would be a nice thing to have.

mschubert commented 5 years ago

I dropped a word (the "while") when reading; on re-reading, you did indeed say that. Sorry.

I'm afraid I won't have multiple SSH hosts set up in the next couple of weeks.

wlandau commented 4 years ago

What about AWS Batch? Metaflow uses it.

wlandau commented 4 years ago

Looks like paws::batch() creates a client object with a submit_job() method, though I am not sure how to return the job's data.
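A sketch of what that looks like, assuming AWS credentials are configured and using hypothetical queue and job-definition names; note submit_job() only returns job metadata, not results:

```r
library(paws)

batch <- paws::batch()

# All three names below are placeholders for resources you would
# have created in AWS Batch beforehand
resp <- batch$submit_job(
    jobName = "clustermq-test",
    jobQueue = "my-queue",
    jobDefinition = "my-job-def"
)
resp$jobId  # an identifier, not the computed data

# Getting data back would need a separate channel, e.g. the worker
# writing results to S3 and the caller polling and reading them,
# or (as clustermq does elsewhere) a direct ZeroMQ connection.
```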