roryk / ipython-cluster-helper

Tool to easily start up an IPython cluster on different schedulers.

Support for round_robin allocation method. #19

Closed alexbw closed 9 years ago

alexbw commented 10 years ago

$round_robin is a valid SGE allocation rule, and should be supported.

chapmanb commented 10 years ago

Alex; Thanks for the patch and for bringing this issue up. My understanding of the round_robin algorithm was that it explicitly tries to distribute cores across multiple worker nodes. That's the opposite of what we want here, which is to allocate a block of cores on the same machine. Is there an SGE setup with round_robin that allows continuous filling on a single machine?
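
For readers less familiar with the allocation rules being discussed, here is a rough sketch of how `$pe_slots`, `$fill_up`, and `$round_robin` differ when slots are handed out. It is a simplified model for illustration only, not SGE's actual scheduler; the `allocate` function and the host names are hypothetical.

```python
def allocate(rule, slots_needed, free_slots):
    """Toy model of three SGE allocation rules (illustrative assumption only).

    free_slots maps host name -> free slot count and is not modified.
    Returns {host: slots_granted}, or None if the request cannot be met.
    """
    hosts = list(free_slots)

    if rule == "$pe_slots":
        # All slots must come from a single host.
        for host in hosts:
            if free_slots[host] >= slots_needed:
                return {host: slots_needed}
        return None

    if sum(free_slots.values()) < slots_needed:
        return None

    granted = {}
    remaining = slots_needed

    if rule == "$fill_up":
        # Exhaust one host before moving on to the next.
        for host in hosts:
            take = min(free_slots[host], remaining)
            if take:
                granted[host] = take
                remaining -= take
            if remaining == 0:
                break
        return granted

    if rule == "$round_robin":
        # Hand out one slot per host per pass.
        left = dict(free_slots)
        while remaining:
            for host in hosts:
                if remaining and left[host] > 0:
                    granted[host] = granted.get(host, 0) + 1
                    left[host] -= 1
                    remaining -= 1
        return granted

    raise ValueError("unknown allocation rule: %s" % rule)


free = {"node1": 4, "node2": 4}
print(allocate("$pe_slots", 4, free))
print(allocate("$fill_up", 6, free))
print(allocate("$round_robin", 4, free))
```

In this model only `$pe_slots` guarantees that a multi-core request lands on one machine, which is the behavior described above as the goal.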

alexbw commented 10 years ago

I use this project on an EC2 cluster I occasionally spin up for large batch processing, and have found that the round_robin algorithm yields higher performance because fewer cores on each node compete for access to local memory.

The same effect can be achieved by setting num_cores_per_job to a value greater than 1, but that assumes the number of available cores is evenly divisible by the number of cores requested per job. Otherwise, the submitted engine jobs will remain pending indefinitely. In effect, round_robin allows fractional cores per job, which is useful if a program is not compute bound but is limited by the speed of the bus between the CPU and RAM.
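
To make the divisibility point concrete, here is a small arithmetic sketch. It assumes every engine must fit entirely on one node and that all nodes are the same size; the helper names are hypothetical and are not part of ipython-cluster-helper.

```python
def node_capacity(cores_per_node, num_cores_per_job):
    """Return (engines that fit on one node, cores left idle on that node)."""
    return divmod(cores_per_node, num_cores_per_job)


def pending_engines(num_jobs, num_cores_per_job, cores_per_node, num_nodes):
    """Engines that can never start, assuming each engine needs a whole
    block of num_cores_per_job cores on a single node (an assumption)."""
    fit_per_node, _idle = node_capacity(cores_per_node, num_cores_per_job)
    return max(0, num_jobs - fit_per_node * num_nodes)


# Two 8-core nodes, 3 cores per engine: 2 engines per node, 2 cores idle each.
print(node_capacity(8, 3))
# Asking for 6 engines (18 cores, more than the 16 available) leaves 2 queued.
print(pending_engines(6, 3, 8, 2))
```

Under these assumptions the leftover cores on each node go unused, and any engine beyond what fits stays queued.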

As a side note, the error message is not informative, and may be especially confusing for users who do not have access to their cluster's parallel configuration.

I suppose at the very least, I could write a better error message, and write code to request a certain amount of memory per job, effectively limiting the number of jobs per node. This would achieve the same result in a more explicit way. Otherwise, some top-level way of evenly distributing jobs across available cores would be quite useful for memory-access-bound programs.
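
As a sketch of the memory-request idea, the helpers below work out how much memory each job would need to request so that only a chosen number of jobs fit on a node. The function names and the example numbers are hypothetical, and the approach only works if the scheduler treats memory as a consumable resource (on SGE that depends on how the cluster defines resources such as `mem_free` or `h_vmem`).

```python
def jobs_per_node(node_mem_mb, mem_per_job_mb, cores_per_node, cores_per_job=1):
    """Jobs that fit on one node when both memory and cores are limiting."""
    by_memory = node_mem_mb // mem_per_job_mb
    by_cores = cores_per_node // cores_per_job
    return min(by_memory, by_cores)


def mem_request_for_cap(node_mem_mb, max_jobs):
    """Per-job memory request (MB) intended to cap a node at max_jobs jobs."""
    request = node_mem_mb // max_jobs
    # Integer division can round down far enough to admit an extra job,
    # so verify the cap instead of trusting it.
    if node_mem_mb // request > max_jobs:
        raise ValueError("cannot enforce this cap with a whole-MB request")
    return request


# Example: an 8-core node with 32000 MB, capped at 4 single-core jobs.
request = mem_request_for_cap(32000, 4)
print(request)
print(jobs_per_node(32000, request, cores_per_node=8))
```

With these example numbers each job would request 8000 MB, so four jobs fill the node's memory even though eight cores are available.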

roryk commented 9 years ago

Hi @alexbw,

Thanks for the pull request, and sorry for not merging it in. Round robin doesn't seem to match what we want, but let us know if we're wrong about that. SGE is complicated. Thanks!