roryk / ipython-cluster-helper

Tool to easily start up an IPython cluster on different schedulers.

Tweaking of ipython-cluster-helper #32

Closed mwojcikowski closed 9 years ago

mwojcikowski commented 9 years ago

I'm currently looking for a way to tweak ipython-cluster-helper so that the cluster becomes available more quickly. My cluster is not fully utilized, meaning you get all the resources you ask for instantaneously. Currently, setting up the cluster takes ~180 sec, but I can see that all the engines are ready after just around 20 sec, and that includes the delayed startup of the ipengines.

I tried changing the heartbeat period to 5000 ms (https://github.com/roryk/ipython-cluster-helper/blob/master/cluster_helper/cluster.py#L57), but that still resulted in a ~120 sec startup. Is that expected, or is there something wrong with my setup?

chapmanb commented 9 years ago

Maciej; Thanks for testing this out and sorry about the delay issues. Within ipython-cluster-helper we check at intervals of 30 seconds (unless spinning up a non-distributed test cluster):

https://github.com/roryk/ipython-cluster-helper/blob/84aff2e7ec928b0fac34f0058321b948b771d71f/cluster_helper/cluster.py#L889

so I believe the delay comes from within IPython parallel itself, and I don't personally know of parameters to tweak to make it faster. It might be worth asking the IPython folks for advice. We're happy to try out new parameters here if you can identify things to fix. Sorry not to have something practical to try, but I hope this helps.
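As a rough illustration of how a fixed check interval affects the startup time a user sees, here is a minimal sketch. It is not the code in cluster.py: the function name `observed_ready_time` and the assumption that the first check runs one full interval after startup are both hypothetical, chosen only to show the rounding effect.

```python
import math


def observed_ready_time(actual_ready_s, poll_interval_s):
    """Return when a fixed-interval poller first sees the engines as ready.

    Assumption (not taken from cluster.py): the first check runs one full
    interval after startup, and checks repeat every poll_interval_s seconds.
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    # Round the true ready time up to the next polling tick (at least one tick).
    ticks = max(1, math.ceil(actual_ready_s / poll_interval_s))
    return ticks * poll_interval_s


if __name__ == "__main__":
    # Engines ready after 20 s, as reported earlier in this thread.
    for interval in (30, 5):
        print(interval, observed_ready_time(20, interval))
```

Under these assumptions a single 30-second check interval adds at most one interval of delay, so it would not by itself account for a much longer startup.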

mwojcikowski commented 9 years ago

After lowering both values to 5 sec (the heartbeat period to 5000 ms and the check interval to 5 s), the setup time is around 30 sec, which is acceptable. By setup time I mean the delay from the ipcontroller start to the start of computations.

I'm not saying these are the 'good' values, but it might be wise to expose them to the user as adjustable settings. Also, how costly are those checks, that they are run so rarely?

chapmanb commented 9 years ago

Maciej; Thanks for testing this and for the suggestions. I pushed a fix which increases the ping frequency and at the same time allows more missed pings, so we can continue to handle interruptible queues (the goal of the long timeout) without limiting startup time. If this works well for you in practice we can roll a new release with these fixes. Thanks again for all the helpful feedback.
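The trade-off described here can be sketched with a little arithmetic: if the time an engine may stay silent before being dropped is roughly the ping period times the number of allowed missed pings, then a shorter period needs proportionally more allowed misses to keep the same tolerance. The helper name and the numbers below are hypothetical and are not the values used in the actual fix.

```python
def max_missed_for_same_tolerance(old_period_ms, old_max_missed, new_period_ms):
    """Allowed missed pings that keep at least the old silence tolerance.

    Assumption: tolerance is modeled as period * allowed missed pings.
    """
    if min(old_period_ms, old_max_missed, new_period_ms) <= 0:
        raise ValueError("all arguments must be positive")
    tolerance_ms = old_period_ms * old_max_missed
    # Integer ceiling division, so the new tolerance is never shorter.
    return -(-tolerance_ms // new_period_ms)


if __name__ == "__main__":
    # Hypothetical example: a 60 s period with 2 allowed misses,
    # replaced by a 5 s period with the same overall tolerance.
    print(max_missed_for_same_tolerance(60000, 2, 5000))
```

With these illustrative numbers, pinging every 5 seconds instead of every 60 lets engines be detected sooner, while an interrupted queue still gets the same two-minute grace period.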

mwojcikowski commented 9 years ago

You're welcome. I like the tool very much and intend to make some further contributions if possible. Off the top of my head:

chapmanb commented 9 years ago

Maciej; Thanks, we're always open to ideas for better abstractions and ways of using this. Within our group, we primarily use ipython-cluster-helper in bcbio, so some of the abstractions there might be helpful as examples of what to do (or not to do, depending on your opinions). We wrap around joblib and ipython-cluster-helper to provide one way to run code locally or remotely:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/distributed/prun.py
https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/distributed/ipythontasks.py

Having more lightweight ways of accomplishing this in ipython-cluster-helper would be welcome.