Closed: mmckerns closed this issue 8 years ago
Mike: This would be great. Let Rory and me know how best to integrate with and borrow from pyina. We offload most of the actual work to IPython parallel and mainly just create batch scripts in ipython-cluster-helper, but we could certainly always use tips, tricks, and testing for better cross-scheduler support, and would be happy to share.
This is a few months off, but we'll be at the pre-BOSC hackathon in July if you're attending ISMB or BOSC this year (http://www.open-bio.org/wiki/Codefest_2014). Thanks for kicking off the discussion.
Brad: BOSC is not my usual conference, but I might do it if my schedule permits. Actually, I may have some bioinformatics work in my near future -- I can discuss over email.
pyina basically sets up mpirun jobs (or scheduler-submitted or other similar jobs) and wraps a multiprocessing interface around them. So at the lowest level, exchanging how we drive the different schedulers and whatnot is a win in itself. The configurations needed for certain national lab machines are also good to share -- I have a few that I don't include, but easily could. Aside from that, what we'd want to do is code to the same API… say, adaptors to the pool and pipe interfaces that multiprocessing uses. I already do this, and do it with pathos, so you'd get the same API for accessing other forms of parallelism (i.e. you can pick the programming model used for execution). It's worth a chat anyway, I think.
Mike: That sounds great. Here is the high-level documentation about how we currently use ipython-cluster-helper in bcbio-nextgen:
https://bcbio-nextgen.readthedocs.org/en/latest/contents/code.html#parallelization-framework
The `prun` section looks similar to what you do with the Torque (or other scheduler) and Mpi classes -- it sets up a parallel environment to run in. Then we use a `run` function in the same way you use `map`. We don't rely on pickling, which we've found gets complex when trying to support both multiprocessing and IPython; instead we have small wrappers that handle setting up the functions for both cases, just calling out to the actual functionality (IPython: https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/distributed/ipythontasks.py and multiprocessing: https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/distributed/multitasks.py).
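A hedged sketch of the "small wrappers instead of pickling" idea (the registry, decorator, and task names below are hypothetical, not bcbio-nextgen's actual code): rather than serializing arbitrary functions, only a task name and plain arguments cross the process boundary, and the worker side looks the function up in a registry.

```python
# Hypothetical illustration: dispatch tasks by name so that only simple,
# easily-serialized values (strings, lists) travel between processes,
# sidestepping the pickling differences between multiprocessing and IPython.

TASKS = {}

def task(fn):
    """Register a function under its name so any worker can find it."""
    TASKS[fn.__name__] = fn
    return fn

@task
def align(sample):
    # stand-in for a real pipeline step
    return "aligned:%s" % sample

def run_task(name, args):
    # only the string `name` and plain `args` need to be serialized
    return TASKS[name](*args)

print(run_task("align", ["sample1"]))  # aligned:sample1
```

Both backends can then share one tiny `run_task`-style entry point, which is the role the ipythontasks.py and multitasks.py wrappers linked above play.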
The other useful abstraction is turning a list of required resources into instructions for the scheduler to create a cluster: run these two programs, where program A needs 3 GB memory/core and 16 cores and program B needs 1 GB memory/core and 8 cores. It handles turning the program specifications into what actually gets sent to the cluster.
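As a rough sketch of that abstraction (function and field names are illustrative, not ipython-cluster-helper's API), one simple policy is to size the cluster to the most demanding program on each axis:

```python
# Hypothetical sketch: collapse per-program resource specs into a single
# cluster request by taking the maximum requirement on each axis.

def cluster_request(programs):
    """Pick cores and memory-per-core that satisfy every program."""
    return {
        "cores": max(p["cores"] for p in programs),
        "mem_per_core_gb": max(p["mem_per_core_gb"] for p in programs),
    }

specs = [
    {"name": "A", "mem_per_core_gb": 3, "cores": 16},
    {"name": "B", "mem_per_core_gb": 1, "cores": 8},
]
print(cluster_request(specs))  # {'cores': 16, 'mem_per_core_gb': 3}
```

For the example from the thread, this yields a 16-core request at 3 GB/core; a real implementation also has to translate that request into each scheduler's batch-script syntax.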
Hope that helps for an overview of what we're doing. Thanks again.
Awesome. Thanks for the nice writeup, I'll have a look.
Hi Mike, looks like this isn't going to happen, so closing it out. Totally happy to reopen later on. Thanks so much!
@roryk: Good move. Hopefully someone becomes annoyed enough to move it off the back burner at some point. Thanks guys, I'll keep an eyeball on your development until then.
pyina supports some schedulers and whatnot that ipython-cluster-helper doesn't, and vice versa: https://github.com/uqfoundation/pyina/blob/master/pyina/launchers.py I also have some machine-specific configs that I keep in a dev branch in my svn.
Both packages are small… we should figure out how to better leverage each other, or at the very least steal from each other mercilessly. This is on my agenda for before summer, depending on proposal and travel commitments. I'm at the labs all the time. Or maybe you guys will be sprinting somewhere I'm attending? Feel free to send email or otherwise get in touch. I know I've brought this up before, but we should do it.