uqfoundation / pathos

parallel graph management and execution in heterogeneous computing
http://pathos.rtfd.io
Other
1.38k stars 89 forks source link

enable parallel streaming computation #22

Open mrocklin opened 10 years ago

mrocklin commented 10 years ago

I have a large sequence of data on disk. I'd like to stream it into memory, distribute it to various processes where expensive work is done, collect the results piece by piece onto a master process and perform a reduction. I.e. I would like the following to work:

from pathos.multiprocessing import Pool
p = Pool(32)
seq = load_lazily_from_disk(...)
out = p.map(func, seq, chunksize=100)
result = reduce(binop, out)

Sadly this doesn't work, because p.map fully evaluates my sequence. See

def mapAsync(self, func, iterable, chunksize=None, callback=None):
    '''
    Asynchronous equivalent of `map()` builtin
    '''
    assert self._state == RUN
    if not hasattr(iterable, '__len__'):
        iterable = list(iterable)   # <---fully evaluate

Maybe we can avoid this somehow? The combination of multiprocessing and streaming would be very helpful for me in particular (and others generally I think.)

mmckerns commented 10 years ago

@mrocklin: It's a good idea. I have already had one built as a summer student project, and it lives in one of my unpublished branches. It's there b/c it needs some cleanup before it goes to trunk (which is what is published on git). It was for MPI, instead of multiprocessing, but the design should be the same.

I call them "conditional maps", as you can give the map a completion criteria, and they complete when the criteria is satisfied. That way, you get both the "streaming" nature you are looking for as well as robustness (not all jobs have to come back for you are considered 'done'). I'm using them in research-grade applications, but haven't had the time to get them fit for public consumption. I know that's like saying that I have candy, but I can't share… so it does you no good… but it is on my list of things to port over from my svn branches. I have to use them in some of the upcoming contract work, and they need to be really stable… so I figure I'll need to solidify the abstraction soon, and move them into trunk.

In short, I agree. I need it too. It should be done.

mrocklin commented 10 years ago

Glad to hear it.

If I need this more quickly are you open to pull requests?

mmckerns commented 10 years ago

yes, of course

mmckerns commented 10 years ago

pathos needs a facelift, and like most of my other UQ Foundation codes, there are years of work still left in the branches to get integrated to the trunk. I'm hopeful that I get funding to support about 3 developers, with first priority to pull in useful stuff out of the branches. As it stands now, feel free to take a shot at anything you like here or otherwise. I'd also be happy to discuss design, etc.