pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

Alternative pickler for multiprocessing? #413

Closed jsalvatier closed 9 years ago

jsalvatier commented 10 years ago

The inability to pickle nested functions is a constant thorn in my side. Apparently there are at least two alternative pickling libraries that support nested functions (see http://stackoverflow.com/questions/16626429/python-cpickle-pickling-lambda-functions). Does anyone know if we could somehow use one of them for multiprocessing in psample?
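For anyone landing here without context, a minimal sketch of the failure being described, using only the standard library. The names `make_adder` and `try_pickle` are made up for illustration and are not part of pymc; the exact exception type also varies across Python versions, so the sketch catches several.

```python
import pickle


def make_adder(n):
    # A nested function (closure): pickle stores functions by importable
    # name, and this one cannot be looked up at module level.
    def add(x):
        return x + n
    return add


def try_pickle(obj):
    """Return None on success, or the exception raised while pickling."""
    try:
        pickle.dumps(obj)
        return None
    except (pickle.PicklingError, AttributeError, TypeError) as exc:
        return exc


nested_error = try_pickle(make_adder(1))
lambda_error = try_pickle(lambda x: x + 1)
builtin_error = try_pickle(len)  # importable by name, so this pickles fine

print("nested function:", nested_error)
print("lambda:", lambda_error)
print("builtin len:", builtin_error)
```

Since multiprocessing pickles whatever it sends to worker processes, the same error tends to surface whenever such a function is handed to a process pool.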

jsalvatier commented 10 years ago

Apparently the answer is to use an alternative to the multiprocessing library, pathos (see http://stackoverflow.com/questions/16626429/python-cpickle-pickling-lambda-functions). However, I guess it is a little stale.

Still, this is encouraging. It would simplify development a fair bit, since tracking down why pickling is failing is fairly irritating. It would also make the code a bit clearer in a couple of places.

I don't know how stable Dill, picloud or pathos are though. It may not be wise to rely on them. Then again, they're only needed for psample. Thoughts?

twiecki commented 10 years ago

I have used dill a fair amount and it works pretty reliably. It's also semi-actively developed, and a new stable version was released not too long ago. pathos is actually just a multiprocessing fork using dill under the hood.

There's a good overview by @mrocklin here: http://matthewrocklin.com/blog/work/2013/11/25/Parallelism-and-Serialization/

I think we should aim to be backend-independent though and take a pool as an argument. They mostly provide the same interface (e.g. map()). Having said that, we might want to offer a solid default backend. Since psample() seemed to semi-work, can you be more specific about the cases or models for which it fails?

mrocklin commented 10 years ago

I've started to write code that takes map as an argument. In this way you can pass in __builtin__.map, some Pool.map, or even IPython's view.map_sync if you have a distributed setup.

I think that map is the correct abstraction for parallelism for many of our applications.
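A rough sketch of what "take map as an argument" could look like, assuming Python 3 (where `__builtin__.map` is just the built-in `map`). `sample_chains`, `toy_chain`, and the `map_fn` parameter are hypothetical names for illustration, not pymc's API. A thread pool stands in for the parallel backend here so the sketch does not depend on anything being picklable; a process pool's `map` would slot into the same argument.

```python
from multiprocessing.pool import ThreadPool


def sample_chains(draw_chain, seeds, map_fn=map):
    """Run one chain per seed using whatever map-like callable is supplied."""
    return list(map_fn(draw_chain, seeds))


def toy_chain(seed):
    # Stand-in for real sampling work: a deterministic function of the seed.
    return sum((seed * k) % 7 for k in range(10))


seeds = [1, 2, 3, 4]

# Serial execution with the built-in map.
serial = sample_chains(toy_chain, seeds)

# Same call, but the caller supplies a pool's map instead.
with ThreadPool(2) as pool:
    parallel = sample_chains(toy_chain, seeds, map_fn=pool.map)

print(serial == parallel)
```

The sampler never needs to know which backend it is running on; the caller decides by choosing the map.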

mrocklin commented 10 years ago

I've had installation issues with both dill and pathos. I think Mike is working on them. In general I don't depend on them, but I do recommend them.

jsalvatier commented 10 years ago

I had a bunch of problems converting the distributions from quickclass to regular python classes because you can't use dynamically generated classes or functions. The PointFunc (and one or two other classes) only exist because we can't have a function that returns a function.
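To illustrate the workaround being described, here is a small sketch contrasting a function that returns a function with a module-level callable class that carries the same state. `make_scaler` and `Scaler` are invented for this example and are not pymc's actual PointFunc; the sketch assumes it runs as a normal script or importable module, so that pickle can find the class by name.

```python
import pickle


def make_scaler(factor):
    # Function returning a function: convenient, but the result is a
    # local object that the standard pickler cannot serialize.
    return lambda x: factor * x


class Scaler:
    """Module-level callable class holding the same state as the closure."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, x):
        return self.factor * x


def is_picklable(obj):
    """Report whether the standard pickler accepts obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


closure = make_scaler(3)
instance = Scaler(3)

# The class instance survives a pickle round trip; the closure does not.
restored = pickle.loads(pickle.dumps(instance))
print(is_picklable(closure), is_picklable(instance), restored(2))
```

The class version is more verbose, which is the cost being complained about: each such helper exists only to make something picklable.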

jsalvatier commented 10 years ago

For example, I can't attach a function to calculate the logp to free random variables, I must attach the actual logp because otherwise psample will break.

mrocklin commented 10 years ago

Are these issues with pickle or with all pickle clones?

SymPy's web-service, sympy live, uses dill to serialize just about everything within the SymPy codebase. I don't think we dynamically generate classes but we do do a lot of weird metaclassing and dynamic function generation.

mrocklin commented 10 years ago

@lidavidm might have some wisdom to share here.

jsalvatier commented 10 years ago

Those are issues with pickle as far as I know. Maybe dill won't be able to pickle dynamically generated classes, but the ability to pickle nested functions would still help out I think.

lidavidm commented 10 years ago

@mrocklin We have an experimental PR that uses Dill, but the version that's currently deployed is still using pickle. I think dill has some trouble with a few things, but that's because of App Engine's restrictions.

I believe Dill is part of Pathos? @mmckerns would know.

mmckerns commented 10 years ago

@lidavidm: pathos started out as a big package years back, and I broke it into several independent modules. dill is the serializer.

@jsalvatier: dill can serialize nested functions, lambdas, and dynamically generated classes (in almost all cases).

The other modules that came out of the original pathos are klepto, pyina, and pox; what was left stayed in pathos.

@twiecki: pathos is not only a fork of multiprocessing that is dill-aware; it also has a few other backends (including parallelpython and my own ssh-based pipes, maps, and auto-tunneling). pathos provides an abstract specification for pipes, maps, and queues (for both blocking and asynchronous calls). pyina is the same as pathos, but for MPI and different scheduler backends: it leverages mpi4py and my own bindings to torque, slurm, and other similar technology. pox abstracts the filesystem, enabling abstract syntax across ssh-tunnels, for example; it's the only one of the above that's at all stale. I use all of these packages in my research to provide heterogeneous graph abstraction that can run petascale distributed parallel calculations. klepto is the newest package that I've abstracted out. It's a caching and archiving tool that provides in-memory caching (like python's lru cache) but with an archive backend that the cache can dump to when it gets full.

As @twiecki and @mrocklin mentioned, I'm working on (re)releasing them. pathos (including dill, etc) has been around since 2003, and the "stable" releases of most are a bit stale. Actually, if you use pip, you have to use the --pre option to get most of them, as I used an odd naming convention in the past and am working to rectify that now. The code here https://github.com/uqfoundation has the most recent versions of everything in my stable svn branch. I don't release my unstable svn branches yet because stuff like the rpyc, pycuda, zmq, and other backends is too research-grade.

I am currently converting all my code to python3, and releasing the stable stuff on github so it's pip installable. I expect this will take another two months or more.

Since this is pymc, I should say that my research is predictive science using large-scale optimization and uncertainty quantification. I have a parallel distributed MCMC, Bayesian, and some other quasi-Monte Carlo statistics stuff that might be of use to you guys. Most of it is in mystic, in a branch, but some is in the code on github. It all leverages pathos, dill, and the like; however, they are a minor part of the statistics stuff I rolled into mystic. I'm trying to release a full version of mystic in the next three months or so. I think map, pipe, and queue are the right abstractions if you want to use them in pymc. That's what I use in my own code, along with cache, archive, and monitor (for streaming logging).

@lidavidm: I haven't forgotten about the support for sympy-live. I've added partial support for some of the things needed by App Engine… others, I still have to figure out how to best treat them in dill (either my problem or your packages). But dill pretty much has support for pickling the vast majority of python constructs. See the following: https://github.com/uqfoundation/dill/blob/master/dill/_objects.py (generally "a" and "d" pickle, while "x" doesn't) and https://github.com/uqfoundation/dill/blob/master/dill/dill.py (basically skim this to see what dill does).

mmckerns commented 10 years ago

By the way, I have to do a bunch of parallel MC development this coming year. I would love to see if there's some way we can make our codes work together, etc. If you want to discuss this, let's take it to email.

twiecki commented 10 years ago

@mmckerns Thanks for the summary! The abstractions make a lot of sense.

Certainly get in touch with @jsalvatier @fonnesbeck and/or me via email if you have ideas on integration; could also use the pymc mailing list.

mmckerns commented 10 years ago

Cool. I'll do that.

jsalvatier commented 9 years ago

We ended up refactoring to enable multiprocessing.