markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0

Joblib/multiprocessing copies parent process memory #913

Closed marscher closed 8 years ago

marscher commented 8 years ago

The situation is that a user has already loaded a fairly large dataset into memory before he/she invokes a parallel estimation. Every time we perform a parallel computation, we create a new process pool, whose workers map the parent's memory. On modern operating systems (like Linux) this virtual memory mapping is copy-on-write: the underlying memory is only copied if it is actually written to.

However, this breaks down in practice: every time a numpy array is passed somewhere, its refcount is incremented, and that refcount update is itself a write to memory. One would assume that the memory wrapped by the array is not touched by this refcounting, but somehow it is. This leads to dramatic increases in memory usage: n_jobs * (size of parent image) + overhead...
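For illustration, a minimal sketch of the mechanism (standard library only; the array is just a stand-in for the user's dataset):

```python
import sys
import numpy as np

data = np.ones(10**6)

# CPython keeps the refcount in the object header; taking any new
# reference writes to that header:
before = sys.getrefcount(data)
ref = data
after = sys.getrefcount(data)
print(before, after)  # after == before + 1

# After fork(), such a write dirties the page holding the header, so the
# kernel has to copy it -- copy-on-write no longer keeps memory shared.
```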

The workaround I suggested to @gph82 is to create the process pool once, before the data is loaded, and then reuse this pool in pyemma._base.estimator.estimate_param_scan.
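A sketch of that ordering, using plain multiprocessing for illustration (the real pool handling sits behind joblib; Unix/fork assumed):

```python
import multiprocessing as mp
import numpy as np

# 1) fork the workers while the parent process is still small ...
pool = mp.Pool(processes=4)

# 2) ... and only then load the large dataset; the workers were forked
#    before it existed, so they never map it
data = np.random.rand(100000, 50)

# 3) reuse `pool` for every subsequent parallel estimation instead of
#    creating a fresh pool from the now-large parent
```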

However, this is only a crude hack and we should think of a sane solution here. It is obviously a bug in multiprocessing/joblib/numpy that needs to be worked out upstream; all we can do is come up with a sane workaround.

I suggest introducing a new config parameter "initial_pool_size", which could default to the number of cores/CPUs of the host, and creating this pool as soon as PyEMMA is imported. It is not nice, but it will fix a lot of MemoryErrors...
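Roughly like this (a sketch; "initial_pool_size" is the proposed, not yet existing, config entry):

```python
import multiprocessing as mp

# hypothetical config entry; defaults to one worker per core
initial_pool_size = mp.cpu_count()

# create the shared pool eagerly at import time, before any user data
# can inflate the parent process
_pool = mp.Pool(processes=initial_pool_size)
```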

gph82 commented 8 years ago

Thanks for the solid 2h debugging session, @marscher. If we go this way, it would also be sane to check that the machine one is working on actually has at least as many CPUs as "initial_pool_size" wants to use, and otherwise reduce it. Right?
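I.e. something like (a sketch; the config value here is hypothetical):

```python
import multiprocessing as mp

initial_pool_size = 8  # value read from the (hypothetical) config

# never ask for more workers than the host actually has
pool_size = min(initial_pool_size, mp.cpu_count())
```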


marscher commented 8 years ago

One could also try to configure the multiprocessing package to use the "forkserver" start method, which spawns a small server process early on that then forks minimal-footprint workers on demand.
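In stock multiprocessing that would be (available since Python 3.4; it has to happen before any worker is created):

```python
import multiprocessing as mp

if __name__ == "__main__":
    # workers are then forked from a small, early-started server
    # process instead of from the full-size parent
    mp.set_start_method("forkserver")
```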

marscher commented 8 years ago

https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

marscher commented 8 years ago

However, multiprocessing is wrapped by joblib itself, so we cannot set the start method directly on the pool PyEMMA uses.
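At best one can choose joblib's backend, e.g. (a sketch against joblib's public API):

```python
from joblib import Parallel, delayed

def square(x):
    return x * x

# joblib manages its own pool; the backend can be chosen here, but the
# underlying start method is not exposed through this API
results = Parallel(n_jobs=4, backend="multiprocessing")(
    delayed(square)(i) for i in range(8))
```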

franknoe commented 8 years ago

I have had mostly bad experiences with Python multiprocessing and would probably avoid it in general (at least PyEMMA shouldn't use it by default). I'm not sure in what exact context you have tried it, but apart from the fact that it sometimes crashes or blocks, in some cases it can be slower than serial execution. My feeling is that this happens mainly when there is a lot of data associated with the serialized object, so that a lot of work goes into allocating new memory and serializing/deserializing objects.

In order to work efficiently with multiprocessing, I think estimator objects need to be designed "lean", which is not something we have done. As an example, consider the MaximumLikelihoodMSM, which takes the discrete trajectories as input and keeps a reference to them. Multiprocessing this object means that for every process we make a full copy of the discrete trajectories and pay the price for serialization/deserialization; as a result it may take longer than just running the multiple computations on one core. In contrast, if you used not the discrete trajectories but the (usually sparse) count matrix as input, the memory and (de)serialization overhead would usually be much smaller, and consequently parallelization would be more efficient.
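For illustration, the lean input is cheap to build once in the parent; a minimal sketch (PyEMMA's own count-matrix routines are more general):

```python
import numpy as np
import scipy.sparse

def count_matrix(dtraj, lag, n_states):
    """Sparse transition count matrix of one discrete trajectory."""
    rows, cols = dtraj[:-lag], dtraj[lag:]
    data = np.ones(len(rows))
    C = scipy.sparse.coo_matrix((data, (rows, cols)),
                                shape=(n_states, n_states))
    return C.tocsr()  # duplicate (i, j) pairs are summed on conversion

# shipping C to a worker costs O(number of nonzeros),
# not O(trajectory length)
```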

So should we do this for every estimator? That's hard, because it's not very transparent what happens here, and for some estimators (e.g. TRAM) I don't know a reduced representation of the data that would make the estimator leaner. My feeling is that it would be more efficient to parallelize the low-level routines and then simply not bother parallelizing the high-level ones. That means: we only have a few key algorithms that take most of the computation time when doing multiple estimates or Bayesian sampling:

  1. computing covariance matrices.
  2. reversible maximum likelihood of transition matrix.
  3. reversible Bayesian sampling of transition matrix.

I bet these three algorithms already explain most of the waiting time. 1. should in principle be parallelized already if we have the right numpy installation, but I have had cases where this was not given, and I'm not sure how to enforce it in a practical and safe way. For 2.+3. we could write an OpenMP parallelization that just processes the matrices in blocks or stripes.
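For 1., one quick diagnostic (it only inspects the build, it cannot enforce anything):

```python
import numpy as np

# prints the BLAS/LAPACK configuration numpy was built against; an MKL
# or OpenBLAS entry indicates threaded matrix products are available
np.__config__.show()
```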

Even more generally speaking, I'm a big fan of finding better algorithms rather than spending much time on parallelization. For practically everything we do right now, I'm sure we can find algorithms with better complexity, e.g. replacing an order-N step by an order-log-N one here and there. For example, it seems we have an efficient way to get an equally good estimator for 2. above without going through an expensive iteration. More of that discussion when I'm back in Berlin.

gph82 commented 8 years ago

Hi, I don't know about (some of) the technical issues, and I am all for better algorithms (who isn't?). I can report that I tested this branch (timescales(**, n_jobs=5)) and confirm that the speedup is significant! Thanks @marscher

marscher commented 8 years ago

@franknoe: Some assumptions you've made are not true. For instance, the need to serialize things in multiprocessing only holds on platforms that do not provide the "fork" start method. Fork is available on all Unix systems, so serialization overhead only comes into play on Windows. The fork method creates a virtual address mapping of the process being forked from and copies memory if and only if that memory is written. So creating sub-processes this way is, in theory, harmless and very quick.
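A sketch of that behaviour (Unix only; the worker reads the inherited array through the parent's page mapping, without pickling it):

```python
import multiprocessing as mp
import numpy as np

big = np.arange(10**7)  # lives only in the parent

def head_sum():
    # the child sees `big` purely through fork()'s copy-on-write
    # mapping; nothing was pickled to get the array here
    return int(big[:10].sum())

if __name__ == "__main__":
    ctx = mp.get_context("fork")  # Unix-only start method
    with ctx.Pool(2) as pool:
        print(pool.apply(head_sum))  # 45
```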

At the moment we are only using multiprocessing for embarrassingly parallel stuff, like evaluating the timescales of N models, which cannot be done any quicker by finding a better algorithm (well, I cannot exclude that, but finding one would be pretty hard, since this has already been optimized for decades).
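The pattern being parallelized is simply the following (a self-contained sketch; the dummy class stands in for the real estimated models):

```python
from joblib import Parallel, delayed

class DummyModel:
    """Stand-in for an estimated MSM; only provides timescales()."""
    def __init__(self, ts):
        self._ts = ts
    def timescales(self):
        return self._ts

models = [DummyModel([10.0 / (k + 1)]) for k in range(5)]

def its(model):
    return model.timescales()

# each model is evaluated independently -- no communication between jobs
all_timescales = Parallel(n_jobs=2)(delayed(its)(m) for m in models)
print(all_timescales)
```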

marscher commented 8 years ago

https://stackoverflow.com/questions/26025878/what-is-being-pickled-when-i-call-multiprocessing-process

franknoe commented 8 years ago

On 30/08/16 at 14:04, Martin K. Scherer wrote:

> @franknoe https://github.com/franknoe: Some assumptions you've made are not true. For instance, the need to serialize things in multiprocessing only holds on platforms that do not provide the "fork" start method. Fork is available on all Unix systems, so serialization overhead only comes into play on Windows. The fork method creates a virtual address mapping of the process being forked from and copies memory if and only if that memory is written. So creating sub-processes this way is, in theory, harmless and very quick.

Hmm, I have had very bad experiences on Mac, which is a Unix system.

> At the moment we are only using multiprocessing for embarrassingly parallel stuff, like evaluating the timescales of N models, which cannot be done any quicker by finding a better algorithm (well, I cannot exclude that, but finding one would be pretty hard, since this has already been optimized for decades).

I agree that, if it works reliably, parallelizing embarrassingly parallel stuff is the way to go, but we should really test this on a few systems, because I have even had situations where it becomes slower or completely freezes the system.

