Closed fonnesbeck closed 8 years ago
Probably the problems occur because of the forking of the process, as reported in the joblib docs under "Bad interaction of multiprocessing and third-party libraries": https://pythonhosted.org/joblib/parallel.html. One solution could be to use spawning instead on Python 3.4 and above. However, I am using 2.7, so there we would need another solution. One is suggested here for forking GPU processes: https://github.com/Theano/Theano/wiki/Using-Multiple-GPUs. Probably this could be used for CPUs as well?
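On Python 3.4+ the worker start method can be chosen explicitly; here is a quick, minimal check of what a platform offers (forkserver is Unix-only; joblib's start method can reportedly also be selected via a JOBLIB_START_METHOD environment variable):

```python
import multiprocessing as mp

# "fork" copies the parent's Theano/BLAS state into the worker, which
# is the bad interaction described in the joblib docs; "spawn" and
# "forkserver" start workers from a clean interpreter instead.
print(mp.get_all_start_methods())   # e.g. ['fork', 'spawn', 'forkserver'] on Linux
```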
Thanks for the info. I've tried setting JOBLIB_START_METHOD='forkserver'
which works in the sense of preventing a crash, but I start to see other errors:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/fonnescj/anaconda3/lib/python3.5/multiprocessing/process.py", line 254, in _bootstrap
self.run()
File "/Users/fonnescj/anaconda3/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/Users/fonnescj/anaconda3/lib/python3.5/multiprocessing/pool.py", line 108, in worker
task = get()
File "/Users/fonnescj/anaconda3/lib/python3.5/site-packages/joblib/pool.py", line 359, in get
return recv()
File "/Users/fonnescj/anaconda3/lib/python3.5/multiprocessing/connection.py", line 251, in recv
return ForkingPickler.loads(buf.getbuffer())
File "/Users/fonnescj/Repositories/pymc3/pymc3/distributions/distribution.py", line 18, in __new__
raise TypeError("No model on context stack, which is needed to "
TypeError: No model on context stack, which is needed to use the Normal('x', 0,1) syntax. Add a 'with model:' block
Process ForkServerPoolWorker-3:
Traceback (most recent call last):
File "/Users/fonnescj/Repositories/pymc3/pymc3/model.py", line 113, in get_context
return cls.get_contexts()[-1]
IndexError: list index out of range
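For context, both errors come from pymc3's model context stack: distributions can only be created inside a `with model:` block, and the forkserver worker that unpickles them starts with an empty stack. A minimal sketch of the pattern (a hypothetical, simplified Model, not pymc3's actual class in pymc3/model.py):

```python
class Model:
    # class-level stack shared by all instances
    contexts = []

    def __enter__(self):
        Model.contexts.append(self)
        return self

    def __exit__(self, *exc):
        Model.contexts.pop()

    @classmethod
    def get_context(cls):
        if not cls.contexts:   # empty in a freshly spawned worker
            raise TypeError("No model on context stack")
        return cls.contexts[-1]

with Model() as m:
    print(Model.get_context() is m)   # True: inside the block the model is found

try:
    Model.get_context()               # outside the block: the error above
except TypeError as e:
    print(e)                          # No model on context stack
```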
I will read up on Theano's strategy as well. We really ought to get GPU multiprocessing going, however. Seems like low-hanging fruit.
One other thing that I noticed, at least with the Text backend, which is a problem. backend = Text(name, model) initialises the backend object with the target folder in backend.name and the respective .csv file path in backend.filename (after backend.setup), and .df then contains the sampled trace values after running sample, like: trace = sample(draws, step, ...)

Now the BIG BIG BUT: the returned trace still refers to the backend instance, with .df and .filename. If you do backend.df = None, your trace.df will be None as well. That's OK if you just run one chain. But if you run several chains, especially serially, every MultiTrace object is tied to the same backend, because backend.setup(draws, chainnumber) only opens the csv file on disk and does not copy the backend's base trace object. So before each sample you need to re-initialise the backend, instead of only calling backend.setup repeatedly, which is what _sample does. Somehow the whole object would need to be copied in backend.setup, but I have no idea how to do that.
I only noticed because I wanted to run sample several times in a loop and collect all the traces in a list. It turned out all the traces had the same values, but on disk they were written properly. When loading the traces back from disk, they are correctly loaded into separate objects.
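The aliasing can be shown without pymc3 at all; a toy sketch (a hypothetical Backend class standing in for the Text backend, not the real API):

```python
class Backend:
    """Hypothetical stand-in for pymc3's Text backend."""
    def __init__(self, name):
        self.name = name
        self.df = None    # filled with trace values by sample()

def sample(backend, values):
    backend.df = values   # setup() mutates the same object on each run...
    return backend        # ...and the returned "trace" aliases it

# Reusing one backend: every collected trace points at the last run
shared = Backend("chain")
reused = [sample(shared, [run]) for run in range(3)]
print([t.df for t in reused])    # [[2], [2], [2]]

# Fresh backend per run: each trace keeps its own values
fresh = [sample(Backend("chain"), [run]) for run in range(3)]
print([t.df for t in fresh])     # [[0], [1], [2]]
```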
I have no idea about the other backends, sqlite and ndarray. Apparently that's the issue with sqlite as well: https://github.com/pymc-devs/pymc3/issues/1008
I just updated to the latest Theano 0.8 and pymc3 and this problem has disappeared for me. Strange thing though: while I installed pymc3 manually with setup.py install, it still complained that it wanted Theano 0.7. The install seemed to go OK though.
Yes, for me it also wants to install Theano 0.7 although I have the dev version, which is somewhat annoying. I simply disabled it in the setup script, although there must be a nicer way.
It's trying to pull 0.7 when you run pymc3's setup.py?
Yes it does.
Yes, it seemed to install fine and use Theano 0.8, but it was rather confusing.
I have to abort it, because when I let it install, my import uses the 0.7 version instead of the dev version. They made so many improvements in the current dev version that it is really significant to use it.
Ah great thx!
Fixed it. thanks!
Is it time to close this?
I haven't done extensive testing, but on some high dimensional problems that originally threw the recursion error, the problem has disappeared. So perhaps for now it is solved. :)
That sounds amazing. I'll close it but feel free to reopen if the problem persists with master pymc3 and theano.
Thanks for the recent bugfixes, guys. Also, the updates to the build dependencies mean I'm now running theano 0.8.0rc1, and either or both changes seem to have raised the threshold at which I was finding recursion errors.
EDIT: Okay, well, that does seem to have fixed it. I think I have a different bug though: with njobs > 1, the processes start (I'm viewing them in htop) and then they die without throwing an error. I assume the difference in 2 is that the model is already cached. It's tricky to replicate though, a bit of a Heisenbug!
I also still get my segmentation faults, even when creating all the Text backends in advance...
Oh! Really. Even with the latest pymc3 version, I am getting the same error with njobs=2.
multiprocessing.pool.MaybeEncodingError: Error sending result: '[<MultiTrace: 1 chains, 10 iterations, 2106 variables>]'. Reason: 'RuntimeError('maximum recursion depth exceeded',)'
trace = pm.sample(n_samples, step=step_func, start=start, njobs=n_chains, progressbar=False)
File "/home/user/.local/lib/python2.7/site-packages/pymc3/sampling.py", line 150, in sample
return sample_func(**sample_args)
File "/home/user/.local/lib/python2.7/site-packages/pymc3/sampling.py", line 282, in _mp_sample
**kwargs) for i in range(njobs))
File "/home/user/.local/lib/python2.7/site-packages/joblib/parallel.py", line 810, in __call__
self.retrieve()
File "/home/user/.local/lib/python2.7/site-packages/joblib/parallel.py", line 727, in retrieve
self._output.extend(job.get())
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[<MultiTrace: 1 chains, 10 iterations, 2106 variables>]'. Reason: 'RuntimeError('maximum recursion depth exceeded',)'
I have pymc3-3.0, numpy-1.11.0, Theano-0.8.1, scipy-0.17.0 installed. Anyone else facing the same issue in the latest version of pymc3?
By "latest pymc3 version" do you mean that you installed it from GitHub master? That is,
pip install -U git+https://github.com/pymc-devs/pymc3.git
I installed using
pip install --process-dependency-links git+https://github.com/pymc-devs/pymc3
Make sure you use the -U
flag or it may not update. I have not had this error since we closed this issue, so my first guess is that your update did not stick.
Oh.. Thank you so much for your quick response. I'll update using -U
flag and will get back. Thanks again!
Sorry @fonnesbeck, installing pymc3 with -U also leads to the same error. I even removed all the packages (pymc3, numpy, scipy, theano) from my machine and tried a fresh installation of pymc3 using pip install -U git+https://github.com/pymc-devs/pymc3.git. It also ended up in RuntimeError('maximum recursion depth exceeded',).
I have Python 2.7.6, pymc3-3.0, matplotlib-1.5.1, joblib-0.9.4, numpy-1.11.0, pandas-0.18.0, patsy-0.4.1, pydot_ng-1.0.0, pyparsing-2.1.1, scipy-0.17.0 and Theano-0.8.1 installed on my machine.
nvidia-smi gives the following details: NVIDIA-SMI 346.96, Driver Version: 346.96, 4 GPUs (0,1,2,3).
My .theanorc config is,
[global]
device = gpu
floatX = float32
assert_no_cpu_op = warn
[cuda]
root = /usr/local/cuda
[nvcc]
fastmath = True
[pycuda]
init = True
Is there anything else to be done?
Perhaps the GPU utilization is at fault? Have you tried with CPU?
Thanks @twiecki I will try with CPU and post my updates.
Setting device=cpu in .theanorc also raises RuntimeError('maximum recursion depth exceeded',).
Below is the snippet I am trying to execute:
import pymc3 as pm
import theano.tensor as T
import pandas

def tinvlogit(x):
    return T.exp(x) / (1 + T.exp(x))

pandas_df = pandas.read_csv("data.csv")
x_col1 = pandas_df['col1']
x_col2 = pandas_df['col2']
x_col3 = pandas_df['col3']
n_col3 = len(pandas_df['col3'].unique())

with pm.Model() as model:
    b_0 = pm.Normal('b_0', mu=0, sd=100)
    b_col1 = pm.Normal('b_col1', mu=0, sd=100)
    b_col2 = pm.Normal('b_col2', mu=0, sd=100)
    sigma_col3 = pm.HalfNormal('sigma_col3', sd=100)
    b_col3 = pm.Normal('b_col3', mu=0, sd=sigma_col3, shape=n_col3)
    for i in range(0, len(pandas_df)):
        p = pm.Deterministic('p', T.maximum(0, T.minimum(1, tinvlogit(
            b_0 + b_col1 * x_col1.at[i] + b_col2 * x_col2.at[i] + b_col3[x_col3.at[i]]))))
    y = pm.Bernoulli('y', p, observed=pandas_df.y)
    start = pm.find_MAP()
    step_func = pm.NUTS()
    trace = pm.sample(5000, step=step_func, start=start, njobs=2, progressbar=True)
pm.sample fails with RuntimeError('maximum recursion depth exceeded'). pandas_df is a pandas dataframe with columns col1 (decimal), col2 (decimal), col3 (integer between 1-10) and y (0 or 1), and it has 50000 rows.
You get the recursion error because your graph will be very long, as your loop runs 50k times, each time adding all the nodes. Although I don't really get the purpose of your model, I have the feeling you could vectorize it and get rid of the loop. The RVs have a shape parameter, so you can simply create vectors of the length of your data frame. The way you do it now, p is always overwritten and only the last sample of your dataframe goes into the cost. Or am I missing something?
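To illustrate the vectorization point with plain numpy (hypothetical stand-in columns; in the pymc3 model this means writing p once from whole columns instead of looping over rows):

```python
import numpy as np

# Hypothetical stand-ins for the dataframe columns in the snippet above
rng = np.random.default_rng(0)
n = 1000
x_col1 = rng.normal(size=n)
x_col2 = rng.normal(size=n)
x_col3 = rng.integers(0, 10, size=n)        # group index 0-9
b_0, b_col1, b_col2 = 0.1, 0.5, -0.3
b_col3 = rng.normal(size=10)

# Loop version: builds one expression per row (in Theano this is one
# chunk of graph per row, which is what blows the recursion limit)
loop = np.array([b_0 + b_col1 * x_col1[i] + b_col2 * x_col2[i]
                 + b_col3[x_col3[i]] for i in range(n)])

# Vectorized version: a single array expression over whole columns
vec = b_0 + b_col1 * x_col1 + b_col2 * x_col2 + b_col3[x_col3]

print(np.allclose(loop, vec))   # True
```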
Setting the njobs parameter to run multiple chains results in an error: