mila-iqia / blocks

A Theano framework for building and training neural networks

new pickler causes RuntimeError when pickling #737

Open dribnet opened 9 years ago

dribnet commented 9 years ago

The new main loop object pickler introduced in mila-udem/blocks#615 causes RuntimeError: maximum recursion depth exceeded on some downstream models. A current workaround is a custom Checkpoint subclass that skips pickling of the main loop object. A minimal suggestion, then, would be to add an argument to the Checkpoint class that allows pickling of the main loop object to be skipped in the same way.
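For reference, a minimal sketch of such a subclass (not the exact downstream code; the class name is illustrative, and the use of blocks.serialization.secure_dump and of the main_loop/path attributes set by Checkpoint are assumptions about the blocks API at the time):

    from blocks.extensions.saveload import Checkpoint
    from blocks.serialization import secure_dump


    class PartsOnlyCheckpoint(Checkpoint):
        """Dump selected main loop attributes, never the MainLoop itself."""

        def do(self, callback_name, *args):
            # Checkpoint sets self.main_loop and self.path; write the model
            # and the log to their own files and skip the full main loop,
            # which is what triggers the recursion error.
            for name in ('model', 'log'):
                secure_dump(getattr(self.main_loop, name),
                            '{}_{}'.format(self.path, name))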

dwf commented 9 years ago

The default recursion limit that Python sets is absurdly low for modern machines. The best way to work around it is to change the recursion limit (we provide a context manager for this in blocks.utils).
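A minimal sketch of that approach, assuming the context manager is exposed as blocks.utils.change_recursion_limit (check your blocks version for the exact name) and that main_loop is the MainLoop object being checkpointed:

    import pickle

    from blocks.utils import change_recursion_limit

    # Temporarily raise the recursion limit while the (recursive) pickler runs.
    with change_recursion_limit(100000):
        with open('main_loop.pkl', 'wb') as dst:
            pickle.dump(main_loop, dst, protocol=pickle.HIGHEST_PROTOCOL)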

rizar commented 9 years ago

Or you might want to put the following in .blocksrc:

recursion_limit: 100000

dribnet commented 9 years ago

Thanks - but setting the recursion limit has no effect in my case. The core issue seemed to be that the main loop object gets loaded down with something at least an order of magnitude larger than the model itself. I confirmed this by writing a non-recursive version of Checkpoint (here) which doesn't hit the RuntimeError. Unfortunately, the output file is then very large, and on my machine it takes 30 minutes to read or write a single main loop object to disk (serialization was taking longer than training), so I scrapped that effort.

If you'd like more background on the code that triggers this, see jbornschein/draw#16, where it came up, or the currently proposed workaround in jbornschein/draw#18, which includes a Checkpoint subclass that simply skips pickling the main loop object.

dribnet commented 9 years ago

Actually, my attempt used sys.setrecursionlimit() with absurdly high numbers (I think I tested up to 10,000,000) - would that be equivalent? I'll run a quick test setting it in .blocksrc instead to verify.

dribnet commented 9 years ago

So when I put recursion_limit: 10000000 in my .blocksrc, the program still halts prematurely at the same place - but now it aborts silently instead of raising an exception. Admittedly strange - but I tested this a few times, so maybe it's now running out of memory or hitting some other error condition.

dwf commented 9 years ago

The way Python handles the stack is completely idiotic, unfortunately. If the recursion limit you set results in the stack growing beyond what the process can actually allocate for its stack, the interpreter just crashes.

So it sounds like whatever you're trying to pickle has a really, really deep object graph, or there is some sort of cycle that is tripping up our pickler.
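One common mitigation, not mentioned in this thread, is to run the dump in a worker thread that is given a larger C stack, so that a raised recursion limit actually has room to recurse; a rough sketch with illustrative, platform-dependent sizes:

    import pickle
    import sys
    import threading

    def dump_in_big_stack_thread(obj, path, limit=100000,
                                 stack_bytes=256 * 1024 * 1024):
        # threading.stack_size must be called before the thread is created;
        # very large values can raise ValueError on some platforms.
        threading.stack_size(stack_bytes)

        def worker():
            sys.setrecursionlimit(limit)
            with open(path, 'wb') as dst:
                pickle.dump(obj, dst, protocol=pickle.HIGHEST_PROTOCOL)

        thread = threading.Thread(target=worker)
        thread.start()
        thread.join()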

rizar commented 9 years ago

Can you please check whether pickling still needs an enormous stack if you skip the custom persistent-id machinery in blocks/serialization.py by passing cPickle.dump as the dump function?
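A quick way to run that check (assuming Python 2, as used in this thread, and that main_loop is the MainLoop instance from the training script):

    import cPickle

    # Plain cPickle, no persistent_id hook: does this still blow the stack?
    with open('main_loop_plain.pkl', 'wb') as dst:
        cPickle.dump(main_loop, dst, protocol=cPickle.HIGHEST_PROTOCOL)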

dmitriy-serdyuk commented 9 years ago

@dribnet, basically the main loop contains three parts: the log, the iteration state, and the model with its parameters.

dribnet commented 9 years ago

@rizar - yes, I've confirmed that the issue happens with cPickle.dump as well.

@dmitriy-serdyuk - The model and log pickle fine; it's something else in the main loop that's blowing up. As mentioned in the issue, this strangely only happens for certain training parameters - so I don't think I'm saving the dataset.

If it's helpful, I can upload a large pickled main loop created with my custom non-recursive pickler later today and we can do an autopsy. But there's no urgency on my end since we have a pretty good workaround that pickles the model and log, but skips writing out the main loop itself.

rizar commented 9 years ago

yes I've confirmed that the issue happens with cPickle.dump as well

Weird!

I would appreciate it if you tried pickling each of the main loop's attributes and told us which of them causes the infinite recursion. The main loop itself is a very lightweight class.
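A small diagnostic sketch along those lines (main_loop is whatever your training script builds; the RuntimeError caught is the recursion-depth error from above):

    import cPickle

    # Try each attribute of the main loop on its own and report which one
    # exhausts the recursion limit.
    for name, value in sorted(vars(main_loop).items()):
        try:
            cPickle.dumps(value, protocol=cPickle.HIGHEST_PROTOCOL)
            print('{}: pickles fine'.format(name))
        except RuntimeError as exc:
            print('{}: {}'.format(name, exc))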

jmugan commented 8 years ago

I'm getting this too. I guess this is why the custom checkpoint was written for the machine_translation example?