lisa-groundhog / GroundHog

Library for implementing RNNs with Theano
BSD 3-Clause "New" or "Revised" License

Possible to save compiled graph to disk? #33

Closed ehasler closed 8 years ago

ehasler commented 9 years ago

Hi,

I have set up an RNN using GroundHog that has a larger graph than the original RNN-search model, and it takes quite a long time to compile (more than an hour, depending on the settings). I assume this is not already implemented, but is it possible to save the compiled graph to disk somehow? I am running on a cluster with a 12-hour time limit, so it's a bit wasteful to recompile on every restart.

Cheers, Eva

nouiz commented 9 years ago

Hi,

Update Theano to the development version. There have been updates that could help you. Are you sure the time is spent in compilation and not in the theano.grad() call? theano.grad() has been sped up since the last release.

You can pickle the output of theano.function(). By default, everything will be recompiled when you unpickle, but you can use this Theano flag to skip part of the recompilation:

http://deeplearning.net/software/theano/library/config.html#config.reoptimize_unpickled_function

But if you update Theano and use that flag, you need to re-pickle the function to get all the new optimizations.
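
For concreteness, here is a minimal sketch of that round trip (the file name `train_fn.pkl` and the toy function are illustrative):

```python
import cPickle

import theano
import theano.tensor as T

# A toy compiled function standing in for the expensive training function.
x = T.matrix('x')
f = theano.function([x], T.nnet.sigmoid(x))

# Save the compiled function to disk.
with open('train_fn.pkl', 'wb') as f_out:
    cPickle.dump(f, f_out, protocol=cPickle.HIGHEST_PROTOCOL)

# Later (e.g. after a cluster restart), reload it. With the flag
# reoptimize_unpickled_function=False (set via THEANO_FLAGS or .theanorc),
# Theano skips re-optimizing the graph during unpickling.
with open('train_fn.pkl', 'rb') as f_in:
    train_fn = cPickle.load(f_in)
```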


rizar commented 9 years ago

@nouiz , I am not sure that True is the right default value for this configuration option. I think most people expect unpickling to return verbatim what was pickled. Also, I believe many people (myself included) did not know about this option at all and suffered a lot while waiting for their functions to unpickle.

That hurts all Blocks users pretty badly, because in Blocks everything is saved in one pickle file: model parameters, Theano functions for optimization, training log. In order to explore one of these things, people have to wait until the functions are recompiled.

@bartvm, did you know about http://deeplearning.net/software/theano/library/config.html#config.reoptimize_unpickled_function?

bartvm commented 9 years ago

I imagine that with that turned off by default you might have a lot of issues when unpickling after the environment has changed (e.g. cuDNN not available, moved from GPU to CPU, etc.), which would be a good reason to make True the default.

That said, I agree that intuitively I (and I think many people) expect pickling to simply be the serialization and deserialization of objects. So after unpickling, I want back the exact object I pickled, as far as that is possible. Python's pickling is extensible so that you can implement serialization support for non-standard objects, and I agree with @rizar that it's not good practice to actually change the serialized objects unless absolutely necessary, i.e. by reoptimizing and recompiling graphs. I think it would be better to have separate saving/loading functions especially for Theano objects (like NumPy's save, savez, etc.) instead of trying to shoehorn all this functionality into Python's pickling protocol.
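
For illustration, a hypothetical sketch of such dedicated helpers; `save_function` and `load_function` are made-up names, not an existing Theano API, and this assumes the `reoptimize_unpickled_function` flag can be toggled at runtime:

```python
import cPickle

import theano

def save_function(fn, path):
    # Serialize a compiled Theano function to disk.
    with open(path, 'wb') as f:
        cPickle.dump(fn, f, protocol=cPickle.HIGHEST_PROTOCOL)

def load_function(path, reoptimize=False):
    # Load a compiled Theano function; the recompilation behaviour becomes
    # an explicit argument instead of a global flag.
    old = theano.config.reoptimize_unpickled_function
    theano.config.reoptimize_unpickled_function = reoptimize
    try:
        with open(path, 'rb') as f:
            return cPickle.load(f)
    finally:
        theano.config.reoptimize_unpickled_function = old
```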

nouiz commented 9 years ago

I'm open to discussing a change of the default value. But what exactly should we do if the environment changed (like a CPU or GPU added/removed)?

If the GPU is removed, should we recompile, warn, or raise an error? What if a GPU is added?

Also, if people update Theano, I was expecting that they would want the new optimizations applied, but it seems your expectation is different. Can both of you confirm that you expect us not to reoptimize when Theano gets updated? If so, we could discuss changing the default on the Theano mailing list.

About a special function instead of pickling, the only advantage I see is passing parameters in a different way than through Theano flags. We could wrap this in a helper function if you think it is useful, but I don't see much difference. Did you have other reasons to suggest not using pickle?


rizar commented 9 years ago

I think that by default unpickling should crash when the environment has changed. Observing this crash, the user will google the error message and learn that, in order to unpickle in a changed environment or to apply new optimizations, they should set reoptimize_unpickled_function=True. Or even better, the crash message can contain a hint to look at this option. Also, when the Theano version has changed, the unpickler can raise a warning that recompiling would perhaps speed up the execution of the unpickled function.

@bartvm , your idea to move this functionality into another function is very clean from a design perspective, but I am afraid it is a bit impractical. It is highly convenient here to be able to change the behavior without changing the code, which is what environment variables allow.

rizar commented 9 years ago

A small addition to my post above: if we decide to keep the option True by default, we could however add a warning so that people know what is happening:

```
WARNING: your function was recompiled. If you want to use the pickled
function as is, set 'reoptimize_unpickled_function=False'.
```

ehasler commented 9 years ago

Hi,

theano.grad() takes about 7 minutes, but most of the time is spent in the "Compiling grad function" step. I will try pickling the output of theano.function() for that step.

Thanks!

nouiz commented 9 years ago

Eva Hasler, what do you think the default should be? What did you expect at first?


ehasler commented 9 years ago

I guess I would expect the default to be False and to set it explicitly to True to reoptimise. On the other hand, if a warning is issued as suggested above, it would probably be fine to have True as the default.

ehasler commented 9 years ago

I have updated Theano to the development version, but now pickling the training function does not work anymore:

File "~/code/GroundHog/groundhog/trainer/SGD_adadelta.py", line 157, in init cPickle.dump(self.train_fn, f_out) RuntimeError: maximum recursion depth exceeded while pickling an object

With version 0.6.0 the pickling seems to work, but when I use the unpickled function I get a shape mismatch error. Any idea what I could be doing wrong?

rizar commented 9 years ago

Python's default recursion limit is indeed very low. Try `sys.setrecursionlimit(10000)`.
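
A minimal sketch of that workaround, assuming `train_fn` is the compiled function being pickled:

```python
import sys
import cPickle

# The default limit (1000) is easily exceeded by large Theano graphs.
sys.setrecursionlimit(10000)

with open('train_fn.pkl', 'wb') as f_out:
    cPickle.dump(train_fn, f_out, protocol=cPickle.HIGHEST_PROTOCOL)
```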

nouiz commented 9 years ago

I have opened the discussion about the default value:

https://groups.google.com/d/msg/theano-users/sNWVUFtkqok/2-Ht7WmDSUsJ


ehasler commented 9 years ago

OK, pickling works when I change the recursion limit (and with the Theano dev version), but unpickling does not:

```
self.train_fn = cPickle.load(f_in)
...
File "~/tools/Python-2.7.6/BUILD/lib/python2.7/site-packages/Theano-0.7.0-py2.7.egg/theano/gof/op.py", line 730, in make_thunk
    if self._op_use_c_code:
AttributeError: (AttributeError('The following error happened while compiling the node',
    Gemm{no_inplace}(<TensorType(float32, matrix)>, TensorConstant{1.0}, <TensorType(float32, matrix)>,
    R_back_enc_transition_0_copy0, TensorConstant{1.0}), '\n',
    "'Gemm' object has no attribute '_op_use_c_code'")
```

rizar commented 9 years ago

I would recommend making sure that you are not having trouble with an outdated Theano cache, by running `theano-cache purge`.

nouiz commented 9 years ago

If that does not fix it, tell me. I know a way to fix it.

Replace that line with:

```python
if getattr(self, '_op_use_c_code', theano.config.cxx):
```

But if clearing the cache works, that is better.


ehasler commented 9 years ago

Hi,

clearing the cache did not help, but changing that line in the code does. So with the original version of RNN-search I can pickle and unpickle the training function with Theano 0.7.0. However, I still have a problem with pickling when using my modified network. It compiles and runs fine without pickling, but using the pickled function gives me this error:

```
...
File "/home/ech57/tools/Python-2.7.6/BUILD/lib/python2.7/site-packages/Theano-0.7.0-py2.7.egg/theano/scan_module/scan_op.py", line 777, in rval
    r = p(n, [x[0] for x in i], o)
File "/home/ech57/tools/Python-2.7.6/BUILD/lib/python2.7/site-packages/Theano-0.7.0-py2.7.egg/theano/scan_module/scan_op.py", line 766, in
    self, node)
File "scan_perform.pyx", line 356, in theano.scan_module.scan_perform.perform (/home/ech57/.theano/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.6-Carbon-x86_64-2.7.6-64/scan_perform/mod.cpp:3605)
File "scan_perform.pyx", line 350, in theano.scan_module.scan_perform.perform (/home/ech57/.theano/compiledir_Linux-2.6-el6.x86_64-x86_64-with-redhat-6.6-Carbon-x86_64-2.7.6-64/scan_perform/mod.cpp:3537)
ValueError: Shape mismatch: x has 1 rows but z has 2 rows
Apply node that caused the error: Gemm{no_inplace}(<TensorType(float32, matrix)>, TensorConstant{1.0}, <TensorType(float32, matrix)>, G_dec_transition_0_copy01, TensorConstant{1.0})
Inputs types: [TensorType(float32, matrix), TensorType(float32, scalar), TensorType(float32, matrix), TensorType(float32, matrix), TensorType(float32, scalar)]
```

So for some reason I am not getting back the original function. Is this an indicator that there is something not quite right with my training function?

ehasler commented 9 years ago

Actually, the speedup with the new version of Theano is quite impressive: compiling the function now takes about 13 minutes instead of an hour :) so maybe recompiling it is OK after all.

ehasler commented 9 years ago

Hi,

just to be sure, is there some mechanism in the pickling of a Theano function that assumes a certain kind of network (and that's why it fails with mine)? If that's the case, I probably won't bother, but if not, I should find out why it is going wrong.

rizar commented 9 years ago

As far as I know, any Theano function should be picklable. Maybe that's a bug you stumbled upon, in which case any effort to localize it, that is, to find which change to the computation graph makes your function unpicklable, would be appreciated :)

nouiz commented 9 years ago

Could you do a git bisect to find the commit that introduced the problem?


nouiz commented 8 years ago

I'll close this issue as there has been no activity.