jcmgray / xyzpy

Efficiently generate and analyse high dimensional data.
http://xyzpy.readthedocs.io
MIT License

Can't pickle ... #11

Open AdrianSosic opened 4 years ago

AdrianSosic commented 4 years ago

Hi @jcmgray, I'm currently using your awesome package to automate my experiments and noticed a problem related to pickling certain data types. While the cloudpickle backend of joblib should work fine to handle, for example, lambda functions, I get an error when working with certain modules based on torch.

Here is a minimal example:

import xyzpy
import botorch
import torch

@xyzpy.label(['model'])
def fun(a):
    x = torch.tensor([[0.]])
    y = torch.tensor([[0.]])
    return botorch.models.SingleTaskGP(x, y)

combos = dict(
    a=range(10)
)

h = xyzpy.Harvester(fun, 'result')
c = h.Crop('test')
c.sow_combos(combos)
c.grow_missing()
c.reap()

It produces the following error:

_pickle.PicklingError: Can't pickle <function _HomoskedasticNoiseBase.__init__.<locals>.<lambda> at 0x14cd89e18>: it's not found as gpytorch.likelihoods.noise_models._HomoskedasticNoiseBase.__init__.<locals>.<lambda>
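
For readers without botorch installed, the same class of failure can be reproduced with the standard library alone. This is only a minimal sketch: `HoldsLambda` and `can_pickle` are hypothetical names made up for illustration, and the assumption is that the gpytorch object fails for the same reason, namely that it stores a lambda created inside `__init__`, which has no importable name.

```python
import pickle


class HoldsLambda:
    """Hypothetical stand-in for an object that stores a lambda made in __init__."""

    def __init__(self):
        # A lambda defined inside a method has no importable module-level name,
        # so pickle cannot store a reference to it.
        self.transform = lambda x: x + 1


def can_pickle(obj):
    """Return True if the standard pickle module can serialise obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # The exact exception type varies between Python versions.
        return False


print(can_pickle({"a": 1}))
print(can_pickle(HoldsLambda()))
```

Plain containers and numbers serialise without trouble; only the object carrying the locally defined lambda is rejected.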

Tested with Python 3.7.3 and

botorch==0.2.1
torch==1.6.0
xyzpy==1.0.0

After a short search, I found this related post: https://github.com/cornellius-gp/gpytorch/issues/907 A potential solution seems to be using dill instead of pickle. Do you think this option can be added to xyzpy?

For now, my workaround is to strip all problematic attributes from the object returned by the evaluated function once all internal computations have completed. However, it would of course be much nicer if xyzpy could handle these objects natively.

Kind regards, Adrian

jcmgray commented 4 years ago

Hi Adrian, thanks for the issue and glad xyzpy is being useful! It should be straightforward and seems useful to add a picklelib arg or something to Crop. I think the only functions called are dumps and loads.

Just as a quick first check you could try switching this line at the top of batch.py:

from joblib.externals import cloudpickle
# to -->
import dill as cloudpickle

and see if everything runs for you?

AdrianSosic commented 4 years ago

Hi @jcmgray, thanks for getting in touch. Unfortunately, your suggested change did not resolve the issue but instead raises the following error:

  File "/Users/M280152/Downloads/xyzpy/xyzpy/gen/farming.py", line 631, in Crop
    num_batches=num_batches)
  File "/Users/M280152/Downloads/xyzpy/xyzpy/gen/batch.py", line 226, in __init__
    self._sync_info_from_disk()
  File "/Users/M280152/Downloads/xyzpy/xyzpy/gen/batch.py", line 333, in _sync_info_from_disk
    farmer = None if farmer_pkl is None else pickle.loads(farmer_pkl)
ModuleNotFoundError: No module named '__builtin__'

Any thoughts on this?

jcmgray commented 4 years ago

OK that seems to be a separate problem - the farmer_pkl currently is pickled and unpickled by different libraries, which I am surprised currently works. That can be easily fixed.

The main problem is in fact not to do with pickling the function (which is what cloudpickle is currently used for), but with using joblib.dump to write the result inside the grow function, since I had assumed the result would always be numeric types, arrays, etc.

As an easier workaround than your current, you could simply pickle the return yourself:

    return dill.dumps(botorch.models.SingleTaskGP(x, y))

then unpickle on the other end.

And it might be nice to have this as a separate picklelib option as well.
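
The workaround of pickling the return value yourself can be sketched as a round trip, shown here outside of xyzpy. The names `fun` and `decode` and the placeholder dictionary are assumptions for illustration; in the real case the function would return the serialised model. The sketch falls back to the standard `pickle` module when `dill` is not installed, which is enough for the placeholder but not for the botorch object.

```python
import pickle

try:
    import dill as picklelib  # handles more object types, if installed
except ImportError:
    picklelib = pickle  # fallback so the sketch still runs


def fun(a):
    """Hypothetical stand-in for the labelled function: returns bytes, not the object."""
    result = {"a": a, "squared": a * a}  # placeholder for the real model object
    return picklelib.dumps(result)


def decode(blob):
    """Recover the original object from the stored bytes."""
    return picklelib.loads(blob)


blob = fun(3)
restored = decode(blob)
print(type(blob).__name__, restored)
```

Because the runner only ever sees a `bytes` value, it never has to serialise the problematic object itself; the cost is an explicit decoding step after the results are collected.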

AdrianSosic commented 4 years ago

Hi @jcmgray, I see. Is there a particular reason why you are using both cloudpickle and joblib instead of only one of them, i.e. would it be possible to also use dill (e.g. via setting an option) for the grow function?

In any case, I am using your suggested solution at the moment as a workaround, which is indeed much smarter than simply throwing away the objects ;-)

Thanks a lot for your help! Much appreciated!

jcmgray commented 4 years ago

The reasoning was I think as follows:

  1. cloudpickle is specialised for saving functions (so is used just for the function), it has some overhead
  2. joblib is specialised for saving arrays (& can't process functions), which is what I generally had envisioned would be returned by the function!

This logic might not be necessary anymore, & I defo agree it would be nice to be able to customize which picklers are used.

I can try and add this at some point (unless you want to!), but it might not be immediately.
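
As a rough idea of what a customizable pickler could look like, here is a minimal sketch. `ResultStore` is a hypothetical class and not part of xyzpy; the only assumption carried over from the discussion is that the backend needs to provide nothing more than `dumps` and `loads`.

```python
import json
import pickle


class ResultStore:
    """Hypothetical store (not part of xyzpy) with a swappable pickling backend."""

    def __init__(self, picklelib=pickle):
        # Any module or object exposing dumps() and loads() will do,
        # for example pickle, cloudpickle or dill.
        self.picklelib = picklelib
        self._blobs = {}

    def save(self, key, obj):
        """Serialise obj with the chosen backend and keep the result in memory."""
        self._blobs[key] = self.picklelib.dumps(obj)

    def load(self, key):
        """Deserialise the stored value with the same backend that wrote it."""
        return self.picklelib.loads(self._blobs[key])


store = ResultStore()  # default backend: standard pickle
store.save("batch-1", [1.0, 2.0, 3.0])
print(store.load("batch-1"))

# json also exposes dumps/loads, so it can be swapped in unchanged.
json_store = ResultStore(picklelib=json)
json_store.save("batch-1", [1.0, 2.0, 3.0])
print(json_store.load("batch-1"))
```

Keeping the same backend for writing and reading also avoids the kind of mismatch seen with farmer_pkl, where one library pickles and another unpickles.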