telegraphic / hickle

a HDF5-based python pickle replacement
http://telegraphic.github.io/hickle/

support for python copy protocol __setstate__ __getstate__ if present in object #125

Closed hernot closed 3 years ago

hernot commented 4 years ago

[Suggestion]

I have several complex classes which, in order to be pickled by different pickle replacements like jsonpickle and others, implement the __getstate__ and __setstate__ methods. Besides being copyable for free via copy.copy and copy.deepcopy, pickling them is quite straightforward.

import numpy as np

class with_state():
    def __init__(self):
        self.a = 12
        self.b = {'love': np.ones([12, 7]), 'hatred': np.zeros([4, 9])}

    def __getstate__(self):
        return dict(a=self.a, b=self.b)

    def __setstate__(self, state):
        self.a = state['a']
        self.b = state['b']

    def __getitem__(self, index):
        if index == 0:
            return self.a
        if index < 2:
            return self.b['hatred']
        if index > 2:
            raise ValueError("index unknown")
        return self.b['love']

The above example is very much simplified, stripped of anything unnecessary. Currently these classes are hickled with a warning that the object is not understood, since a test for whether __setstate__/__getstate__ are implemented is missing. Admittedly both are handled by the pickle fallback, but the class ends up as a pickled string instead of a dataset, making it quite tedious to extract from the HDF5 file on a non-Python end such as C# or other languages.

Therefore I suggest adding a test for whether both methods are defined, and storing the class as its state dictionary instead of a pickled string. This would need some flag or other means of indicating that the dict represents the result of <class>.__getstate__ and not a plain Python dictionary. The test should run after the test for numpy data and before the test for Python iterables, as the above class appears to be iterable but isn't (see the small demonstration below).
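For instance, under the usual __getitem__-based iteration protocol (which stops on IndexError, not ValueError), iterating the example class above blows up instead of terminating:

    ws = with_state()
    print(ws[0], ws[1].shape, ws[2].shape)  # works: 12 (4, 9) (12, 7)
    list(ws)  # raises ValueError("index unknown") at index 3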

ADDENDUM: If somebody guides me through it, I would attempt to add the appropriate test and conversion function. But I would need at least some guidance on which existing methods would be best suited as template and inspiration, and which parts and sections of the h5py manual, the HDF5 spec, and the other hickle contributor documentation to read carefully.

1313e commented 4 years ago

@hernot Thanks for the suggestion. In the example you posted above, the class has the dunder method __getitem__, which is a characteristic of an iterable. Every class that implements this dunder method is an iterable.

However, I am not sure how exactly you would like this class to be hickled. It is, as you say, not an iterable, and thus there is no way to represent it in an HDF5-file besides simply pickling it.

hernot commented 4 years ago

Like pickle does: check whether the class implements the __setstate__ and __getstate__ methods from the Python copy protocol, and if it does, call obj.__getstate__(). That call is required to return a dictionary containing all information needed to rebuild obj and its state, given that the receiver has access to the module exporting the object's class and to the exported class itself. On unhickling, create the object by calling


   object_self = cls.__new__(cls)
   object_self.__setstate__(state)

with cls being the class for which the object has to be unhickled and state being the already-unhickled dictionary which, on hickling, was returned by the call to obj.__getstate__().

The main difference between hickling a plain Python dictionary and a class implementing the __getstate__ and __setstate__ methods is that the corresponding data group needs some metadata/attributes indicating that this is a dictionary describing the state and content of a class instance object of the given class, exported by the named Python module.

On unhickling an HDF5 group representing a dictionary, it has to be checked whether the metadata/attributes identify it as the state dictionary of an instance object of the specified class provided by the named module. If so, the module has to be imported and the object brought back to life using the lines above.

At least that would be the rough way I would do it. The tricky stuff, as always, is hidden in the details.

One last thought: the __getstate__ and __setstate__ methods, as part of the copy protocol, give the developer of the class a degree of control over which parts of the object should be hickled at all and which are just artefacts of initial construction, and thus not needed when copying, pickling, or unpickling an object. They also allow pickling classes which implement a kind of load-on-demand semantics, where not-yet-loaded data should not be loaded on pickling, as it is additional information or data not needed to accomplish the targeted application.

hernot commented 4 years ago

@hernot Thanks for the suggestion. In the example you posted above, the class has the dunder method __getitem__, which is a characteristic of an iterable. Every class that implements this dunder method is an iterable.

Yes, viewed from the outside it is an iterable, but its inner workings are different; trying to hickle it as a plain iterable would fail, as would copy.copy, copy.deepcopy, pickle.dump, pickle.load, and the like, if the class did not define the __getstate__ and __setstate__ methods ensuring proper state transfer.

1313e commented 4 years ago

Alright, I will have a look this weekend (hopefully).

hernot commented 4 years ago

Not sure what the code of conduct is in your team, but if it would help I could make a fork and come up with a suggestion submitted through a pull request from my personal GH account. For this I would need to know how this is expected to happen, and which manuals, design documents, etc. to read and understand first before touching anything.

1313e commented 4 years ago

Not sure what the code of conduct is in your team, but if it would help I could make a fork and come up with a suggestion submitted through a pull request from my personal GH account. For this I would need to know how this is expected to happen, and which manuals, design documents, etc. to read and understand first before touching anything.

Normally, a PR would be highly appreciated, which would basically consist of adding the functionality and writing tests for it. However, I am currently in the process of rewriting large parts of hickle for the future v4.0.0 release (I should actually get back to that), so code written for v3 will likely be incompatible with the code for v4. I therefore opt for making these changes myself, such that I can ensure they will work in both versions. I hope that is okay with you?

hernot commented 4 years ago

Okay. If something comes up in this process that I could help with, let me know.

telegraphic commented 4 years ago

@hernot looks like @1313e has this under control, but FYI we do have a code of conduct, and a basic contribution guide.

If there are more specifics you think would be helpful for contributing, please let us know and we can improve the contribution guide.

hernot commented 4 years ago

Hm, not sure I'm helpful beyond this, since I expected nothing else, given that I consider myself rather experienced. The only thing I would add to the contributors' guide for beginners is to read the existing code and tests as more-or-less good best-practice examples for structuring and formatting the code and formulating the unit tests.

But concerning me, I think I'm fine.

1313e commented 4 years ago

@hernot Can you check if you can properly use the hickle version currently on the dev branch?

hernot commented 4 years ago

Sure. Any special things to check, or just general verification? Should I just check whether I can run hickle and get the same results as with the current release version? Or should that version already be able to understand __getstate__/__setstate__? Is testing with an existing data set, which I unhickle and rehickle, sufficient, or should I also test with a new data set? In the latter case I would have to redo the testing on Monday when back in the office. Anything else should be doable tomorrow.

1313e commented 4 years ago

Check if you can use the version on the dev branch to properly dump and load the various complex classes you have made. It should be able to handle them properly now. I would say, go crazy with the testing until you either find a problem or are convinced it works properly. ;)

1313e commented 4 years ago

And, there is no hurry, as @telegraphic has to go through all the changes I made for v4 in #117 before it will be merged.

hernot commented 4 years ago

Hm, I get the following error with a simple custom pytest when trying to load a file created with the current release of hickle:

    def test_load_complex_hickle():
        starttime = time.perf_counter()
        testfile = glob.glob(os.path.join(datapath,filename))[0]
        print(testfile)
>       dataset = hkl.load(testfile)

<git-working-copies>/hickle/hickle/tests/test_stateful.py:20: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
<git-working-copies>/hickle/hickle/hickle.py:616: in load
    py_container = _load(py_container, h_root_group['data'])
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper
    ???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper
    ???
/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py:264: in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper
    ???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   KeyError: "Unable to open object (object 'data' doesn't exist)"

h5py/h5o.pyx:190: KeyError

dill version '0.3.1.1'

When I try to inspect the root object by simply iterating through items(), I get TypeError: Not a location id (invalid object ID).

Not sure whether I'm missing some additional library besides dill, h5py, and libhdf5, or whether there is still a bug in there.

EDIT: Inspecting the file with HDFCompass, it seems the group in root is called data_0 and not data.

See the attached images showing how HDFCompass reads the file.


If you like I can provide the file to you; nothing really fancy or top secret inside.

NOTE The file itself was created by collecting all contents inside an anonymous dict using:

hickle.dump(dict( ... ),filename)

Previously the data was stored inside a numpy .npz file, where one can pass all variables as parameters to numpy.savez and numpy.savez_compressed. The anonymous dict was simply used to mimic this behaviour.

EDIT: I think it is best if I provide you with the file, because it uses 'data_0' instead of 'data' for the base group name, and it uses 'type' instead of 'base_type' for the group type attribute. All in all it seems as if a renaming of attributes was undertaken for version 4 of hickle, or for h5py 2.10.0, whichever.

On debugging h_root_group.attrs["VERSION"] yields '3.4.6'

Loading files from previous versions should be possible in version 4 as well, at least for testing purposes, because my files are produced by production code, where I do not like to change the production environment.

PS: Why are you using try/except to check whether the version is a single number or in Major.Minor.Patch format? Why not simply use int(ver.split('.')[0]) to check the major version number? In case ver is the major only, split will return a list with a single item; in Major.Minor.Patch format the list will have 3 entries and item 0 will be the major. The benefit is that if some day the major starts to exceed 9, no nasty change to the version check is necessary, which currently assumes the major will forever stay between 0 and 9.
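For illustration, a minimal sketch of that suggested check (ver stands for the VERSION attribute string read from the file):

    def major_version(ver):
        # works for "4" as well as "4.0.0", and for majors >= 10
        return int(ver.split('.')[0])

    assert major_version('3.4.6') == 3
    assert major_version('4') == 4
    assert major_version('12.1.0') == 12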

1313e commented 4 years ago

@hernot As I mentioned earlier, release v4 will be incompatible with v3. Therefore, any file created with the current release cannot be opened with v4. What I meant by testing was to dump and load your objects using the development version (which is v4).

And yes, loading files of previous versions will be possible in v4. I simply haven't added the legacy support yet, as I'm still testing and developing.

EDIT: I see now that I may have been a bit confusing here. I meant to say here that v4 itself is incompatible with v3 (therefore, v4 files cannot be read with v3 and vice-versa), but that it will be able to read v3 files by simply using the scripts of v3. It simply cannot do so just yet as I haven't added support for it yet.

hernot commented 4 years ago

You must be kidding. That reading a file created by hickle >= 4.X with the old hickle package <= 3.X will not work is logical to me, and the only possibility to keep maintenance as low as possible. The same goes for hickle >= 4 not writing anything other than the V4 format. Anything else would be illogical.

But refusing to read V3 files with V4, so that one could at least recode/rehickle old V3 data into the new V4 format to keep it readable and usable, is a no-go for me, especially as my data is science data and I have the constraint that data stored in any format has to remain readable years later. Or do you know any possibility, any chance, to have both versions installed alongside each other and to import both in one Python script for converting a V3 file into V4? I don't think that is really possible.

In other words: the only reason I would switch to hickle instead of digging into h5py directly is that I can use it like pickle and unpickle, but with the advantage that I can export content, without any further change to the file, to C/C++/C#, Perl, and other languages, as bindings for the HDF5 library exist for all of them. This is also my main reason for requesting support for the __getstate__/__setstate__ protocol.

If reading V3 is not the long-term goal, issue a deprecation warning, or a warning that support for V3 and older hickle formats will be removed in the future and that rehickling to the V4 format or newer is therefore recommended. That I would consider the more sensible solution.

1313e commented 4 years ago

I think you misunderstood what I meant, or I am just horribly confusing. Once released, v4 will be capable of reading any properly made v3 file, just as how v3 can read v2 files. I simply haven't added the support for reading v3 files to v4 yet (it can read v2 files though), as I am still doing lots of testing. It will be there for release.

However, any file made with v3 that was made incorrectly, like one that attempted to store objects of classes with the signature you posted here, will obviously not be able to be read with v4. The information was never stored correctly, and thus there is no way for us to figure out how to restore it properly. v4 itself however should be able to dump and load them just fine.

It is this last point that I was asking for you to check, to make sure I implemented it correctly: Can the dev branch version of v4 currently properly dump and load objects of the classes you wish to dump? If so, then that is great. If not, then I need to do some more fixing.

Sorry for the confusion. Hope everything is clear now. :)

hernot commented 4 years ago

Hm, not sure I'm getting you right. The above file was created by dumping a dictionary with diverse content using hickle 3.4.6, which was installed through pip install hickle for Python 3.6. This file somehow seems to be broken because of its incorrect structure. That would mean hickle 3.4.6 does, under certain circumstances, not produce properly formatted hickle documents/streams. But how can it then be that hickle 3.4.6 is capable of reading this file without any flaw? Or did I just not get you right this time again? I just want to understand it, to take measures now, if necessary, for avoiding hassles and problems in the future, even if that means downgrading to the last hickle 3.X version which produces properly formatted files and also back-converting all files created by 3.4.6 to the last properly created format.

Which measures would you suggest taking in order to avoid unnecessary problems in the migration from hickle 3.X to 4.X in the future?

EDIT: I have rechecked my production code. 'data' is a valid key inside the dictionary dumped, which caused part of the impression that the file was incorrect. Still, the dataset name is 'data_0' and not 'data'. Is that correct and valid for files created by V3.X, or should it be called 'data' even in a proper V3.X file? EDIT: After some curious debugging it seems as if hickle 3.4.6 assumes any dataset to be an item within a list or other iterable structure, even if it obviously cannot be, as it represents the very object passed to dump.

1313e commented 4 years ago

@hernot Any file that was made with v3 and can be properly loaded with v3, will be loaded properly with v4 as well, once v4 gets released. It simply cannot do so at this very moment, as I haven't added the loading scripts for v3 to v4 yet. This is temporary, and also the reason why v4 isn't released yet (I am not done with it just yet). So, you don't have to worry about the next hickle release not being able to read your v3 files anymore. It will be able to do so, as it would be stupid if it cannot (as you commented yourself as well). Therefore, no measures are required from your side to deal with the transition from v3 to v4 once it releases.

data_0 is indeed the name that is given to the first dataset in v3 files. In v4, I changed this to be just data in case there is only one dataset required (see #44).

hickle both does and does not assume that every dataset is an item within an iterable. Every dataset created by hickle carries several attributes, that are used to determine whether that dataset is part of an iterable, is an iterable itself, is both, or is neither of those things. In case an object is not understood by hickle, its representation is pickled, and therefore it will never be seen as an iterable (but it can still be part of one of course).

1313e commented 4 years ago

What I was asking for was whether you could check if you can both dump and load objects of classes that have the structure you described in the first post, using v4 (the version that is currently on dev). v3 cannot handle them properly and it never will be able to do so either. Any file created with v3 that attempted to dump those objects is lost forever, as neither v3 nor v4 will be able to recover the original object.

Just as a reminder: v4.0.0 is not released yet, which is why it is on the dev branch, and why you cannot find it on PyPI. I am simply asking you to test for me, if the features you requested in this issue, are present in the development version of v4. I already have implemented a test in hickle v4 that uses the code that you gave me in the first post, and it can dump and load objects made with that class just fine. However, as you mentioned that you have several classes that use that structure, which I assume are much more complicated, I would like to know if it still works fine for them as well.

hernot commented 4 years ago

Hm, ok, what I can do is make artificial test data instead of testing with the real data I initially intended to use. By the way, that is not quite true: in V3 (at least 3.4.6), hickle passes everything it has no idea about through dill and simply stores the resulting pickle string, so the structure is not lost, but it cannot be passed to programs other than Python with pickle protocol > 2. I will report back to you whether the artificial data appears in the HDF5 file in the fashion I would expect, or whether it is still passed through dill to generate a pickle byte string.

Will try to do it tonight, possibly.

EDIT: Ok, I managed to get something not so artificial. hickle.dump and hickle.load seem to work, even though when trying to open the resulting h5 file in HDFCompass 0.6.0 I get UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte. I'm not sure whether this is really an error caused by some inaccuracy in hickle or rather a bug in HDFCompass, as I do not get any such error when simply calling h5dump -H hickle_dump_test.h5, h5dump -A hickle_dump_test.h5, or just h5dump hickle_dump_test.h5 on the command line. So for now I consider the complaints of HDFCompass unrelated minors, especially as it runs under Python 2 while I'm running hickle under Python 3.6.

After updating my outdated copies of my sources to best match production, everything seems ok. But what I observe is that hickling takes about 5 minutes, while the same seemed to be faster in V3. This may well be caused by running the test using

python3 -m pytest --pdb --trace hickle/tests/test_stateful.py

Thus another minor issue for now, especially as the hickled data set is about 400 MB in HDF5 file size.

But when dumping, the resulting V4 file still contains objects as pickle strings and not as HDF5 groups representing the object state dictionary. Looking at the code, neither __getstate__ nor __setstate__ is called yet. Can this be because my git upstream settings are not proper?

git branch
* dev
  master
git remote --verbose
origin  git@github.com:hernot/hickle.git (fetch)
origin  git@github.com:hernot/hickle.git (push)
upstream    git@github.com:telegraphic/hickle.git (fetch)
upstream    git@github.com:telegraphic/hickle.git (push)

Or is it simply that the code for this request has not been added yet?

1313e commented 4 years ago

@hernot Thanks for checking that for me. Yes, it does currently store it in pickled form. The reason for that is because I first wanted to make sure that classes of this type can be handled properly by hickle, regardless of how it actually stores them. In v3, classes of this type would sometimes be serialized, either properly or not, or be stored incorrectly as an iterable. It was therefore important for me to get this working first.

Now that you have confirmed that it indeed works properly, I can take a look at storing the actual state in the HDF5-file and write a routine that can recreate the object properly. This may take some effort from my side, as there are also classes that implement the __getstate__ and __setstate__ methods that should not have their state saved specifically, like NumPy arrays.

When I have a solution for it, I will let you know, so you can maybe rerun the tests for me.

PS: Keep in mind that the __getstate__ and __setstate__ methods are meant to return an object that is pickleable. Therefore, attempting to pickle an object that has these methods defined will pickle the output of __getstate__ instead, plus storing information on what the class of the object is. hickle will therefore have to go out of its way to store it in normal form instead. This is documented here: https://docs.python.org/3/library/pickle.html#object.__getstate__

hernot commented 4 years ago

Unless you still plan major changes to how things are handled, I would suggest simply keeping the order of the tests, which currently is:

1) test if numpy array or numpy-like item -> serialize as dedicated numpy dataset/group
2) test if dict item -> serialize as dict dataset
3) test if iterable sequence -> serialize either as array or, if it contains complex objects, as a dataset containing all items as sub-datasets with a counter appended to their names
4) otherwise save as pickled string

I would naively insert the following test between steps 2 and 3:

getobjstate = getattr(obj, '__getstate__', None)
setobjstate = getattr(obj, '__setstate__', None)
if getobjstate is not None and setobjstate is not None:

   # recall _dump on the state, with special attributes set for __class__ and __module__
   # and a __loadstate__ flag which would read True unless the returned state is falsy

On loading I would check whether a 'loadstate' or similar attribute is present, and if so would:

   if # group has attribute __loadstate__:
       # import the module indicated by the module attribute and create a new instance
       # of the class via instance = module.class.__new__(module.class),
       # as done by copy.copy and copy.deepcopy
       if # __loadstate__ is None or truthy:
            state = # load the state value
            # call instance.__setstate__ with the loaded state value
At least that would have been my first naive approach. Whether it would have led me anywhere, I can't tell.
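To make the idea concrete, here is a runnable toy version of that naive approach, with a plain dict standing in for the HDF5 group and its attributes (hickle's real _dump/_load are not shown; dump_with_state and load_with_state are made-up names for illustration):

    import importlib

    def dump_with_state(obj):
        state = obj.__getstate__()
        return {
            '__class__': obj.__class__.__name__,
            '__module__': obj.__class__.__module__,
            '__loadstate__': bool(state),
            'state': state,  # would be recursively hickled in reality
        }

    def load_with_state(group):
        module = importlib.import_module(group['__module__'])
        cls = getattr(module, group['__class__'])
        instance = cls.__new__(cls)  # uninitialized instance, as copy.copy does
        if group['__loadstate__']:
            instance.__setstate__(group['state'])
        return instance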

1313e commented 4 years ago

Alright, so I have looked into all of the specifics and such, and I am finding it quite hard to figure out how this could be useful in more than a select few cases.

So, the way that pickle (and dill, which is what hickle uses) works is, in the absence of any methods that modify this behavior, that it gathers all the attributes of a class object in a dict. It then creates a tuple containing the location of the class definition and this dict, and then serializes the tuple. When unpickling, it creates an uninitialized instance of the class and adds all the attributes back to it. dill modifies this slightly to save the file location of the class definition as well, such that the definition does not necessarily need to be in the current namespace already. This is one of the reasons hickle uses dill and not pickle.

The problem with this is that pickle assumes that all attributes must be saved in order to uniquely identify the class instance. There are many cases however in which only a few attributes are necessary for this, as all other attributes can be derived from them. This is where the __getstate__ and __setstate__ dunder methods come in (and to a lesser extent the __reduce__ dunder method as well, which NumPy uses, but that one is way more complicated). It allows you to return a pickleable object with __getstate__ that is pickled instead of the dict I mentioned before. When unpickling, this object is unpickled and provided to __setstate__. It is meant to be used to make pickling/unpickling more efficient, by allowing you to save less information and having a method at your disposal that is called whenever the instance is unpickled (this could, for example, be used to read a file that contains all information required, instead of having to pickle that file's information).
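As a small, self-contained illustration of that mechanism (plain pickle, with a made-up class whose second attribute is derivable from the first):

    import pickle

    class Squares:
        def __init__(self, n):
            self.n = n
            self.table = [i * i for i in range(n)]  # derivable from n

        def __getstate__(self):
            return {'n': self.n}  # serialize only what is needed

        def __setstate__(self, state):
            self.n = state['n']
            self.table = [i * i for i in range(self.n)]  # recompute on unpickle

    restored = pickle.loads(pickle.dumps(Squares(4)))
    assert restored.table == [0, 1, 4, 9]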

Having said all of that, I think your request is the following: If a class instance has the __getstate__ and __setstate__ dunder methods, retrieve the state of the instance and hickle the state instead. If the state is something that can be hickled without having to serialize it, then it could be viewed in the HDF5-file. When unhickling, hickle must unhickle the state and provide it to the uninitialized class instance to recreate the instance again. This is what you are asking right?

If so, then here is my problem with it: It basically means reimplementing about everything that pickle already does. The sole exception is that we would hickle the class instance state instead of pickling it. However, the class definition itself would still need to be pickled anyway, as otherwise hickle cannot store the class definition. Given that when the __getstate__ and __setstate__ dunder methods are not present, pickle stores all attributes of the specific class instance in a dict, which is an object that hickle can handle, I feel like it would then also be expected that ALL class instances have their attributes dict hickled and not pickled.

The point here is that saving this attributes dict (or the state if it can be hickled properly without serializing it) as an actual dict in the HDF5-file, is almost never going to be useful. For example, the following is the (reduced) state of a NumPy array (you can read here what __reduce__ does exactly):

>>> import numpy as np
>>> array = np.array([1, 2, 3])
>>> array.__reduce__()
(<function numpy.core.multiarray._reconstruct>,
 (numpy.ndarray, (0,), b'b'),
 (1,
  (3,),
  dtype('int32'),
  False,
  b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'))

The first two items state what function must be called with what arguments to set the initial state of the array. The third item in the tuple above is the object that is provided to the __setstate__ dunder method of the newly created NumPy array to repopulate the object.

I hope you can agree with me that hickling the state of this very simple NumPy array (or its entire reduced state), is in no way more readable than just saving the individual values. Especially because the actual values stored in the NumPy array are provided in the last item of the last tuple (the b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00' string), which is in serialized form anyway. The sole reason why we can view NumPy arrays in HDF5-files just fine, is because h5py converts NumPy arrays (and anything that can be converted to one) to the format that HDF5 requires. When reading the file, h5py converts the information back to a NumPy array (even if it wasn't one when the data was saved).

So, awaiting @telegraphic's opinion on this matter, I suggest you do the following: write a small function that specifically calls hickle on just the state of the class instance, and let it assign to an HDF5 group. In v4, I added the ability to have hickle work on open HDF5 files and groups, so you can add attributes to that group all you want. Then, when loading, you can unhickle the state and reinitialize the class yourself. An attribute you could add to the HDF5 group is, for example, which class this hickled object is the state of.
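A hedged sketch of that suggested workaround, assuming hickle v4's dump/load indeed accept an open h5py file plus a path argument as described above (dump_state/load_state and the attribute names are made up):

    import importlib
    import h5py
    import hickle

    def dump_state(obj, filename, path='/obj_state'):
        with h5py.File(filename, 'w') as f:
            hickle.dump(obj.__getstate__(), f, path=path)  # hickle only the state
            f[path].attrs['state_of_class'] = obj.__class__.__name__
            f[path].attrs['state_of_module'] = obj.__class__.__module__

    def load_state(filename, path='/obj_state'):
        with h5py.File(filename, 'r') as f:
            module_name = f[path].attrs['state_of_module']
            class_name = f[path].attrs['state_of_class']
            state = hickle.load(f, path=path)
        cls = getattr(importlib.import_module(module_name), class_name)
        obj = cls.__new__(cls)  # uninitialized instance, as copy.copy does
        obj.__setstate__(state)
        return obj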

hernot commented 4 years ago

See my comment above. From my experience, pickle, or at least some of its relatives like jsonpickle, does not pickle the class declaration itself at all (afaik). I too would not go that far and overdo things. I would start with a subset which can be handled by hickle with the fewest changes.

In other words, I would consider this request more than fulfilled if hickle called __getstate__ and dumped (by recursively calling the _dump method) the resulting state object instead of the original object, additionally remembering that the content is the state of an object of type __module__.__class__, so that on loading it has to try to import __module__ and call __setstate__ on a new, uninitialized instance of __module__.__class__, passing the loaded state object to it for restoring.

The only two cases I would handle in addition are state is None and isinstance(state, bool) and not state. The rest I would just document in the code, for the case that somebody else issues a related request asking for support of serializing class declarations by hickle. Even then I would give such a request lowest priority and first add support for custom serializing and deserializing methods, which would be called, if present, for a specific object hickle has no idea how to handle, before simply creating a pickle string.

1313e commented 4 years ago

@hernot I fully understand that that is what you would like to see, but the problem is that that is not expected behavior by the average user. The average user would expect that the class instance is reconstructed whenever they load a hickled object, regardless of what that class instance actually was. Therefore, hickle MUST save what the class definition was and reinitialize the class properly.

As I had stated above, if you want what you are asking for, you can do this quite simply using a few lines of code. Obtain the state of a class instance and dump that instead of the class instance.

hernot commented 4 years ago

Quite likely. I typically approach such big, seemingly complex problems stepwise. Instead of failing on the whole from the start, I would start with a subset in an initial release and extend from there in future minor releases.

Step 1 (in the first release): add support for dumping class states and recreating instances by loading the state into class instances created from classes imported from external third-party modules.

Step 2 (in one of the first minor releases): add support for dumping class definitions as such and restoring them as such, independent of anything else, thereby allowing users to call hickle.dump(class, filename) and class = hickle.load(file).

Step 3 (in one of the minor releases following the Step 2 release): try to restore the class definition and declaration from the dumped description and initialize its instance from the stored state, with fallback to importing from the module stored elsewhere, as introduced in Step 1.

This would imho render things a lot cleaner and better manageable. But that is the approach I would choose.

And calling __getstate__ and __setstate__ manually on each class instance, as you suggest, and then dumping the resulting state object is not an option, as I need to dump a Python list of more than 1000 elements full of objects whose states host additional complex objects intermixed with numpy objects and basic Python standard types. In other words, this would require reimplementing hickle's _dump and _load methods to flatten the hierarchy to a structure of basic Python types and numpy objects only before calling hickle.dump, and to restore the object hierarchy again manually after calling hickle.load. You see why I filed this request? I want to build on the work you and the other hickle developers have already done, a far better job than I would likely ever be able to accomplish myself.

hernot commented 4 years ago

(rotfl) (facepalm) (rotfl) Why am I overcomplicating this?

Why reinvent the wheel? Let's have the built-in machinery set up by copy, pickle, and copyreg.dispatch_table do most of the work and just walk the resulting reduce tuples (a short illustration of these tuples follows after the dump_restorable sketch below). They contain all information required to reconstruct what the user requested and expects back. Step by step, starting from use cases which are available either because they are built into Python or because we have access to some real-life or realistic examples, using the following approach (most of it is already implemented in hickle; some will need fine-tuning and extension over time):

0) get copyreg.dispatch_table for all registered reductor methods; attention, it will also contain numpy's default reductors, which are to be simply ignored as they are handled otherwise

1) check if object is numpy or numpy-like -> store as dataset
2) check if object.__class__ is in copyreg.dispatch_table -> call the registered method and store the resulting tuple as a restorable dataset (see below)
3) check if __reduce_ex__ is defined and call it with a value >= 2 (the value is the pickle protocol number) to ensure lists, tuples, and dicts are reduced properly, and store the resulting tuple as a restorable dataset (see below)
4) check if object is dict -> create dict group and dataset and walk its items
5) check if iterable -> store as iterable
6) check if __reduce__ is defined and store the resulting tuple as a restorable dataset (see below; this would cause a TypeError upon dict, tuple, and list type objects and some others not supported by pickle protocol 1)
7) something quite special, not yet understood -> ask pickle (dill) for help
8) item hickled, return

dump_restorable(tuple, object):
1) store, along with the data, the __module__ and __name__ of the restore method specified in the first tuple entry (index 0)
2) check the second entry in the tuple: it contains a tuple with the arguments to be passed to the restore method. For the standard reconstructor method it is a tuple containing as first item the class of the object, second the class object or a tuple of class objects representing the base classes of the class, and None in my classes (not sure what it would contain if not None). In this case simply record the __module__ and __name__ of all class objects mentioned. For anything else, add handling when appropriate use-case examples are received.
3) check if the length of the tuple is >= 6 and if the sixth entry is callable; if so, store its __module__ and __name__. It shall be called upon restore as a replacement for __setstate__ or the instance's __dict__.update() method.
4) check if the third entry is a dict; if so, store its content as dict dataset entries within the current object group (on restore, pass its content to the method specified by the sixth tuple entry if present, else to __setstate__ if defined (skip the __setstate__ call if the content reads False), and as a final fallback use the content to update the instance __dict__). If it is not a dict, either __setstate__ must be defined or the sixth tuple entry must be callable, receiving the content of the state structure (try to store as an iterable dataset if iterable, or as a simple data type otherwise; for now abort and let dill pickle the whole object).
5) check if the fourth entry is an iterator; if so, it is a simple list/tuple-like sequence: store its items as iterable/list dataset entries of the current object group (on restore, pass them to the list-like extend method if defined, or one by one to the list-like append method if defined)
6) check if the fifth entry is an iterator; if so, it is a dict-like sequence: store its items as dict dataset entries of the current object group (on restore, pass them to the dict-like update method if defined, or assign them one by one as key-value pairs to the dict-like object)
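For a feel of what these reduce tuples look like in the simplest case, a quick self-contained illustration (Point is a made-up example class; the exact contents vary with class and protocol):

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

    rv = Point(1, 2).__reduce_ex__(2)
    func, args = rv[0], rv[1]               # how to recreate the bare instance
    state = rv[2] if len(rv) > 2 else None  # fed to __setstate__ or __dict__.update
    print(func.__module__, func.__name__)   # e.g.: copyreg __newobj__
    print(state)                            # e.g.: {'x': 1, 'y': 2}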

Most of this is anyway already implemented in hickle. Instead of manually figuring out how to convert content into something serializable, use the existing copy and pickle machinery encoded in __reduce_ex__, __reduce__, and copyreg.dispatch_table: these return all information required to restore the content. Only numpy arrays would require special handling for now.

And yes, there remain special or rather new cases; they can be handled when appropriate use cases are provided by users. I would meanwhile introduce some memoization of already-hickled objects (like the memo parameter of copy.deepcopy) which stores, for each id(obj), a link to the resulting HDF5 group or dataset, to be used when an obj with the same id is referenced multiple times within the same hierarchy. But that is a different request.

How could I miss that when reading the description of the pickle module, and not just challenge it by calling obj.__reduce_ex__(2), [].__reduce_ex__(3), etc. out of curiosity? Dammit, why?! Why always the most complicated approach first?

And concerning the pickling of class objects, function objects, etc., roughly the last three paragraphs under the following link indicate how pickle handles them. They are basically non-picklable and just stored as named or fully-qualified named references: https://docs.python.org/3.8/library/pickle.html#what-can-be-pickled-and-unpickled

1313e commented 4 years ago

@hernot Although I appreciate all the thought you have put into it, my decision wasn't based on the difficulty of the implementation. Rather, it is based on the usefulness of the implementation. I had already made a plan on how to implement it if I wanted to (which did indeed involve copyreg). We will have to wait for what @telegraphic has to say about this.

telegraphic commented 4 years ago

@1313e and @hernot thanks, I'll need to find some free time to devote to this, clearly there are some subtleties that need to be considered.

I will say that this project started off with a limited set of supported types, and followed a pretty simple: "if this specific type, dump it to file using this function". At first glance this request seems in essence to ask for some level of duck typing (if it has __setstate__, then it quacks) -- certainly a lofty goal but may be hard to implement as @1313e has found.

1313e commented 4 years ago

I think the two of us had a pretty thorough discussion that you can read. :)

hernot commented 4 years ago

Hi, just to catch up and not miss anything: is there already a final hickle 4.0.0 release to which this could be added in a future minor? Or is the close just due to the merge of the linked pull request, with the general decision on this topic, independent of the hickle 4.0.0 release, still pending?

telegraphic commented 4 years ago

Ah, yes, this was closed automatically when doing the merge; I will reopen. Given the amount of discussion above I'd say this is certainly outside the scope of v4.0.0, which already has major changes.

To summarise my understanding of the issue: custom classes are currently pickled and stored as a serialised string. This is hard to read in other languages via HDF5 API.

I see three possible solutions:

I prefer the second or third approach (which may not be mutually exclusive)

1313e commented 4 years ago

I linked it, as I did solve the issue with classes implementing this structure not being able to be dumped and loaded properly. However, they are now fully serialized in order to achieve this.

@telegraphic My opinion on this matter is that something like this should not be added to hickle, for the reasons I discussed here: https://github.com/telegraphic/hickle/issues/125#issuecomment-641039286. I would like your opinion on this, but this was also partly the reason I kept the link.

hernot commented 4 years ago

@1313e @telegraphic I followed the release process of hickle 4.0.0 a bit and played with how this could be added to the clean and lean concept of containers and datasets, exporters and loaders. And I came to the conclusion that I agree with you, @1313e: adding this by brute force would just re-add highly specialized and difficult-to-maintain functions and classes to hickle, which release 4 just got rid of.

But by pushing this special issue aside for the moment and asking how it could be implemented sticking to containers and datasets, exporters and loaders only, and what would be required from hickle's side to achieve this, I believe I stumbled over something which could be more general than just this topic and could, for example, also open a route to issue #133 and others.

Therefore I suggest to: 1) put this issue on hold, or close it for the moment; reopening is possible at any instant anyway; 2) I wait until hickle 4.0 is officially released; 3) I think the idea through against the final release, and if I consider it worth discussing, I open an issue which clearly identifies the current state, what would have to be changed, what the benefits for hickle could be, and how it strengthens and improves the current design, document structure, and so forth, a bit like e.g. Python PEPs; 4) if it turns out in the end that you, @1313e, had a similar suggestion already planned for future releases and it is something that should be in hickle, then, as a next step after that, it should be a lot more straightforward to take this issue up again along those lines.

Till then I'm perfectly OK with this issue staying on hold or closed.

1313e commented 4 years ago

@hernot Sounds good, looking forward to it. v4 is currently in the final stages before release (@telegraphic was a bit too trigger happy and accidentally already merged v4 into master, so a last PR is required before release).

hernot commented 3 years ago

While working on pull request #138 and some preliminary proof-of-concept work for issue #139, I was forced to accept that the HDF5 data format is not really designed for storing dict, list, and tuple structures containing vast amounts of heterogeneous data, especially if they are organized in a rather chaotic, huge, tree-like structure. I have summarized my findings within pull request #138. Consequently, as any other solution yields more appropriate results compared to a naive implementation of copy-protocol support within hickle, I hereby close this issue in favour of H4EP003 (issue #145).