telegraphic / hickle

a HDF5-based python pickle replacement
http://telegraphic.github.io/hickle/

Implementation of Container and mixed loaders (H4EP001) #138

Closed hernot closed 3 years ago

hernot commented 4 years ago

At first:

@1313e with this pull request I want to express how much I appreciate the really great work you did for hickle 4.0.0, implementing the first step towards dedicated loaders.

Second: the reason why I am pushing so hard for the implementation of H4EP001.

The research conducted by the research group I'm establishing and leading is split into two tracks: a methodological one, dealing with the improvement and development of new algorithms and methods for clinical procedures in diagnostics and treatment, and a second one concerned with clinical research utilizing the tools built upon the methods and algorithms provided by the first track.

In the first track, Python, numpy, scipy etc. are the primary tools for working on the algorithms and investigating new procedures and algorithmic approaches. The work in the second track is primarily conducted by clinicians, so the tools provided for their research and studies have to be thoroughly tested and validated. This validation, at least the part which can be automated through unit tests, utilizes test data, including intermediate data and results produced by the Python programs and scripts developed while developing the underlying algorithms.

As the clinical tools are implemented in compiled languages which support true multi-threading, the data passed on has to be stored in a file format readable outside Python, ruling out pickle strings. Therefore jsonpickle was used to dump the data. Meanwhile the amount of data has grown so large that JSON files, even if compressed using zip, gzip or other compression schemes, are not feasible any more. NPY and NPZ files, which were the next choice, mandate a dependency upon the numpy library. Just for conducting unit tests, a self-contained file format for which only the corresponding library, without anything further, has to be included would be the better choice.

And this is the point where the HDF5 libraries and hickle come into play. I consider both the best and most suitable option I have found so far. And the current limitation that objects without a dedicated loader are stored as pickle strings can be solved by supporting the Python copy protocol, which I hereby offer to contribute to hickle.
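For context, the copy protocol here means the __getstate__/__setstate__ hooks that pickle and copy.deepcopy rely on. A minimal sketch, with a purely illustrative class, of the state a loader could traverse instead of storing one opaque pickle string:

```python
class Measurement:
    """Illustrative class only; any class exposing these hooks would do."""

    def __init__(self, samples, comment):
        self.samples = samples
        self.comment = comment

    def __getstate__(self):
        # the state dict could be mapped to HDF5 groups/datasets by hickle
        return {'samples': self.samples, 'comment': self.comment}

    def __setstate__(self, state):
        # invoked on loading to restore the instance from the stored state
        self.__dict__.update(state)
```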

Third: the content of this pull request.

Implementation of container-based and mixed loaders as proposed by hickle extension proposal H4EP001 (#135). For details see the commit message and the proposal #135.

Finally, I recommend:

Not putting this into an official release yet. Some extended tests using a real dataset, compiled for testing and validating software tools and components developed for use in the clinical track, showed that an important part is still missing to keep file sizes at a reasonable level. In particular, type strings and pickle strings for class and function objects currently take up most of the file space, letting dumped files quickly grow into the GB range even with HDF5 file compression activated, where the plain pickle stream requires just 400 MB. Therefore I recommend implementing memoization (H4EP002, #139) first before considering the resulting code base ready for release.

PS: loading of hickle 4.0.0 files should still be possible out of the box. Due to the lack of an appropriate test file, no test is included to verify this.

codecov[bot] commented 4 years ago

Codecov Report

Merging #138 (d4ef711) into dev (3b5efbb) will not change coverage. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##               dev      #138   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            9         9           
  Lines          592       654   +62     
=========================================
+ Hits           592       654   +62     
Impacted Files Coverage Δ
hickle/__version__.py 100.00% <100.00%> (ø)
hickle/helpers.py 100.00% <100.00%> (ø)
hickle/hickle.py 100.00% <100.00%> (ø)
hickle/loaders/load_astropy.py 100.00% <100.00%> (ø)
hickle/loaders/load_builtins.py 100.00% <100.00%> (ø)
hickle/loaders/load_numpy.py 100.00% <100.00%> (ø)
hickle/loaders/load_scipy.py 100.00% <100.00%> (ø)
hickle/lookup.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 3b5efbb...d4ef711.

telegraphic commented 4 years ago

I'll wait for some input from @1313e; I only note two small things:

hernot commented 4 years ago

I'm thinking of trying to get my hands on these two issues in the next days while we are waiting for input from @1313e. Not sure whether I will manage or fail.

Another thing: I have checked the two other pull requests, #136 and #137. They are basically the same: two minor adaptations of the load_astropy.py loader and the test_astropy.py test. In case you and @1313e agree, I would implicitly include the amendments of those two files in this effort, especially as I am anyway adding a test verifying that loading of 4.0.0 files is not broken by the proposed changes. What do you think?

1313e commented 4 years ago

#136 and #137 exist because I specifically asked for two PRs (I still have to merge them).

They will come before this one gets merged, so some additional merging will be required.

I have just finished an incredibly busy period, so I now finally have time to go and take a look at this. I will hopefully be able to do so this week.

hernot commented 4 years ago

@1313e: So shall I wait with the amendments for the uncovered line and the implementation of the not-yet-included verification of proper loading of 4.0.0 files, or would you prefer that these amendments are done before reviewing?

1313e commented 4 years ago

You can add them just fine, as long as they have nothing to do with #136 and #137, as you will need to merge the master branch into your branch before this PR can be accepted.

1313e commented 4 years ago

There you go, now those PRs are merged. I see that there are some conflicts, so they will have to be resolved first.

hernot commented 4 years ago

I propose the following approach:

1) First fix the uncovered line and come up with an appropriate test for verifying proper loading of 4.0.0 files.
2) Once 1) is stable, rebase and resolve the conflicts so that this PR no longer conflicts with dev.

Am I missing anything else that I should consider and cover additionally?

1313e commented 4 years ago

You first want to merge master or dev (both are the same) into your branch, and then fix everything else.

hernot commented 4 years ago

OK, I think the rebase worked. @1313e, thank you very much for the reminder.

hernot commented 4 years ago

I think I managed: full coverage and proper loading of hickle 4.0.0 and 4.0.1 files. I added a file created with master plus a pickled version of the same data, which the test uses to verify that the data is properly restored. I hope I did not miss any tricky obstacle; real-world use will reveal it. Numpy and pickle are fighting me as always, and why 3.8 breaks I have no idea, as it works in 3.6 and 3.7. For today it is already too late to solve. Dropping the pickle file would be possible, as the script creating it is also there, but that error in 3.8... sigh.

1313e commented 4 years ago

@hernot Having fun over there? ;)

hernot commented 4 years ago

Not any more; one last squashing and things are ready. I just had to crawl down a tight rabbit hole to find the one nasty error occurring in Python 3.8. And as I do not have a native Python 3.8 installation available here, I have to abuse Travis a bit for this.

The error seems to be caused by the fact that load_numpy.create_ndarray_dataset creates, on lines 102-105, a 'type' string from a lambda:

            # If not, create a new group and dump py_obj into that
            d = h_group.create_group(name)
            _dump(py_obj, d, **kwargs)
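            # pickling a lambda is not supported per the Python docs ("What can be pickled and unpickled?")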
            d.attrs['type'] = np.array(pickle.dumps(lambda x: np.array(x[0])))

which is not used any more in the new container-based scheme. Further, according to the Python docs (section "What can be pickled and unpickled?"), pickling lambdas is not supported; this now seems to be enforced in Python 3.8, or is just unluckily triggered in 3.8 by the new test verifying that hickle 4.0.0 and 4.0.1 files are properly loaded without the need for an additional dedicated legacy loader.

I will cleanup the code and squash the intermediate commits and repush one final time when done.

hernot commented 4 years ago

OK, final cleanup and squashing done.

NOTE: For testing purposes, and thus as an exception from the usual release procedure, I bumped the minor version number to 4.1.0. This allowed me to properly test that the fixes and workarounds needed when loading files created by hickle 4.0.x are activated, and only then. These are especially required when running under Python 3.8, which seems to be stricter about pickling and unpickling lambda functions than earlier versions, even more so when the functions to be unpickled have vanished from the module in which earlier versions of hickle defined them. This does not in any way preempt the decision upon the final version number during release. In case bumping the patch number instead of the minor number is preferred, I will provide any needed support in identifying which items would have to be amended to work properly.

Phew! Finally awaiting your review and your suggestions for improvement and amendment.

1313e commented 4 years ago

Well, I had better do some reviewing then, huh? I wonder when I will have the time to review something this big.

hernot commented 4 years ago

Take your time; the next steps will definitely be smaller than the switch to the loader-based design in 4.0.0 and the extension of the PyContainer concept initiated in 4.0.0 and continued now. Further, a big part is covered by the rather thorough (hopefully) unit tests of the individual modules (helpers, lookup, load_builtins, load_numpy, load_scipy, load_astropy, hickle), in addition to the tests already present. These are also the reason why I stumbled over the compression issue related to scalars and strings reported in #140 and implemented a proposed fix, plus some other things which fitted for demonstration purposes (cf. #125). EDIT: before the final release, together with memoization, it is necessary to go through all loaders and check for strings and scalars which might otherwise be missed by the compression-handling fix for #140 implemented by then. I will open an appropriate issue in a timely manner when done with the rest.

hernot commented 3 years ago

Hi, just a small ping to ask how the plans are going.

1313e commented 3 years ago

@telegraphic and I are currently discussing the changes introduced in this PR. I am not entirely sure how long that will take, as both of us are rather busy. However, please note that we are working on it.

hernot commented 3 years ago

As you might have noticed, I have also implemented proof-of-concept support for issue #125, which I hoped would make my data visible and exportable from Python using hickle. While working on a first proof-of-concept study for #139 based upon this pull request, I also tested the implementation for issue #125 in more detail. Below are my findings and the possible consequences.

In general it can be said that the HDF5 file format is not designed and optimized for storing plenty of dictionaries, lists and tuples containing heterogeneous data, especially if this data consists mainly of scalars, plenty of small numpy arrays, and strings. It is optimized for storing large homogeneous matrices, images and image stacks, or for referring to files containing them. Therefore, simply adding support for the copy protocol, which in most cases produces exactly that kind of data, is not a wise idea, as even very little data can produce files whose sizes are in no relation to the actual data stored.

A first solution would be to carefully design the __getstate__ and __setstate__ methods to ensure that the object state is represented such that only a few groups and datasets are required. This is not really feasible, for several reasons. Especially while trying and testing something, the content of objects changes very quickly, or dictionaries and lists are added and removed again, which makes it impractical to come up with a proper compact representation. And even where possible, in most cases the objects are stored in long lists and reference other objects. Consequently this approach will always, in the end, bloat the hickle files.

Therefore I suggest the following alternative approach to issue #125: add the possibility to register additional loaders with hickle, either by providing it an additional lookup path where the loader modules can be found, or by directly calling register_class from within a custom loader module which is explicitly imported by the Python script or module that calls hickle.dump and hickle.load. The hickle_loader_path approach would allow developing custom loader modules which do not need any manual pre-loading. The benefit of either approach: as long as the structure of the data and objects is not yet settled and heavily changing, it suffices not to register the loader functions or loader path with hickle, so that the corresponding objects are stored as pickle strings. As soon as the structure is fixed and an appropriate loader capable of sufficiently compacting the data is available, it can be made known to hickle for recoding existing data or storing new data. Further, if changes to the data and object layout become necessary, the loader can simply be deactivated to avoid any undesired side effects or damage to the data.
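A rough sketch of what such a self-registering custom loader module could look like, assuming the register_class signature introduced by this pull request; the module, class and base-type names are hypothetical, and the dump/load function signatures follow this PR's loaders but may differ in detail:

```python
# my_loaders.py -- explicitly imported by the script calling hickle.dump/load
import numpy as np
from hickle.lookup import register_class

class TrialResult:
    """Hypothetical object whose structure has settled."""
    def __init__(self, samples, label):
        self.samples, self.label = samples, label

def create_trialresult_dataset(py_obj, h_group, name, **kwargs):
    # one compact dataset plus an attribute instead of a pickle string
    ds = h_group.create_dataset(name, data=py_obj.samples, **kwargs)
    ds.attrs['label'] = py_obj.label
    return ds, ()

def load_trialresult_dataset(h_node, base_type, py_obj_type):
    # rebuild the object from the dataset and its attribute
    return py_obj_type(np.array(h_node[()]), h_node.attrs['label'])

register_class(TrialResult, b'trialresult',
               create_trialresult_dataset, load_trialresult_dataset)
```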

Therefore, independent of your discussion and review, before this pull request can finally be merged I will remove the naive support for the copy protocol again: revert the create_pickled_dataset function to simply call pickle.dumps, revert load_pickled_dataset to call pickle.loads, and remove PickledContainer, which was only necessary for the copy protocol support that the HDF5 file format cannot feasibly support. For the discussion of whether and how to support custom loader modules, I sketched H4EP003 (#145), which I have investigated in my private proof-of-concept branch; it will receive a dedicated pull request as soon as this pull request and the upcoming pull request for H4EP002 (#139) are approved by @telegraphic and merged into dev.

hernot commented 3 years ago

I'm in the middle of rebasing and removing support for the Python copy protocol. One last step is necessary to make the tests succeed.

1313e commented 3 years ago

@telegraphic Thoughts?

hernot commented 3 years ago

The implementation of Python copy protocol support is now removed, as indicated above and for the reason given when closing issue #125. An idea for how to reach the intended goal in an HDF5-friendly manner is sketched in H4EP003 (#145).

hernot commented 3 years ago

@telegraphic I know you are quite busy, but nevertheless, is it possible to decide on this in a timely manner? With each fix @1313e necessarily implements, rebasing this pull request and all dependent branches for upcoming pull requests becomes more complicated and tedious.

telegraphic commented 3 years ago

Hi @hernot and @1313e -- I'm taking a dive into the PR this week. I'm currently on paternity leave, so I have been even slower to respond than usual (first child, so a tough learning curve!).

For now, let me say @hernot I really appreciate all the effort and thought you've clearly put into this. Apologies for my tardiness!

hernot commented 3 years ago

@telegraphic no worries, everything is set and prepared here, so I am just waiting for your go.

telegraphic commented 3 years ago

Making some notes here as I go through:

Compare file structure 4.0.4 and 4.1.0

# Make a basic test file with 4.1.0 and 4.0.4
import numpy as np
import hickle as hkl

a = {'a': 1, 'b': 'hello', 'c': [1,2,3]}
b = np.array([1,2,3,5])
c = (a, b)
hkl.dump(c, 'hkl_410.hkl')

# (rinse and repeat for hkl_404.hkl with 4.0.4 installed)

Compare the two file structures:

 h5diff -v hkl_410.hkl hkl_404.hkl

file1     file2
---------------------------------------
    x      x    /
    x      x    /data
    x           /data/data0
    x           /data/data0/"a"
    x           /data/data0/"b"
    x           /data/data0/"c"
    x           /data/data1
           x    /data/data_0
           x    /data/data_0/'a'
           x    /data/data_0/'a'/data
           x    /data/data_0/'b'
           x    /data/data_0/'b'/data
           x    /data/data_0/'c'
           x    /data/data_0/'c'/data
           x    /data/data_1

group  : </> and </>
0 differences found
attribute: <HICKLE_PYTHON_VERSION of </>> and <HICKLE_PYTHON_VERSION of </>>
0 differences found
attribute: <HICKLE_VERSION of </>> and <HICKLE_VERSION of </>>
size:           H5S_SCALAR           H5S_SCALAR
position        HICKLE_VERSION of </> HICKLE_VERSION of </> difference
------------------------------------------------------------
[ 2]            0            1
[ 4]            4            0
2 differences found
group  : </data> and </data>
0 differences found
attribute: <base_type of </data>> and <base_type of </data>>
0 differences found
attribute: <type of </data>> and <type of </data>>
0 differences found

The main differences are data_0 --> data0, and dictionary items no longer having a trailing /data -- no strong feeling on the first, and the latter is definitely an improvement :+1:.

The difference in file format means 4.0.4 will not be able to load 4.1.0 files. It fails with a ValueError: Provided argument 'file_obj' does not appear to be a valid hickle file! ('int' object has no attribute 'name'). I suggest we add a check whether the HICKLE_VERSION in the file (e.g. 4.1.0) is greater than the installed version (comparing MAJOR and MINOR but ignoring PATCH), and if loading fails, print a 'try updating hickle to the latest version' message.
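A minimal sketch of such a check, assuming the HICKLE_VERSION root attribute visible in the h5diff output above (where exactly this would hook into hickle.load is left open):

```python
def check_hickle_version(file_version, installed_version):
    """Raise a helpful error if the file needs a newer hickle (MAJOR.MINOR only)."""
    file_mm = tuple(int(p) for p in file_version.split('.')[:2])
    installed_mm = tuple(int(p) for p in installed_version.split('.')[:2])
    if file_mm > installed_mm:
        raise ValueError(
            "File was created with hickle %s but %s is installed; "
            "try updating hickle to the latest version."
            % (file_version, installed_version))

# e.g. check_hickle_version(h5_file.attrs['HICKLE_VERSION'], hickle.__version__),
# decoding the attribute from bytes first if h5py returns it that way
```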

Check that 4.1.0 can still load 4.0.4

Yup (at least for this basic test file).

telegraphic commented 3 years ago

Hey @hernot and @1313e, overall I am happy for this to be merged, as it is a precondition on H4EP002 and H4EP003.

What changed and was it worth it?

Firstly, I note that this is a pretty major underlying change to how hickle works under the hood. Even though there are a lot of changes to the loading/dumping logic, changes to the actual file structure are minor. I like the changes to how dictionaries are dumped without extraneous /data, which was made possible by the changes to container loaders. The changes add a level of abstraction, to allow "the possibility to support additional container like objects" in the future. Put another way: the changes are supposed to help make it easy to store custom classes.

Part of the motivation for the changes in H4EP001 was issue #125, to do with the use of __getstate__ and __setstate__ for custom classes that @hernot uses in his research. After rather lengthy discussion, #125 was closed with the conclusion that "hdf5 data format is not really designed for storing dict, list and tuple structures containing vast amounts of heterogenous data especially if they are organized in a rather chaotic huge tree like structure". #145 offers an alternative approach for custom classes: a specification that if the class supplies __compact__ and __expand__ dunder methods, then hickle will be able to understand and store the class (see new issue #148).
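For orientation, a hypothetical illustration of that specification; the dunder names come from #145, but the exact signatures were still under discussion at this point:

```python
class Spectrum:
    """Hypothetical user class opting in to compact storage."""

    def __init__(self, frequencies, amplitudes):
        self.frequencies = frequencies
        self.amplitudes = amplitudes

    def __compact__(self):
        # return an HDF5-friendly state: few, homogeneous items
        return {'frequencies': self.frequencies, 'amplitudes': self.amplitudes}

    @classmethod
    def __expand__(cls, compact):
        # rebuild the original object from the compact representation
        obj = cls.__new__(cls)
        obj.__dict__.update(compact)
        return obj
```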

So, the big question: are the major changes in H4EP001 worth it? At first glance, the end functionality to the user has not changed!

A quick case study

To weigh this up, I found it useful to look at how the scipy sparse matrix support changed -- see diff here. Sparse matrices are stored in the hdf5 file as three datasets -- which need to be recombined into a single sparse array when loaded by hickle. Allowing this previously required the exclude_register in lookup.py. The new functionality implements

class SparseMatrixContainer(PyContainer):

which I think is a slightly better design pattern. However, there is some more complexity in lookup.py to maintain backward compatibility. The register_class method is now:

def register_class(myclass_type, hkl_str, dump_function=None, load_function=None, container_class=None):
    """ Register a new hickle class.
    Parameters:
    -----------
        myclass_type type(class): type of class
        hkl_str (str): String to write to HDF5 file to describe class
        dump_function (function def): function to write data to HDF5
        load_function (function def): function to load data from HDF5
        container_class (class def): proxy class to load data from HDF5
    Raises:
    -------
        TypeError:
            myclass_type represents a py_object the loader for which is to
            be provided by hickle.lookup and hickle.hickle module only

    """

Where container_class takes a subclass of PyContainer, such as SparseMatrixContainer. We will need good, clear documentation and examples for how to register your own class!
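To make the container_class path concrete, here is a hedged sketch modeled on SparseMatrixContainer; the append/convert hooks follow this PR's PyContainer design, but the exact signatures may differ:

```python
import scipy.sparse as sp
from hickle.helpers import PyContainer
from hickle.lookup import register_class

class CSRMatrixContainer(PyContainer):
    """Collects the sub-datasets of a csr_matrix group and rebuilds it."""

    def append(self, name, item, h5_attrs):
        # called once per loaded sub-item; collect them keyed by name
        if not hasattr(self, '_parts'):
            self._parts = {}
        self._parts[name] = item

    def convert(self):
        # called once the group is fully loaded
        return sp.csr_matrix(
            (self._parts['data'], self._parts['indices'],
             self._parts['indptr']),
            shape=tuple(self._parts['shape']))

# registration would pass the container instead of a plain load function:
# register_class(sp.csr_matrix, b'csr_matrix',
#                dump_function=create_sparse_dataset,  # existing dump function
#                container_class=CSRMatrixContainer)
```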

My conclusions

These changes add complexity, but they do promise to make some things easier in the future. I think, @hernot and @1313e, we can now agree that some data structures in Python are not easy to map to HDF5 optimally without some ugly code...

Overall I am supportive of merging this given H4EP002 and H4EP003, which extend H4EP001 with more tangible improvements. The improvement to the file structure when dumping dictionaries is also nice.

My apologies once again for the latency on the review. Thanks @hernot and @1313e for your patience. Small request: @hernot in the future, can you pretty please make smaller commits instead of one large one so it's easier to review? (I know this is difficult when refactoring, but it would be very helpful!)

1313e commented 3 years ago

I am still very sceptical of the usefulness of this PR. It contains many changes that I would rather not see go through (which cannot be undone due to the lack of this PR being split into several commits), and the feature implementation as a whole is not complete. Because of the latter, I think that this PR should not be merged into any branch of hickle until it contains a completely implemented feature. This will require H4EP002 and H4EP003 (apparently) to be fully implemented. I suggest that this PR is instead kept on @hernot's fork (given that they are the one working on this), and only after the aforementioned proposals have been implemented, that this PR can be merged into the main hickle repo.

hernot commented 3 years ago

NOTE: For testing purposes, and thus as an exception from the usual release procedure, I bumped the minor version number to 4.1.0. This allowed me to properly test that the fixes and workarounds needed when loading files created by hickle 4.0.x are activated, and only then. These are especially required when running under Python 3.8, which seems to be stricter about pickling and unpickling lambda functions than earlier versions, even more so when the functions to be unpickled have vanished from the module in which earlier versions of hickle defined them. This does not in any way preempt the decision upon the final version number during release. In case bumping the patch number instead of the minor number is preferred, I will provide any needed support in identifying which items would have to be amended to work properly.

@1313e @telegraphic that is why I wrote this note in one of my comments above. I am fine with it if you consider this and all the already prepared follow-up pull requests, which will introduce further, even bigger changes to the file format, to be major rather than minor changes, and thus reserve them for a hickle >= 5.0 release instead of just bumping the version 4 minor number. Thinking about it, that might be the wiser option anyway.

@1313e I am fine with first assembling all prepared pieces (#138, #139, #145, and clean-up and finalization) in a hickle 5 RC-1 proposal branch in my fork. Thereby it is very important to me that I do this in full agreement and coordination with you two, @telegraphic and @1313e.

So may I suggest that I create a Hickle-5-RC branch in my forked repo, add all prepared pull requests there as commits, and post here when it is ready for discussion, either in continuation of this discussion or as part of jointly reviewing the new branch. Meanwhile I suggest, as I already did once, prioritizing #141, which just makes hickle globally ignore any compression-related h5py keyword parameter and issue an appropriate warning, as @telegraphic suggested on issue #140. And we use the Hickle-5-RC branch to prepare hickle 5 without any interference with the current productive version of hickle 4.

1313e commented 3 years ago

Yes, that's fine with me.

1313e commented 3 years ago

Also, the compression thing I already fixed.

hernot commented 3 years ago

I would have appreciated also getting the OK from @telegraphic before closing.

telegraphic commented 3 years ago

Just following up: I see @1313e's point that major changes should have functionality improvements, and I see you've opened a new PR for a v5 bump -- all sounds good to me. I am more ambivalent about the code changes, so @1313e, when we have RC5 ready I'll be relying on you to identify areas that you flag for reversion / request changes.

1313e commented 3 years ago

Just following up: I see @1313e's point that major changes should have functionality improvements, and I see you've opened a new PR for a v5 bump -- all sounds good to me. I am more ambivalent about the code changes, so @1313e, when we have RC5 ready I'll be relying on you to identify areas that you flag for reversion / request changes.

Sounds good to me :+1: