telegraphic / hickle

a HDF5-based python pickle replacement
http://telegraphic.github.io/hickle/

Hickle 5 rc #149

Closed hernot closed 2 years ago

hernot commented 3 years ago

Ok, here we go: as agreed when discussing PR #138, here is the fully assembled hickle-5-RC branch. In case you prefer to review directly within my repo, feel free to close this PR again; I'm fine with whatever you prefer. Not sure if all commits reported below are really visible from within this branch, or whether the log below is an excerpt of the git reflog.

I also expect AppVeyor and Travis to complain a bit due to the tox-related changes; I would be surprised if things worked on the first try. Anyway, I'm crossing fingers.

The reason for the Python 3.5 failure is known (I can't test it here any more, lacking a local Python 3.5). The fix is easy, just one change in one line, which I will do if support for Python 3.5 shall be kept beyond hickle 4. No problem at all.

With the astropy warnings I would need your help, as we do not use Astropy here, so I have no clue how to fix them.

1313e commented 3 years ago

What is commit d1760ee doing in here? That is a commit from over 2 years ago and has no relevance at all to the current state of hickle.

hernot commented 3 years ago

Shit, definitely some rebase error; not sure how to get it out again. It shouldn't be there. Too late for today.

1313e commented 3 years ago

Also, given that this would be hickle v5, it is not necessary that the base code can handle v4 files. Support for v4 files can be added in the same way as v4 currently handles v3 files.

hernot commented 3 years ago

It does, and the differences are small enough that they can easily be handled by stacking the few specialized loader functions on top of the default loader functions, like compact_expand, which is also implemented as an optional loader that on request is stacked on top of all others. Further, I fear that the issue I observed only when running tests on Python 3.6 and 3.7 in connection with h5py 3 would not vanish by pulling in hickle 4 files as dedicated legacy loading. And this way users who have not yet had time to convert their files from hickle 3 to 4 can still do so using a matured hickle 5. Thus, unless there are severe obstacles and issues against it, I would prefer to keep support as implemented ;-)

1313e commented 3 years ago

It is not like v5 will release any time soon, as many, many changes have to be made to this before it can be merged into master.

hernot commented 3 years ago

I didn't expect it to be, and as long as I'm able to keep in sync with your changes on dev, it should be quite easy for me to take care of the amendments necessary to reach a releasable state. Just let me know where you have your doubts, where you think improvement is necessary, or where things have to be pushed further than done. I'm open to any input and hints. The only thing I want to avoid is that sooner or later there exist two incompatible versions of hickle, the official one and mine; even if I start to use mine for now, all findings will be reported here and go back into the branch.

telegraphic commented 3 years ago

Thanks @hernot, looking forward to this coming to fruition.

hernot commented 3 years ago

Looks like things change these days. Numpy 1.20 introduces changes which prevent numpy.dtype classes like numpy.int16 from being pickled, and the solution will not be released before numpy 1.21 (for details see the commit message). The latest hdf5.dll (1.12) for Windows is only available in 64 bit, no more 32 bit, which would be a plausible explanation why the 32-bit AppVeyor runs from Python 3.6 on fail while the 64-bit runs finish properly. For Python 3.5 it seems that h5py still recommends 1.10 or 1.8, for which 32-bit Windows DLLs are available.

Not sure which is better:

hernot commented 3 years ago

Just to let you know, I'm currently amending commit https://github.com/telegraphic/hickle/pull/149/commits/940710c6f3c954f5a43f0a56fe5444c6542cf6eb to finally get rid of the copy-protocol mimicry, as it turns out that it is not necessary to achieve what is intended by H4EP003 #145 (see the latest edits of #145). As soon as I'm finished with the amendment I will take care of the rest, for which I would appreciate advice on whether to limit versions on win32 or whether to drop win32 support in the proposed hickle 5 release candidate.

1313e commented 3 years ago

As 32-bit Windows versions are still widely used, I think support for them should not be dropped. However, if h5py no longer supports it, we might have to drop it as well.

hernot commented 3 years ago

I will check how the win32 issue can be solved with the least effort and maximum gain.

codecov[bot] commented 3 years ago

Codecov Report

Merging #149 (8bae7bb) into dev-v5 (3b5efbb) will not change coverage. The diff coverage is 100.00%.

:exclamation: Current head 8bae7bb differs from pull request most recent head 8c2d5eb. Consider uploading reports for the commit 8c2d5eb to get more accurate results.

@@            Coverage Diff             @@
##            dev-v5      #149    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            9        12     +3     
  Lines          592      1079   +487     
==========================================
+ Hits           592      1079   +487     
Impacted Files Coverage Δ
hickle/__init__.py 100.00% <100.00%> (ø)
hickle/__version__.py 100.00% <100.00%> (ø)
hickle/fileio.py 100.00% <100.00%> (ø)
hickle/helpers.py 100.00% <100.00%> (ø)
hickle/hickle.py 100.00% <100.00%> (ø)
hickle/loaders/__init__.py 100.00% <100.00%> (ø)
hickle/loaders/load_astropy.py 100.00% <100.00%> (ø)
hickle/loaders/load_builtins.py 100.00% <100.00%> (ø)
hickle/loaders/load_numpy.py 100.00% <100.00%> (ø)
hickle/loaders/load_pandas.py 100.00% <100.00%> (ø)
... and 4 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 3b5efbb...8c2d5eb.

hernot commented 3 years ago

Ok, finally 32-bit Windows is fixed and the install instructions in README.md are amended (see commit message).

Issue was/is:

To conclude:

telegraphic commented 3 years ago

@hernot Just checking this is ready for review now?

hernot commented 3 years ago

@telegraphic yes it is.

telegraphic commented 3 years ago

Tests

Unit tests pass on my machine.

Memoization test (HEP002)

It seems to me that memoization is doing what it should. I wrote a simple test:

import hickle as hkl
import numpy as np
import os

# Generate a reasonably large data array
data = np.arange(0, 2**20)

N_copies = (1, 10, 100, 1000)
filename = 'testdata.hkl'
print('hickle version:', hkl.__version__)
for N in N_copies:
    datadict = {}
    for ii in range(N):
        datadict[f'datacopy_{ii}'] = data
    hkl.dump(datadict, filename)
    filesize_kib = os.path.getsize(filename) // 1024
    print(f'{N} copies \t filesize {filesize_kib} kiB')
    os.remove(filename)

And as expected, the older hickle version increases the filesize by creating multiple copies:

hickle version: 3.4.5
1 copies     filesize 8200 kiB
10 copies    filesize 81944 kiB
100 copies   filesize 819376 kiB
1000 copies      filesize 8193704 kiB

With the new version the filesize does not increase (except for some book-keeping):

hickle version: 4.1.0    # Needs bump to 5.0.0?
1 copies     filesize 8200 kiB
10 copies    filesize 8205 kiB
100 copies   filesize 8238 kiB
1000 copies      filesize 8595 kiB

Comparing shows it is the same object:

import hickle as hkl
d = hkl.load('testdata.hkl')
d['datacopy_1'] is d['datacopy_2']

So this constitutes useful new functionality, which was a bar we set earlier as a prerequisite before considering merging.

Other

Conclusions

This PR implements HEP002, memoization. Pending comments and requests from @1313e before merging.

hernot commented 3 years ago

Hi @telegraphic. Good to hear, except you forgot that it also implements #135 (H4EP001, container and mixed (dataset + container) loaders). Without implementing #139, H4EP002 would not have been possible.

I fear that in terms of astropy, including removing the version constraint, I'm not much help here unless guided by you, @1313e.

I'm currently quite busy until the end of May, but afterwards I can add a tutorial on how to add custom loaders using register_class and how to define loader modules within a local hickle_loaders directory.

Concerning big refactors

I do think that after the two big refactors in 4.x and 5.x, things will be clearer and a lot more manageable and reviewable again. I do not see any further need for a big refactor; do you, @1313e? I would rather think that any future big refactor would only become necessary due to a radical change in the underlying libhdf5 and h5py leaving hickle dysfunctional. Anything else I would rather question as a design flaw, or even an attempt to implement something at the hickle level which is best handled by h5py, or even at the file level by libhdf5.

For example, meanwhile I would consider #133 such a feature that has to be carefully investigated and defined. Because libhdf5 is designed with the aim of hiding any details about how and where the data is stored from higher levels like hickle. In other words, as long as h5py does not provide a high-level interface for mapping/mounting multiple files into one single HDF5 file structure, I would consider #133 not feasible for hickle.

1313e commented 3 years ago

I have been sick for the past week or so, but I hope that I will be able to look at this tomorrow.

hernot commented 3 years ago

Global things:

* Why are the test files numbered now? There is no point in doing so, as it is just confusing.

Because the numbered ones contain the unit tests for each module. The numbers ensure that pytest executes the unit tests for the helpers.py module before those for the lookup.py module, and those before the tests for all loader modules and the hickle.py module. The reason for enforcing this order (*) of the unit tests is that, for example, the _load and _dump functions depend upon functions and objects provided by lookup and helpers etc. An error within, for example, the lookup.py module causes a unit test for _dump or _load to fail even if both methods are fully functional. The tests still fail, yes, but they fail after the related lookup.py unit test has already failed. Consequently, one would first fix the error in lookup, rerun pytest, and find that the errors on _dump and _load have vanished as well.

So instead of manually calling the tests, execute them through pytest. To make pytest drop into the post-mortem debugger when an error occurs, I typically use for example

pytest --pdb

and to debug the error causing a specific test to fail i call for example

pytest --trace hickle/tests/02_lookup.py::<name of test function>

and then I can use normal pdb commands to set breakpoints, step into or over functions, and inspect variables. Alternatively, especially when verifying prior to upload to GitHub that nothing breaks in newer or older Python versions, I run tox and, if necessary, append the above pytest options after a double --.

All other issue-related tests remained in the unnumbered files. I hope I could shed some light on why they are numbered.

(*) This is how I could identify all the places where temporarily disabling compression and chunking is necessary beyond the loaders defined by the load_builtins.py and load_numpy.py loader modules.
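The ordering trick relies on pytest collecting test modules within a directory in sorted filename order, so numeric prefixes enforce the dependency order described above. A minimal sketch (the file names are hypothetical examples mirroring the scheme, not the actual repository files):

```python
# pytest collects test modules in sorted (alphabetical) filename order
# within a directory, so a numeric prefix enforces the intended sequence:
# helpers before lookup, lookup before the loader/hickle tests.
collected = sorted([
    "03_load_builtins.py",   # hypothetical names illustrating the scheme
    "01_helpers.py",
    "02_lookup.py",
])
print(collected)
# → ['01_helpers.py', '02_lookup.py', '03_load_builtins.py']
```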

* I would very highly appreciate it if you could, while you are at it (I can do it as well if you want), change the CI service to GitHub Actions instead of Travis CI and AppVeyor.

  I am currently kinda having a vendetta against both of them (although mostly Travis CI).
  My [CMasher](https://github.com/1313e/CMasher) and [e13Tools](https://github.com/1313e/e13Tools) packages already use it, so you could look there for how that works.

Sure, if this simplifies the Travis/AppVeyor hell.

* Fix all spacing/indentation issues in all docstrings that have been written. Parameter names should be at the same indentation level as the section name.

I do

* Make sure to refer everywhere to this being v5 and not part of v4 in anyway to avoid confusion later on.

Ok, I'll take care of that.

* Check all your `# pragma: nocover` (which btw should be written as `# pragma: no cover`) and make sure that they are all correct.
  I see many of them in places that definitely should have a test covering them.

I'll check. There may be some where the whole section might even be removed, as they are superfluous and would never be triggered; at least I could not come up with a reasonable and plausible unit test hitting them. So I'm not sure whether to simply remove them and wait until somebody manages to create a condition which would have triggered them. Not sure what the best approach is. Those marking catch clauses that set some variable to None are there because behaviour has changed between h5py 2.10 and 3.x, or because an exception thrown by h5py could no longer be triggered reliably by a unit test when using 3.x while it is thrown reliably in 2.10. In order not to increase test complexity by introducing a dependency on the h5py version, I decided, without further testing, to convert the case triggering the exception into the other one, where None is returned instead of the exception being thrown.

It might take some time, as I'm currently a bit busy until mid May.

hernot commented 3 years ago

@1313e: As I'm at fixing indentations in docstrings, just a minor question: to have the docs consistent, which heading is preferred for the function/method input parameters/arguments section?

Args:
-----

or

Parameters:
-----------

Currently both are used inconsistently and mixed.

1313e commented 3 years ago

I always use

Parameters
----------
telegraphic commented 3 years ago

I always use

Parameters
----------

Yup, I'm happy with NumPy style and agree consistency is best. (I am, however, guilty of mixing Google style, my current preference, and NumPy style in projects.)

hernot commented 3 years ago

I do. @1313e or @telegraphic: one of you has to check the hickle repo settings and, if possible, allow GitHub Actions defined in any branch of my hickle fork to be run; otherwise I will not be able to finish switching to GitHub Actions. And in the end you will have to disable Travis and AppVeyor anyway; simply removing their YML files seems not to be sufficient, at least not for AppVeyor.

It seems that if GitHub Actions are not defined on the original repo, they are not run unless explicitly allowed (not sure). Otherwise, creating a dummy .github/workflows/test.yml in the dev-v5 branch might also do the trick if the other option is not feasible.

1313e commented 3 years ago

AppVeyor is currently not disabled specifically because you haven't removed the file yet. Also, you should just write - python-version: 3.9 on line 19.

It will not be possible to enable GitHub Actions for this PR yet, because it requires it to be set up on the branch you are merging into.

hernot commented 3 years ago

No worry. I just figured that they are run anyway; it's just that the results are not posted in this conversation but solely sent to my email. And that there can only be a single value on line 19, or that lists with single values are not possible, I already suspected and would have tested next. Thank you very much. I will let you know when I have managed to provide a working GitHub Action (I will squash all my trial and error before). PS: concerning disabling AppVeyor, simply removing its YAML didn't stop it from running but resulted in a failing test saying that AppVeyor is set up for the Visual compiler but no appropriate setup was found. Until I squash, you can check the report at https://ci.appveyor.com/project/telegraphic/hickle/builds/38985770

1313e commented 3 years ago

Oh, @telegraphic will have to remove the webhook at some point then. Anyway, just remove the file. I don't care that it is complaining about the file not being found for now.

Also, GitHub Actions gives you a notification on GitHub. I am assuming you have email notifications turned on. If you go to your profile->Notifications, all the way at the bottom, you can change this behavior.

hernot commented 3 years ago

OK, the Actions seem to work, except for now they run on my fork. Either they have to be merged into a branch of the hickle repo to run here, or they may even need some further restriction not to run on forks, as this prevents pushing coverage reports. From my side I think everything is done. I did some spell-checking of comments, messages and docstrings to hopefully remove the majority of my typos, and squashed all the failed actions etc.

1313e commented 3 years ago

Yeah, that looks fine to me now.

telegraphic commented 3 years ago

Ok @1313e webhooks deleted. Anything else I need to do to get actions up and running?

1313e commented 3 years ago

Ok @1313e webhooks deleted. Anything else I need to do to get actions up and running?

You should have waited with that until we actually merge this into the master branch... All normal branches still require AppVeyor for now.

I guess all branches will no longer have Windows testing on AppVeyor from now on.

hernot commented 3 years ago

Just a small sync ping, close to the end of the heavily loaded end-of-semester season.

hernot commented 3 years ago

It seems that the GitHub Action will not be able to upload coverage reports to this pull request until it is merged into the dev-v5 branch on this repo. Until then, the pull_request and pull_request.synchronize events have no idea that they should trigger the corresponding workflow. So errors affecting the upload, one of you @telegraphic, @1313e, or any other of your team will have to fix. If needed, I can help test that it is triggered properly on the pull_request, pull_request.synchronize, pull_request.reopened and pull_request.edited events. That it properly pushes coverage reports on push events, only you can test.

All the rest, not dependent upon any secrets owned by @telegraphic or only available to the telegraphic/hickle repo, I tested on my forked repo.

1313e commented 3 years ago

Alright, that is okay. I will have a look at all this next week, as I will finally have time for it. :)

hernot commented 2 years ago

@1313e any news, questions or recommendations for further improvements?

1313e commented 2 years ago

Well, I said I would have time, but I didn't really in the end. So, I haven't had a chance to look at it yet. I hope to be able to look at it soon.

1313e commented 2 years ago

Alright, I finally have had the chance to look at it. (I am really sorry for the delay). @hernot Can I just check with you that all of the changes are in this PR and there are no partly implemented features in there?

If not, then I will probably accept this PR.

hernot commented 2 years ago

Sure, just name possible dates and times. Nothing is worse than missing or partially broken features. If you prefer, I can create a Zoom meeting, or we can meet on Skype or Matrix, whichever you prefer.

1313e commented 2 years ago

Oh no, I meant that you can just confirm here whether there are no partial implementations in the PR. I don't think there are, but I just wanted to have that confirmed.

hernot commented 2 years ago

Ah sorry, sure, I'll check. If that's OK, then I'll do it by the forthcoming weekend at the latest. Here everybody is just busy with managing the semester start in the bachelor courses.

telegraphic commented 2 years ago

Just chiming in to say that I'm also happy to accept if this is indeed 100% feature complete!

@hernot good luck with start of semester (it's end of semester here in Australia)

hernot commented 2 years ago

@1313e, @telegraphic I can confirm that everything which should be in this PR is fully implemented, except maybe some minor/micro improvements, simplifications and fancy beautifications which are better dealt with in future minor and patch releases, if at all.

Following is a short summary of what is covered by this PR.

Anything else is beyond the scope of this PR.

The only thing left is to ask you, @telegraphic, as a native English speaker, if you could proofread and check my spelling and wording in comments, docstrings and the README. I have to admit I'm a rather poor speller in general and love entangling sentences like Lego bricks into gigantic constructs.

hernot commented 2 years ago

Hi, can you wait until the forthcoming weekend with merging? I just noticed that the numpy 1.21.x versions are out, and I want to check whether the bug they had in the 1.20.x versions is gone (see my comment from 18th Feb 2021). If successful, I would modify the numpy requirement such that it just skips the buggy 1.20.x versions.

The optimal case would be that all tests are again successful with the 1.21.x numpy versions.

hernot commented 2 years ago

!"%/$"!§$!"§$! Basically numpy 1.21 works, but there is a small issue with dill, which refuses to pickle the type of numpy.dtype('int16') and similar objects. It aborts hickle/tests/test_hickle.py::test_numpy_dtype with the following error:

_pickle.PicklingError: Can't pickle <class 'numpy.dtype[int64]'>: it's not found as numpy.dtype[int64]

This means the creation of the h5py dataset representing the numpy.dtype object works, but storing the pickle string describing the corresponding py_obj_type fails.

When replacing dill by plain Python pickle for dumping py_obj_type, the very same test passes like a charm. So it must be an issue with dill, which requires that the __qualname__ or __name__ of the dtype object is something accessible and constructible. The following options exist:

1) disable the test for now (mark it xfail) until a fix for dill is there, and hope that nobody tries to hickle a dtype object
2) store py_obj_type as plain pickle strings, not dill pickle strings, and find a way to read hickle 4 py_obj_type strings created by dill
3) still stick to numpy < 1.20 until the dill project has fixed it
4) 3D-print some home-brew workaround and glue it onto hickle (but I have no idea which)
5) shrug

In any case, I would open an issue on the dill project so that they can have a look and take care of it (dill issue #434).

EDIT: Made some local tests. Basically all tests pass with Python's vanilla pickle.dumps and pickle.loads. For legacy loading of hickle 3.x and hickle 4.x files, dill just has to be installed to ensure pickle.loads can implicitly import it and call its load methods. So in theory, if there were no other reasons why hickle has to stick to dill, there would be the following option:

Having thought about all the above, I have to admit I do not have any use case for dill, and thus also fail to understand what the extra benefit compared to Python's vanilla pickle package would be in the context of hickle dumping in-memory Python objects to an HDF5 file. I'm not sure what extra Python interpreter state that would be, and how it could be preserved by dumping to an HDF5 file; that at least is how I understand the claims of the dill package. Sorry, I actually had not intended to start a discussion here, but I feel a bit lost, especially as the dill package seems to me to either ignore all the Python 3 related improvements of the pickle package, or to handle them by workarounds for specific issues related to individual parts of packages, like the numpy dtype in this case. I subjectively rather feel like switching back to Python's vanilla pickle, but that's my personal gut feeling, so don't worry about it.
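The vanilla-pickle behaviour described above can be reproduced with the standard library alone; a minimal sketch, assuming numpy >= 1.21 (where numpy's own fix for pickling dynamic dtype classes is in place):

```python
import pickle
import numpy as np

# The py_obj_type stored by hickle for a dtype instance is its class,
# which in numpy >= 1.20 is a dynamically created subclass of np.dtype.
py_obj_type = np.dtype('int16').__class__

# Vanilla pickle round-trips it fine in numpy >= 1.21, where this
# failed with current dill versions at the time of this discussion.
restored = pickle.loads(pickle.dumps(py_obj_type))
assert restored is py_obj_type  # dtype classes are cached singletons
```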

hernot commented 2 years ago

Apart from the above file-format-o-sophic question, "dill weed or not dill weed", I have found two severe bugs during testing, the fixes for which cause changes to the hickle file structure. They should be fixed before merging, to avoid having to do a hickle 6 major release shortly after the hickle 5 release.

hernot commented 2 years ago

Ok, fixes done. The numpy dependency is still not changed, because it is a bit affected by the issue current dill versions have with dumping type(numpy(>=1.21).dtype), required for annotating numpy.dtype objects with their proper py_obj_type. A future dill release will have a dedicated fix for it. But in general, this will occur for any project which, like numpy, uses dynamic classes and the like. And it is only to be expected that the dill project will add further workarounds and fixes for popular packages like numpy, but not for smaller or personal projects. So the only solution in the current state is either to combine the latest numpy with the latest dill including the fix, or to manually resort to numpy < 1.20. Both solutions are quite limiting and reduce the usability of hickle.

After testing, the following options are feasible and reasonable for hickle. I need to know which route to go:

NOTE: Windows runners on GitHub Actions seem not to expose the most consistent configuration. Some of them seem to be configured to globally allow read access to files opened for writing, while others prohibit this, causing the error described in the log message of the latest unplanned commit. I just stumbled across this when preparing the dill-or-not-dill question within my forked repo. I wanted to spare you the despair I faced thereby, until accepting what my gut suspected from the start.

EDIT: Found another bug breaking support for package, module and main-script loader modules. I fear I deleted it accidentally when cleaning up the coverage-flagging # pragma: nocover comments. Luckily I found it. Let me know whether I should squash all the small fixes into one single commit or keep them as they are.

telegraphic commented 2 years ago

Hi @hernot -- catching up on this now. I think a larger issue is that dill shouldn't be pickling numpy dtypes!

Numpy dtypes should be handled by create_np_dtype. Any idea why this isn't happening?

hernot commented 2 years ago

@telegraphic It is happening; create_np_dtype is working properly. The problem is the creation of the py_obj_type attribute string, which is a pickle string of, for example, np.dtype(int).__class__. That works flawlessly with current dill for numpy < 1.20. With numpy >= 1.20 the numpy project has changed a few things, including that they now use dynamic class objects for encoding the class of each alive dtype.

If you look at type.mro(np.dtype(int).__class__), it looks like the following in numpy >= 1.20: [<class 'numpy.dtype[int]'>, <class 'numpy.dtype'>, <class 'object'>]. As you can see, the actual class has a repr string and a __qualname__ which is not a valid Python expression. When this change was first published with numpy 1.20.x, it caused a problem with standard pickle too. But in numpy >= 1.21 they have added a copyreg.dispatch_table entry for type(np.dtype) type objects, which handles the dumping of the new-style np.dtype classes.

Dill does not make use of this table, so it is not able to handle this case properly. Last weekend I had a discussion with the numpy and dill projects, with the result that the dill project has added a workaround for exactly this case, which will be available in the next release. Checking the workaround, it is not the only one the dill project uses to handle special cases like the new way numpy encodes the classes of its dtype objects.
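The class layout described above can be inspected directly; a short sketch, assuming numpy >= 1.20 (the bracketed rendering like numpy.dtype[int64] is from numpy 1.20/1.21; newer numpy releases expose these as named classes instead, but the subclass layout is the same):

```python
import numpy as np

# In numpy >= 1.20 each dtype instance belongs to a dynamically created
# strict subclass of np.dtype, one per scalar type, rather than to
# np.dtype itself as in numpy < 1.20.
cls = np.dtype(int).__class__
print(type.mro(cls))

# The concrete class sits between np.dtype and object in the MRO.
assert cls is not np.dtype
assert np.dtype in type.mro(cls)
assert object in type.mro(cls)
```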

In the light of the now official possibility to add custom loaders to hickle outside the official hickle/loaders directory, it is to be expected that similar problems will occur more frequently with custom classes from users, who will report them as issues on the hickle project. Given that the dill project has its reasons to still stick to a Python 2 style pickler and not utilize copyreg and the other extensions introduced by Python 3, it is to be expected that there will only be dill-side workarounds for issues related to projects larger than the dill project.

This, on the other hand, will sooner or later limit the outreach and use cases of hickle, which, as far as I understand, was and is intended by you, @telegraphic and @1313e, to be as compatible with standard pickle as possible and feasible. Therefore users have the ability to add their own loaders for their very local and specific classes and cases. If they are not picklable by pickle, then hickle should not handle them either. But if they are properly handled by standard pickle, I would expect hickle to handle them too, and I would be surprised if it didn't just because dill refuses to pickle the class object of my fancy object.

Hope I could shed some light on the core of the issue and on what breaks when changing the numpy requirement from < 1.20 to != 1.20, to include numpy versions >= 1.21.

1313e commented 2 years ago

I think with all that, we should drop dill and go back to pickle again. I am somewhat surprised that dill does not use copyreg, because I think it should use that, but if they don't, then there is no reason to use dill anymore. After all, copyreg is how you add more objects to pickle. @telegraphic Thoughts?

telegraphic commented 2 years ago

Thanks @hernot for the explanation! If pickle handles this correctly, I'm happy to switch to it as it is part of Python core. This major v5 release is the optimal time to make the change too!

I think pickling with protocol=4 is a good option? (https://www.python.org/dev/peps/pep-3154/) (Assuming this plays nicely with numpy @hernot?)

hernot commented 2 years ago

@telegraphic Sorry, I misread your question at first. Normally pickle will anyway select the highest version available in the given Python version.

In Python 3.5, 3.6 and 3.7 the default is 3 and the highest is 4, whereas from 3.8 on the default is 4 and the highest is 5. Therefore, yes, to date there will be no problem reading files generated with any of the Python versions currently supported by hickle. And from the documentation it seems that this will not change soon, at least not for Python 3.9, 3.10 or the forthcoming Python 3.11.
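The defaults mentioned above can be checked at runtime; a small sketch (the commented values apply to Python 3.8 through at least 3.11, per the versions discussed above):

```python
import pickle

print(pickle.DEFAULT_PROTOCOL)   # 4 on Python 3.8+
print(pickle.HIGHEST_PROTOCOL)   # 5 on Python 3.8+

# Explicitly pinning protocol 4 (PEP 3154) keeps the produced byte
# stream readable on every Python version from 3.4 onwards.
data = {'answer': 42}
blob = pickle.dumps(data, protocol=4)
assert pickle.loads(blob) == data
```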