Closed cyfra closed 2 years ago
Hi @cyfra,
we need a bit more information than that.
For example, what version of hickle
are you using?
Alright, I found the issue.
In h5py
3.0, strings stored in HDF5-files are now returned as unicode rather than bytes.
This would require changing all places where strings are being read in to function properly.
@telegraphic I think the best option for now is to exclude h5py
3.x from the requirements until they have released a few versions of it.
Besides writing a function that checks what version of h5py
you are using; uses the proper method of reading strings; and applying it literally everywhere where hickle
reads strings, we cannot make a compatibility change for this.
Forcing everyone to upgrade to h5py
3.0 right now, feels like a bit too much to me.
@1313e - thanks a lot for looking into this so quickly.
@1313e @telegraphic just in case it might be of any interest to you or even any help at all i wanted to let you know that: Since a while ago i got in contact with tox package on another project I'm contributing to and i have to admit i like it. Therefore i was curious whether it would be possible to simplify maintaining production release while at the same time in parallel work on another branch on eg. issues arising when wanting to support h5py. 2.X versions in parallel to recent 3.0 and future 3.X versions. The results of this curiosity can be found in the detached concept_memp_compact_expand branch of my hickle fork. Be warned the only test working therein is the test_stateful.py for which at least one > 100 MB file is required i have to share with you via file-sharing service as github does not allow to link forked repos with git(hub) large file service. Further even though i have edited .travis.yaml and .appveyor.yaml according to documentation to be aware of running tests through tox instead of directly calling pytest i have, thanks to the other project some idea about travis but none about appveyor, if that works as intended or at all.
Alright, given that I have had very little time lately to do things, I am just removing h5py
3.x from the requirements for now.
This is the only one where i just have as a by product of compression safety proofing an unfinished and to further discuss and extend suggestion, which will be included in finalize and clean up pull-request.
Ok one issue which will definitely will persist even after memoisation implementation is the fact that h5py >= 3.0 converts all bytes
strings like 'item_type' which were assigned to an attribute to a str
assuming that they were created form python2 strings or are just manually encoded python3 str
objects. The only option i have found to enforce creating an bytes
type attribute is to use the create
method of the attrs
property this allows to explicitly specify the dtype of the attribute
some_dataset_or_group.attrs.create('item_type',data=b'str',dtype = 'S{}'.format(len(b'str')))
print(some_dataset_or_group.attrs['item_type'],type(some_dataset_or_group.attrs['item_type']))
print(some_dataset_or_group.attrs.get('item_type'),type(some_dataset_or_group.get('item_type')))
Alternatively just support h5py >= 3 in newer version of hickle and sacrifice support of h5py < 3 which I'm not sure if this is currently an option. Just as a side node with type memoisation in place this will not be an issue for 'type' and 'base' type attributes. But reading older files will be impossible with hickle versions upporting h5py >= 3 as pickle string will cause encoding errors when h5py tries to convert them to utf8 used by str.
Not yet seen reported issue from unit tests as added by pr #138. keep you posted as i also would be interested in using hickle on newer systems where h5py >= 3 is already installed and switching back to h5py < 3 is no option.
Ist mostly harmless. For the by far most strings which are stored by intention as python bytes like base_type or str_type it is sufficient to replace for example
if str_type == b'str':
....
by
if str_type in (b'str','str'):
.....
And for the base_type
type strings it is sufficient when register_class method simply adds two entries for each loader in the hkl_types_table
(hickle <= 4.0.3 as of 4th January 2020)
# in register class
if load_fcn is not None:
hkl_types_table[hkl_str] = load_fcn
hkl_types_table[hkl_str.decode('utf8')] = load_fcn
The same holds for key_base_type strings:
if key_base_type in (b'str','str'):
....
dict_key_types_dict = {
b'float': float,
'float': float,
b'int': int,
'int': int,
...
}
A bit more trickly is how h5py >= 3.0 handles strings stored by dump_scalar_dataset. As these strings are stored without explcitly speciifying the dtype for the dataset it requires special provision when loading data written by hickle <= 4.0.3. on loading.
#in load scalar dataset
content = h_node[()]
if py_obj_type is str and object in h_node.dtype.name and h_node.attrs.get('str_type',None) is None:
if not isinstance(content,str):
content = content.decode('utf8')
The good news is pickle strings store in the 'type' attribute are not affected at all. Likely as they contain bytes with are not part of valid utf8 encoded strings they seem to be returned in any case as bytes strings.
As soon as @telegraphic finds time that we can finalize #138, work on the already perpared but not yet published pull requests for #139 and #145 i could provide also a production ready fix for this in the also already prepared finalize and clean-up pull request. The only thing to be discussed is still if two distinct pytest runs for h5py 2 and h5py 3 compatibility are required or if hickle shall always be tested against latest h5py version by default and only if users report issues with h5py 2 than tests are added with allow to mock h5py 2 behaviour under latest h5py version.
@hernot Any update on this issue? h5py < 3.x is not providing wheels for python 3.9 on windows, so this blocks hickle on python 3.9
From Python 3.9 on only h5py >= 3.x and that is 64 bit only. In addition manual compilation and installation via pip will also fail, cause h5py >= 3.x requires libhdf5 versions which are too available 64bit only.
Forthcomming hickle >= 5.x, currently reviewed by @1313e and @telegraphic, will support h5py >=3.x. From Python 3.9 support 64bit is the only option. On Python < 3.9 it will fallback to h5py 2.10 especially for windows 32 bit.
@1313e are there any plans for hickle 4.x to backport the requirements?
If you need win32 than you are locked in to Python < 3.9 and h5py 2.10. If you need windows 64 bit for now you are locked in to Python <= 3.8 and h5py 2.10 too. For using h5py >= 3.0 and/or Python >= 3.9 you will have to wait for hickle 5.
Until hickle >= 5.0 is released explicit installation of h5py 2.10 along with hickle especially for windows 32 is possible for Python <= 3.8.
pip install h5py==2.10 hickle
The bad news is for Python >=3.9 you likely have to wait for hickle 5 and migrate to windows 64 bit unless downgrading to Python 3.8 or older is an option for you.
@peendebak @hernot Yeah, I think we should backport this to hickle
v4.
I am currently in the process of moving back to my home country, but I should have time to do that afterward.
So, I hope you can wait for a few weeks, as it isn't a very trivial process.
Unless you want to give a go yourself of course.
@peendebak if you are intending to attempt the backport i can help and share my knowledge related to the topic i gained through preparation of hickle 5 release candidate. @1313e on atempting to backport shoud than hickle 4 also switched from travis and appveyor to gh actions ?
@hernot Probably, yes. Moving away entirely from Travis CI and Appveyor would be really great.
Hickle seems to stop working with h5py 3.0.0. It works fine with 2.10.
Example code:
Fails with