Closed 1313e closed 5 years ago
Hi @1313e, you are good at finding edge cases 🥇! No, that is not intended behaviour, likely to do with the check if something is iterable and with non-homogeneous data types.
I do know that hickling an empty dict does mark it correctly as a dict inside the HDF5 file, so I assume it has to do with the loading process.
Hi @1313e, you are good at finding edge cases ! No, that is not intended behaviour, likely to do with the check if something is iterable and with non-homogeneous data types.
Well, that is how I like to write my code: Taking care of all situations I can think of.
Dictionaries have indeed caused some pain in the past, I added support for unicode and integer keys back in #36. Theoretically any hashable type can be used for a key but not so for HDF5 dataset names :/
Nb: also likely related to #47.
Dictionaries have indeed caused some pain in the past, I added support for unicode and integer keys back in #36. Theoretically any hashable type can be used for a key but not so for HDF5 dataset names :/
Nb: also likely related to #47.
I was actually just about to mention that I tried hickling a dict that uses tuple keys, which seems to not work properly. Try doing the following:
>>> hickle.dump({(1, 2): 1}, 'test.hdf5')
>>> hickle.load('test.hdf5')
[(1, 2)]
This may be related to the issue with empty dicts being returned as empty lists.
@telegraphic May I ask you what the purpose is of this bit in helpers.sort_keys
:
to_int = lambda x: int(re.search(b'\d+', x).group(0))
keys_by_int = sorted([(to_int(key), key) for key in key_list])
return [ii[1] for ii in keys_by_int]
Why would it be necessary to sort the keys in order? Not only are there easier ways of doing that (use a sortedlist for example), but keys in a dict are never sorted anyways.
I am asking, since I fixed the problem with empty dicts not being returned as empty, but these lines mess that up.
If I disable them and simply return key_list
, all tests check out perfectly fine.
EDIT: Ah, I found why that thing exists. I am going to have a look at it.
Alright, I have fixed the problem with empty dicts not being unhickled as empty dicts. However, this does not yet fix the problem with using tuples as dict keys. I am looking into that now.
And I have now also fixed the problem with tuples not being usable as dict keys.
^^ Yeah that code above is terse. Sounds like you've figured it out, but used to support reconstruction of lists with heterogeneous types, e.g. [1, 1.0, 'a']
. This approach is slow if you have really large heterogeneous lists, but was the most sane way to support mapping into hdf5 data structures that I could think of at the time.
^^ Yeah that code above is terse. Sounds like you've figured it out, but used to support reconstruction of lists with heterogeneous types, e.g.
[1, 1.0, 'a']
. This approach is slow if you have really large heterogeneous lists, but was the most sane way to support mapping into hdf5 data structures that I could think of at the time.
I have not made any changes to how heterogeneous lists are saved or loaded at all. So, I am failing a bit to see what the connection here is. Care to explain?
Ah, of course. Going from memory here, but say you had:
a = [[1,2,3], [1.0, 2.0, 3.0], ['1', '2', '3']]
If you tried to store that using h5py.create_dataset
, it would fail due to the different types, But if you had:
a = [[1,2,3],[1,2,3], [1,2,3] ]
Then storing the whole thing as a single dataset is A-OK. So, in the first case hickle will create a set of datasets within a group:
\group
\data_0 - [1,2,3]
\data_1 - [1.0, 2.0, 3.0]
\data_2 - ['a', 'b', 'c']
HDF5 does not maintain order with these keys, so they are sorted first, then passed to the loader. Otherwise you may end up with
[['a', 'b', 'c'], [1,2,3], [1.0, 2.0, 3.0]]
or another permutation.
There may well be better ways of achieving the same result that didn't occur to me when I first wrote it.
@telegraphic Ah, that is what you meant. I made a very small change to that algorithm to work a bit better, but should not slow it down I think.
I guess this one can be closed now.
Not sure if this is intended behavior, but when hickling a dict that is empty and unhickling it, an empty list is returned instead of an empty dict. Hickling and unhickling a non-empty dict will work perfectly fine. Is this intended behavior or simply something that was overlooked?