telegraphic / hickle

a HDF5-based python pickle replacement
http://telegraphic.github.io/hickle/
Other
485 stars 70 forks source link

Hickling an empty dict does not return one when unhickled #91

Closed 1313e closed 5 years ago

1313e commented 5 years ago

Not sure if this is intended behavior, but when hickling a dict that is empty and unhickling it, an empty list is returned instead of an empty dict. Hickling and unhickling a non-empty dict will work perfectly fine. Is this intended behavior or simply something that was overlooked?

telegraphic commented 5 years ago

Hi @1313e, you are good at finding edge cases 🥇! No, that is not intended behaviour, likely to do with the check if something is iterable and with non-homogeneous data types.

1313e commented 5 years ago

I do know that hickling an empty dict does mark it correctly as a dict inside the HDF5 file, so I assume it has to do with the loading process.

Hi @1313e, you are good at finding edge cases ! No, that is not intended behaviour, likely to do with the check if something is iterable and with non-homogeneous data types.

Well, that is how I like to write my code: Taking care of all situations I can think of.

telegraphic commented 5 years ago

Dictionaries have indeed caused some pain in the past, I added support for unicode and integer keys back in #36. Theoretically any hashable type can be used for a key but not so for HDF5 dataset names :/

Nb: also likely related to #47.

1313e commented 5 years ago

Dictionaries have indeed caused some pain in the past, I added support for unicode and integer keys back in #36. Theoretically any hashable type can be used for a key but not so for HDF5 dataset names :/

Nb: also likely related to #47.

I was actually just about to mention that I tried hickling a dict that uses tuple keys, which seems to not work properly. Try doing the following:

>>> hickle.dump({(1, 2): 1}, 'test.hdf5')
>>> hickle.load('test.hdf5')
[(1, 2)]

This may be related to the issue with empty dicts being returned as empty lists.

1313e commented 5 years ago

@telegraphic May I ask you what the purpose is of this bit in helpers.sort_keys:

to_int = lambda x: int(re.search(b'\d+', x).group(0))
keys_by_int = sorted([(to_int(key), key) for key in key_list])
return [ii[1] for ii in keys_by_int]

Why would it be necessary to sort the keys in order? Not only are there easier ways of doing that (use a sortedlist for example), but keys in a dict are never sorted anyways.

I am asking, since I fixed the problem with empty dicts not being returned as empty, but these lines mess that up. If I disable them and simply return key_list, all tests check out perfectly fine.

EDIT: Ah, I found why that thing exists. I am going to have a look at it.

1313e commented 5 years ago

Alright, I have fixed the problem with empty dicts not being unhickled as empty dicts. However, this does not yet fix the problem with using tuples as dict keys. I am looking into that now.

1313e commented 5 years ago

And I have now also fixed the problem with tuples not being usable as dict keys.

telegraphic commented 5 years ago

^^ Yeah that code above is terse. Sounds like you've figured it out, but used to support reconstruction of lists with heterogeneous types, e.g. [1, 1.0, 'a']. This approach is slow if you have really large heterogeneous lists, but was the most sane way to support mapping into hdf5 data structures that I could think of at the time.

1313e commented 5 years ago

^^ Yeah that code above is terse. Sounds like you've figured it out, but used to support reconstruction of lists with heterogeneous types, e.g. [1, 1.0, 'a']. This approach is slow if you have really large heterogeneous lists, but was the most sane way to support mapping into hdf5 data structures that I could think of at the time.

I have not made any changes to how heterogeneous lists are saved or loaded at all. So, I am failing a bit to see what the connection here is. Care to explain?

telegraphic commented 5 years ago

Ah, of course. Going from memory here, but say you had:

a = [[1,2,3], [1.0, 2.0, 3.0], ['1', '2', '3']]

If you tried to store that using h5py.create_dataset, it would fail due to the different types, But if you had:

a = [[1,2,3],[1,2,3], [1,2,3] ]

Then storing the whole thing as a single dataset is A-OK. So, in the first case hickle will create a set of datasets within a group:

\group
    \data_0  - [1,2,3]
    \data_1  - [1.0, 2.0, 3.0]
    \data_2  - ['a', 'b', 'c']

HDF5 does not maintain order with these keys, so they are sorted first, then passed to the loader. Otherwise you may end up with

[['a', 'b', 'c'], [1,2,3], [1.0, 2.0, 3.0]]

or another permutation.

There may well be better ways of achieving the same result that didn't occur to me when I first wrote it.

1313e commented 5 years ago

@telegraphic Ah, that is what you meant. I made a very small change to that algorithm to work a bit better, but should not slow it down I think.

1313e commented 5 years ago

I guess this one can be closed now.