uchicago-cs / deepdish

Flexible HDF5 saving/loading and other data science tools from the University of Chicago
http://deepdish.io
BSD 3-Clause "New" or "Revised" License

add soft links to save shared objects just once and support recursion #11

Closed · twmacro closed this 8 years ago

twmacro commented 8 years ago

This PR adds soft links to deepdish. Primarily, this means that a shared object is written to disk just once, no matter how many names or other objects refer to it, and these relationships are maintained on load. It also allows recursive data structures.

A common use case for us is saving time vectors for accelerometer data or analysis data. Often the time vectors are all the same, but sometimes they are all different; similarly for frequency vectors in frequency response analysis. It is also sometimes handy to have a shortcut link (not unlike a file-system link) from within one dictionary to another "common data" dictionary in a large data structure. The soft-link capability in HDF5 makes adding this feature to deepdish almost trivial. :smile:

For example:

import numpy as np
import deepdish as dd
A = np.random.randn(3, 3)
d = dict(A=A, B=A)   # two keys refer to the same matrix
d['C'] = d           # add a recursive member
dd.io.save('test.h5', d)
d2 = dd.io.load('test.h5')

From within IPython:

In [2]: d2['B'] is d2['A']
Out[2]: True

In [3]: d2['C'] is d2
Out[3]: True

Here is a ddls view of the file:

$ ddls test.h5
/A                         array (3, 3) [float64]
/B                         link -> /A [SoftLink]
/C                         link -> / [SoftLink]
rnaero commented 8 years ago

Debugging and rerunning scripts that load large data files can slow down the workflow. Adding soft links, as described above, will definitely speed things up. :)

gustavla commented 8 years ago

I really like the idea. My only concern is whether softlinks should be automatic or a manual option. It is possible that many people do not know about softlinks in HDF5 files and do not realize that they are creating them when placing an array at multiple locations in the file. If that file is then loaded, the user might assume that they are two distinct arrays. A manual option could be similar to dd.io.ForcePickle, something like:

dd.io.save('test.h5', dict(A=A, B=dd.io.SoftLink(A)))

That way, expert users can use it with even more control, since they get to decide which one is the link and which one is the main copy. For users unaware of softlinks, it won't introduce a subtle semantic that could lead to pitfalls. @twmacro, thoughts on this?
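For illustration, the marker could be as small as a wrapper class. A hypothetical sketch (the class name and attribute are assumptions here, not deepdish's actual API):

class SoftLink:
    """Hypothetical marker: wraps a value that the saver should write
    as an HDF5 soft link pointing at wherever the target is stored."""
    def __init__(self, target):
        self.target = target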

@rnaero, remember that you can also do partial loads using dd.io.load('file.h5', '/path/to/group') if it's too slow to load the whole file. You can even load just parts of a single array using dd.io.load('file.h5', '/path/to/group', sel=np.s_[:100]).
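For instance, against the test.h5 saved at the top of this thread, the partial-load calls above look like this:

import numpy as np
import deepdish as dd

# Load only the array stored at '/A', not the whole file:
A = dd.io.load('test.h5', '/A')

# Load just the first two rows of that array:
A_head = dd.io.load('test.h5', '/A', sel=np.s_[:2])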

twmacro commented 8 years ago

Thank you @rnaero for your interest in this PR!

I definitely respect your careful and thoughtful approach, @gustavla! That's no doubt why deepdish is so good. I see your point too: it is possible that users rely on the save/load process as a way of ensuring deep copies of everything. I think, though, that there is more value in maintaining the original relationships: not only is it more efficient in space and time, it's a truer save/load. I would therefore prefer the links to be automatic. :smile:

I don't think I'd use a B=dd.io.SoftLink(A) option ... at least not directly. I don't know most of the relationships ahead of time, so I'd probably have to write a function that builds a temporary copy of the data structure (with dd.io.SoftLinks in it) and call it before calling dd.io.save. I think a more workable method would be something like:

dd.io.save('test.h5', dct, use_softlinks=True)

(In that case, I'd of course prefer that the default be True ... but that's above my pay grade! :smile:)
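For the curious, the bookkeeping an automatic mode needs can be sketched by tracking objects by id() during a save: write real data the first time an object is seen, and a soft link for every later reference. This is purely illustrative (the names are made up; the PR's actual logic lives inside deepdish's save routine):

seen = {}  # id(obj) -> HDF5 path where that object was first written

def plan(obj, path):
    # Decide whether this node becomes real data or a soft link.
    # (Holding the whole structure keeps its objects alive, so their
    # id() values stay stable for the duration of the save.)
    if id(obj) in seen:
        return ('softlink', seen[id(obj)])  # later reference: link to first copy
    seen[id(obj)] = path
    return ('data', path)                   # first occurrence: write it here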

twmacro commented 8 years ago

I just discovered that "pickle" does what this PR does. I stumbled across the following in the pickle documentation (https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled):

The pickle module keeps track of the objects it has already serialized, so that later references to the same object won’t be serialized again.

This has implications both for recursive objects and object sharing. [... stuff deleted ...]. Shared objects remain shared, which can be very important for mutable objects.

So, I tested it out:

import numpy as np
import pickle

value = np.random.randn(100, 23)
dct = dict(A=value, B=value)
newdct = pickle.loads(pickle.dumps(dct))

print("newdct['A'] is newdct['B']:",
      newdct['A'] is newdct['B'])

print("newdct['A'] is    dct['A']:",
      newdct['A'] is dct['A'])

Which gives this result:

newdct['A'] is newdct['B']: True
newdct['A'] is    dct['A']: False
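And pickle's memoization also covers the recursive case from the deepdish example above, for instance:

import pickle

d = {}
d['C'] = d                         # recursive reference, as in the example above
d2 = pickle.loads(pickle.dumps(d))
print(d2['C'] is d2)               # True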

Cool! :-)

gustavla commented 8 years ago

OK, you have definitely convinced me (good idea comparing to the semantics of pickling!). After testing this out, this is really cool and a very well put together PR.

I'm really sorry it took so long to merge this. The only final code comment I had was to give _load_level1 a better name, but don't worry, I'll change that after it's merged.

Thanks so much for this!

twmacro commented 8 years ago

Awesome, thank you very much! :-)