uchicago-cs / deepdish

Flexible HDF5 saving/loading and other data science tools from the University of Chicago
http://deepdish.io
BSD 3-Clause "New" or "Revised" License
270 stars 59 forks source link

Add simplenamespace io #9

Closed twmacro closed 8 years ago

twmacro commented 8 years ago

This PR adds native save/load of Python SimpleNamespace objects to deepdish. The implementation is such that older versions of Python (that do not support SimpleNamespace types) will seamlessly load them as dictionaries. The current version of deepdish supports the SimpleNamespaces via pickling, which has obvious downsides.

Our group uses dictionaries and SimpleNamespaces very often in our work: dictionaries when the keys are variables, SimpleNamespaces when the keys are fixed.

From "The Zen of Python": Namespaces are one honking great idea -- let's do more of those!

twmacro commented 8 years ago

Added soft links. This takes care of multiple names referring to the same object and maintains those relationships upon load. Also allows for recursion. For example:

import numpy as np
import deepdish as dd
A = np.random.randn(3, 3)
d = dict(A=A, B=A)   # two objects point to same matrix
d['C'] = d           # add a recursive member
dd.io.save('test.h5', d)
d2 = dd.io.load('test.h5')

From inside IPython:

In [2]: d
Out[2]: 
{'A': array([[-0.38067395, -0.00706444,  0.8047854 ],
        [ 0.4510418 ,  0.36732588, -0.54035383],
        [-2.35721862, -0.49819387,  0.70847998]]),
 'B': array([[-0.38067395, -0.00706444,  0.8047854 ],
        [ 0.4510418 ,  0.36732588, -0.54035383],
        [-2.35721862, -0.49819387,  0.70847998]]),
 'C': {...}}

In [3]: d2
Out[3]: 
{'A': array([[-0.38067395, -0.00706444,  0.8047854 ],
        [ 0.4510418 ,  0.36732588, -0.54035383],
        [-2.35721862, -0.49819387,  0.70847998]]),
 'B': array([[-0.38067395, -0.00706444,  0.8047854 ],
        [ 0.4510418 ,  0.36732588, -0.54035383],
        [-2.35721862, -0.49819387,  0.70847998]]),
 'C': {...}}

In [4]: d2['A'] is d2['B']
Out[4]: True

In [5]: d2['C'] is d2
Out[5]: True
twmacro commented 8 years ago

Oh, and the ddls output on that file is:

$ ddls test.h5
/                          dict
/A                         link -> /B [SoftLink]
/B                         array (3, 3) [float64]
/C                         link -> / [SoftLink]
gustavla commented 8 years ago

This looks great. Thanks for the PR!

Is there a connection between SampleNamespace and SoftLink? If not, I would rather see that dealt with in two separate PRs. I felt like I was getting ready to merge the SampleNamespace, but I need to give the SoftLink more thought, so it would be better if we could treat those two separately. What do you think?

twmacro commented 8 years ago

That would be excellent! I was keeping these separate myself, but then I cleaned up ls.py a little in the new branch. Bad me! Let me know if I need to do anything ... this is my first PR. :-)

twmacro commented 8 years ago

Here's my plan: remove the soft link stuff from the latest commit to just retain the final, cleaned up SimpleNamespace stuff. That should get this PR back in proper order I think. Then, after we're done with this one, I'll build the second PR based on the final results. Sound good?

gustavla commented 8 years ago

Sounds good. You can always abandon this PR and open two brand new ones, if you find that easier. Then you could just cherry-pick the commits pertaining to each feature. I'm fine with either way though.

twmacro commented 8 years ago

Okay, this one should be ready for your review. One thing to note is that I did not update any version settings (like the IO_VERSION).

gustavla commented 8 years ago

Yes, you should definitely bump IO_VERSION to 9.

gustavla commented 8 years ago

In this update, the root / is always displayed on a separate line. This makes sense to distinguish between dict and SimpleNamespace, but it does add an unnecessary line to all single-array data files, which is a very common use case.

How about leaving the output as is if SimpleNamespace is not used in the root, and simply add that line only if it's used. I'd love to hear your thoughts on this.

gustavla commented 8 years ago

Also, in versions <3.3, silently falling back to dictionaries could be dangerous. If a SimpleNamespace is expected, the code will break if that's not returned. Although the fall-back certainly would be convenient during an interactive session if someone gives you a file with a SimpleNamespace and you are sitting on <3.3.

I think at least a warnings.warn should be issued if a SampleNamespace:-annotated group is encountered and _sns is False. Thoughts? I can add this after merging the PR too.

twmacro commented 8 years ago

Excellent points; I concur on both counts. Are you planning to add both updates? Note that the io.rst file has that root line printed in the sample outputs.

gustavla commented 8 years ago

I can do both points, it's up to you. Let me know if you want me to this myself post-merge.

Also, I might trim down the SimpleNamespace section of the docs too. Currently, I think this feature is pretty niched and given that it shares semantics with dictionaries, it doesn't need that much explanation.

Thanks for the PR by the way! Much appreciated.

twmacro commented 8 years ago

That's fine with me. :) The README.rst file also has the extra line.

gustavla commented 8 years ago

OK, in that case I'm merging this now. Thanks again!

twmacro commented 8 years ago

Awesome! Thank you so much for such an excellent package. I thought I was going to have to write this myself. :-)