radiocosmology / caput

Cluster Astronomical Python Utilities
GNU General Public License v2.0

fix(memh5): fix a bug where shared datasets weren't actually shared #239

Closed by ljgray 3 months ago

ljgray commented 1 year ago

This PR makes the shared argument in MemDiskGroup actually share the MemDataset object and not just the underlying data array. Because shared datasets are meant to be maintained through operations such as redistribution, axis downselection is not allowed for shared datasets.
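To illustrate the distinction being made here, a minimal sketch (the `Dataset` class and attribute names are illustrative placeholders, not the memh5 API): sharing only the array falls out of sync when an operation like redistribution replaces the data, whereas sharing the wrapper object stays in sync.

```python
import numpy as np

# Hypothetical stand-in for a MemDataset-like wrapper.
class Dataset:
    def __init__(self, data):
        self.data = data

orig = Dataset(np.zeros(3))

# Sharing only the underlying array: a new wrapper around the same buffer.
array_shared = Dataset(orig.data)
assert array_shared is not orig and array_shared.data is orig.data

# Sharing the dataset object itself: both names refer to one wrapper.
object_shared = orig

# An operation that replaces the array (e.g. a redistribution) keeps the
# object-level share in sync, but breaks the array-level share.
orig.data = np.ones(3)
assert object_shared.data is orig.data    # still shared
assert array_shared.data is not orig.data # fell out of sync
```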

It's important to note that the default behaviour of memh5.deep_group_copy is (and has been since inception) to make only a shallow copy of the MemDataset - i.e., the underlying array is actually shared. This doesn't matter when using this method to copy disk -> memory or memory -> disk, but it does have implications when copying memory -> memory. It probably shouldn't be the default behaviour, but we would have to be extremely careful when changing it to avoid breaking anything.
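A small sketch of why the shallow default only matters memory -> memory (plain dicts of numpy arrays as stand-ins for groups; not the memh5 implementation): a shallow copy shares the array, so in-place writes are visible through both copies, while a deep copy duplicates it.

```python
import numpy as np

# Stand-in for an in-memory group: a mapping of dataset names to arrays.
src = {"dset": np.arange(4)}

# Shallow copy: the container is new, but the array is shared.
shallow = dict(src)
shallow["dset"][0] = 99
assert src["dset"][0] == 99  # memory -> memory: the write is visible in both

# Deep copy: the array itself is duplicated, so the source is unaffected.
deep = {k: v.copy() for k, v in src.items()}
deep["dset"][0] = -1
assert src["dset"][0] == 99
```

When the target is a file, the data is serialised to disk regardless, so array sharing in memory has no observable effect there.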

I've also updated the docstring for deep_group_copy to make note of the copy behaviour.

This lets us remove the .copy method in draco.ContainerBase, since deep_group_copy will now properly replicate that behaviour (see radiocosmology/draco#231).

This PR also adds an option to skip specific datasets when copying.

jrs65 commented 1 year ago

I have just been looking through this one again and reflecting on what the behaviour should be. It's not clear to me at what level a shared dataset should be shared. Really, the only purpose of sharing is to save memory, so anything that achieves that is fine. I'm not sure about shared datasets maintaining sharing through redistribution, though. It might also be good to allow shared selections, but that has its own implications.

ljgray commented 1 year ago

I think it's valid for a user to expect that a shared dataset would remain shared through a redistribution: it's a common operation, and it defeats the purpose of the memory savings if we lose them the moment we redistribute.

Selection is tricky, since it would effectively have to replace the MemDataset within the group with a new one with the selections applied. I'm not sure how that would work without modifying the way MemDatasets are referenced by the group - i.e., rather than the shared dataset pointing directly at the same object, it would have to point at a reference in the group tree which in turn references a dataset object, so that if the dataset object changed, the tree reference would still be the same thing.

Unless you just mean applying selections during the copy, which has the potential to be tricky as well and probably wouldn't be used enough to be worth the effort (right now, someone could just make the selection and then make a shared copy, which would work fine).
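The extra level of indirection described above can be sketched like this (all names hypothetical, not the memh5 group tree): shared copies hold a reference slot rather than the dataset object itself, so replacing the dataset inside the slot keeps every sharer in sync.

```python
# Hypothetical reference slot owned by the group tree.
class Ref:
    def __init__(self, dset):
        self.dset = dset

# Hypothetical group: maps dataset names to Ref slots.
class Group:
    def __init__(self):
        self.tree = {}

g1 = Group()
g1.tree["vis"] = Ref([1, 2, 3, 4])

# A shared copy points at the Ref slot, not at the dataset object itself.
g2 = Group()
g2.tree["vis"] = g1.tree["vis"]

# Applying a selection replaces the dataset object inside the slot...
g1.tree["vis"].dset = g1.tree["vis"].dset[1:3]

# ...and the shared group still sees the selected data.
assert g2.tree["vis"].dset == [2, 3]
```

Without the slot, g2 would keep pointing at the old, unselected dataset object after the replacement.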

ljgray commented 4 months ago

@ketiltrout I've updated this to simplify the copy pathways and address the to_file issues you mentioned. Now, there should only be two paths:

ljgray commented 4 months ago

@ketiltrout I also removed an extra byte order check, which was required for a now-fixed mpi4py bug.

ljgray commented 3 months ago

Yeah, this looks a lot better! I've made a few suggestions for your docstring.

One potential thing I've thought of:

In the case where g2 is a file, instead of ignoring shared, another option would be to raise a ValueError on a non-empty value. I don't know if that's better or worse than what you've implemented.

A ValueError could help users catch cases where they've misconstrued their call to deep_group_copy, but the downside is that it forces every call to deep_group_copy to be type-aware (i.e., aware of the type of g2), which may be a worse problem.

I've decided to meet in the middle and add a warning. I don't really think we would want this to fail entirely, but it's worth letting the user know that something is happening that they might not expect.

ketiltrout commented 3 months ago

Yeah, that's a good compromise.