ljgray closed this 3 months ago
I have just been looking through this one again and reflecting on what the behaviour should be. It's not clear to me at what level a shared dataset should be shared. Really, the only purpose of sharing is to save memory, so anything that achieves that is fine. I'm not sure about shared datasets maintaining that sharing through redistribution, though. It might also be good to allow shared selections, but that has its own implications.
I think it's valid for a user to expect a shared dataset to remain shared through a redistribution, since that's a common operation, and the memory savings are defeated if we lose them the moment we redistribute.
Selection is tricky, since it would effectively have to replace the `MemDataset` within the group with a new one that has the selections applied. I'm not sure how that would work without changing the way `MemDataset`s are referenced by the group (i.e., rather than the shared dataset just pointing to the same object, it would have to point to a reference in the group tree which in turn references a dataset object, so that if the dataset object changed, the tree reference would still be the same thing). Unless you just mean applying selections during the copy, which has the potential to be tricky as well and probably wouldn't be used enough to be worth the effort (right now someone could just make the selection and then make a shared copy, which would work fine).
@ketiltrout I've updated this to simplify the copy pathways and address the `to_file` issues you mentioned. Now, there should only be two paths:

- In the case where `to_file` is True, dataset names provided in `shared` are ignored.
- Otherwise, datasets named in `shared` point to the same `MemDataset` objects as the original group.

@ketiltrout I also removed an extra byte order check, which was required for a now-fixed mpi4py bug.
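A minimal sketch of the in-memory path (assuming `shared` accepts an iterable of dataset names; the exact signature and defaults may differ from what's in this PR):

```python
import numpy as np
from caput import memh5

g1 = memh5.MemGroup()
g1.create_dataset("vis", data=np.arange(10))
g1.create_dataset("weight", data=np.ones(10))

g2 = memh5.MemGroup()
# Hypothetical call: `shared` given as a list of dataset names.
memh5.deep_group_copy(g1, g2, shared=["vis"])

# The shared dataset should be the very same MemDataset object,
# while other datasets get their own dataset objects in the copy.
print(g2["vis"] is g1["vis"])        # expected: True
print(g2["weight"] is g1["weight"])  # expected: False

# If g2 were an h5py file (the to_file path), the names in `shared`
# would simply be ignored.
```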
Yeah, this looks a lot better! I've made a few suggestions for your docstring.
One potential thing I've thought of: in the case when `g2` is a file, instead of ignoring `shared`, another option would be to raise a `ValueError` on a non-empty value. I don't know if that's better or worse than what you've implemented. A `ValueError` could help users catch where they've maybe misconstrued their call to `deep_group_copy`, but the downside is that it forces any call to `deep_group_copy` to be type-aware (i.e. of the type of `g2`), which may be a worse problem.
I've decided to meet in the middle and add a warning. I don't really think we would want this to fail entirely, but it's worth letting the user know that something they didn't expect is happening.
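Roughly the kind of check I mean (a minimal sketch only; the names here follow this discussion and are not necessarily what ends up inside `deep_group_copy`):

```python
import warnings

import h5py


def _warn_ignored_shared(g2, shared):
    """Warn if `shared` will be ignored because the destination is on disk.

    A sketch of the compromise discussed above, not the actual implementation.
    """
    to_file = isinstance(g2, h5py.Group)  # h5py.File is a subclass of Group
    if to_file and shared:
        warnings.warn(
            f"Destination is a file, so shared datasets {list(shared)} will be ignored."
        )
```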
Yeah, that's a good compromise.
This PR makes the `shared` argument in `MemDiskGroup` actually share the `MemDataset` object, not just the underlying data array. Based on the idea that shared datasets are maintained through operations such as redistribution, axis downselection is not allowed for shared datasets.

It's important to note that the default behaviour of `memh5.deep_group_copy` is (and has been since inception) to only make a shallow copy of each `MemDataset`, i.e. the underlying array is actually shared. This doesn't matter when using the method to copy disk -> memory or memory -> disk, but it does have implications when copying memory -> memory. This probably shouldn't be the default behaviour, but we would have to be extremely careful when making any changes to avoid breaking anything.

I've also updated the docstring for `deep_group_copy` to make note of the copy behaviour.

This lets us remove the `.copy` method in `draco.ContainerBase`, since `deep_group_copy` will properly replicate that behaviour (see radiocosmology/draco#231).

Also, add an option to skip specific datasets when copying.
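To make the shallow-copy note above concrete, here's a minimal sketch (assuming the default memory -> memory behaviour described in this PR; the identity/aliasing checks show the expected outcome rather than verified output):

```python
import numpy as np
from caput import memh5

g1 = memh5.MemGroup()
g1.create_dataset("data", data=np.zeros(4))

g2 = memh5.MemGroup()
memh5.deep_group_copy(g1, g2)  # default: shallow copy of each dataset

# The copy gets its own MemDataset, but the underlying array is shared,
# so memory -> memory copies can alias data unexpectedly.
print(np.shares_memory(g1["data"][:], g2["data"][:]))  # expected: True
```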