Closed: bmaranville closed this 1 year ago
Is there a functional requirement to require h5py>=3.7 for the code to work? Otherwise I'd change that to 3.1 to allow python 3.6 to install it.
I think the version requirement is to support track_order on the root group, so that the datasets are loaded in the order they were saved (see https://github.com/h5py/h5py/issues/1097).
Do we care if the order of the datasets is preserved? I can't remember if keeping order is part of the spec.
Actually it looks like the workaround described in the h5py issue above (which we are already using) works fine back to h5py=3.1 and python=3.6 (I just tried out that combination).
So we can remove that version requirement for h5py.
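For reference, the workaround from that h5py issue can be sketched like this (file, group, and dataset names here are just illustrative):

```python
import h5py
import numpy as np

# Request tracked (insertion) order explicitly on the file and group,
# rather than relying on the h5py default; this is the workaround
# discussed in h5py issue #1097 and works back to h5py 3.1.
with h5py.File("ordered.h5", "w", track_order=True) as f:
    g = f.create_group("entry1", track_order=True)
    for name in ["Qz", "R", "dR", "dQz"]:
        g.create_dataset(name, data=np.zeros(3))

with h5py.File("ordered.h5", "r") as f:
    names = list(f["entry1"])  # comes back in insertion order
```

Without `track_order=True`, iteration would fall back to name order (which would put `dQz` before `dR` here).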
> Do we care if the order of the datasets is preserved? I can't remember if keeping order is part of the spec.
I don't think that was ever specified. In the text file this is always the case. I don't see the need for it in the NeXus file, as all headers are present in each dataset. Maybe it's nice if you want to preserve plotting order.
What does everyone think the merge process for this looks like? There may be a desire for a bit of churn in this kind of file as we experiment and learn.
Can we make backwards incompatible changes after we've already made a release, if we think we've improved the state of the code? I'm happy for there to be churn and for those backcompat breaks to exist.
I'm interested in starting to use an HDF representation because I've got some data that needs a detailed resolution kernel, and I'd like to investigate using ORB to do this.
A major breaking change from this that I can foresee is a proposed re-ordering of the dataset arrays on the orsopy side (the representation in the .ort and .orb files would be unchanged), as described in #107, which specifically addresses your use case of multidimensional "columns".
By the end of the comments on that PR I think there was at least tentative agreement on a way forward (use lists of "columns" to hold the data instead of a 2-d array with column as the 2nd index in the orsopy structure).
I agree that we might need to go through a bit of churn at the outset in order to get stable support for the HDF5 structure in the long term.
I read through #107, and I'd like to understand more clearly. What do we mean by "lists of columns"? From an HDF point of view it makes little sense to store a 2-D array; that is a limitation of ASCII.
For .orb does "lists of columns" mean creating a new dataset for each of the 'columns'?
e.g. something along the lines of:
```python
import h5py
import numpy as np

Qz = np.linspace(0.01, 0.5, 101)
R = np.random.random(101)
dR = np.random.random(101)
# dQz is multidimensional, and is the shape I'd like to reconstitute
dQz = np.random.random((101, 2, 51))

with h5py.File(fname, "w") as f:
    f.create_group("entry1/data")
    f.create_dataset("entry1/data/Qz", data=Qz)
    f.create_dataset("entry1/data/R", data=R)
    f.create_dataset("entry1/data/dR", data=dR)
    f.create_dataset("entry1/data/dQz", data=dQz)
```
? With the 'column' names then listed appropriately (a la NeXus/NXcanSAS) in attributes (that could also specify ordering of how they're supposed to be loaded)?
I guess there is a further issue: how those arrays are made available through `OrsoDataset.data`. At the moment `OrsoDataset.data` is supposed to be `(npnts, ncols)`. A 2-D array is not sensible if one wants to store multidimensional data. Could we change the `data` attribute to be `Union[dict, np.ndarray]`? If it's a dict then the keys are the column names and the values are the (multidimensional) arrays. If it's an `np.ndarray` then it's what we have now. The dict would be writeable to .orb, and a subset to .ort. The `np.ndarray` would be writeable to either.
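As a concrete sketch of that dict form (array contents here are just placeholders):

```python
import numpy as np

# Illustrative dict form for OrsoDataset.data: keys are column names,
# values are (possibly multidimensional) arrays.
data = {
    "Qz": np.linspace(0.01, 0.5, 101),
    "R": np.random.random(101),
    "dR": np.random.random(101),
    "dQz": np.random.random((101, 2, 51)),  # multidimensional column
}

# The full dict would be writeable to .orb; only the 1-d entries fit
# the .ort column layout.
ort_subset = {k: v for k, v in data.items() if v.ndim == 1}
```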
There are two considerations: how the data is stored in the file formats (.ort and .orb), and how the data is represented in the orsopy python objects. The proposal in #107 changes only the python objects. In the .orb files, as you suggest above, I believe it makes sense to create a separate HDF5 dataset (not to be confused with an orsopy `OrsoDataset`) for each column in the data, and I don't believe that to be controversial.
For the python objects, instead of storing a numpy ndarray of shape `(npnts, ncols)` in `OrsoDataset.data`, what is suggested in #107 is that a python list of numpy ndarray objects (one per "column") is stored. A minimal validation is done to ensure that the first dimension has the same length for each object. In the simple case where all arrays are one-dimensional, a single combined 2-d `numpy.ndarray` could also be accepted, though it would make sense to use a shape of `(ncols, npnts)` instead, so that the order of dimensions is the same for `List[ndarray]` and a simple 2-d ndarray. Support for this is trivial when the list of column data is unpacked (`for data in columndata: ...`) and is included in #107.
So what is proposed is that the type of `OrsoDataset.data` be `Union[List[numpy.ndarray], numpy.ndarray]` (where the second form should have 2 dimensions). We might want to just have one type for the stored data, but automatically convert an array input to a list of arrays in `__init__`.
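A minimal sketch of that validation (the function name is hypothetical, not part of orsopy):

```python
import numpy as np

def as_column_list(data):
    # Hypothetical helper: accept either a list of per-column arrays
    # or a single 2-d array of shape (ncols, npnts), and return a
    # list of columns whose first dimensions all match.
    if isinstance(data, np.ndarray):
        if data.ndim != 2:
            raise ValueError("a plain ndarray must be 2-dimensional")
        data = list(data)  # split along the first (column) axis
    npnts = data[0].shape[0]
    if any(col.shape[0] != npnts for col in data):
        raise ValueError("columns must share the same first dimension")
    return data

columns = as_column_list([
    np.linspace(0.01, 0.5, 101),  # Qz
    np.zeros(101),                # R
    np.zeros((101, 2, 51)),       # multidimensional dQz
])
```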
For .ort files, we could support multidimensional columns by adding new metadata to the column description (e.g. name="angular_resolution" index=[1]), but this is not implemented yet.
I think I'd also be ok with a dict as you suggested, but currently column metadata is stored as a list, and I was trying to make as small a change as possible. During the write process to an HDF it is converted to exactly the dict you describe, but also the "sequence_index" is added as an attribute so it can be reconstructed into a list on read.
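A sketch of that write/read round-trip (group layout and file name are illustrative, not the exact orsopy implementation):

```python
import h5py
import numpy as np

columns = {"Qz": np.linspace(0.01, 0.5, 5), "R": np.zeros(5)}

# Write: one HDF5 dataset per column, with a "sequence_index"
# attribute recording its position in the original list.
with h5py.File("roundtrip.h5", "w") as f:
    g = f.create_group("entry1/data")
    for i, (name, arr) in enumerate(columns.items()):
        ds = g.create_dataset(name, data=arr)
        ds.attrs["sequence_index"] = i

# Read: rebuild the list order from the attribute.
with h5py.File("roundtrip.h5", "r") as f:
    g = f["entry1/data"]
    ordered = sorted(g, key=lambda name: g[name].attrs["sequence_index"])
```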
Thanks for that clear explanation. I get what's being proposed now. List of columns, yeah let's go.
What's the roadmap to going forwards then? Is it merge #107 then this?
@bmaranville @aglavic do you think we can merge this PR? Or are there issues that need to be addressed before it gets merged? Does this PR get merged simultaneously with #107? I can commit to review that.
We can do this one first, if you like. The changes for #107 are pretty small and can be done after.
> Can we make backwards incompatible changes after we've already made a release, if we think we've improved the state of the code? I'm happy for there to be churn and for those backcompat breaks to exist.
My take on this is that we should merge this as soon as it's ready (maybe now) and make a minor revision. I don't see an issue with breaking changes if they only concern the binary representation, as we should not release that officially yet. My assumption is that 90% of the people implementing the binary standard within the next couple of years are within this project, so a breaking change could be fixed quite easily.
What I would try to avoid is a change that breaks the text format or the reading of important header information through orsopy (e.g. changing keys or the orsopy main class interface attributes). But I don't remember any of the conversations pointing towards such an issue.
@aglavic, @bmaranville I propose that we merge this as-is, then make Artur's suggested changes afterwards. It'd be good to get some movement going again.
This version of the PR is separate from #107, and implements a NeXus save/load with the current data order.
I added some tests as suggested by @andyfaff and it roundtrips the data structures successfully.
We can probably delete the previous PR #97 which also implemented the change of data order; as @aglavic pointed out that is a breaking change and we've already released v1.0, so we need to come to an agreement on that issue separately.