ome/ngff: Next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
https://ngff.openmicroscopy.org

"Compound" datasets #37

Open tischi opened 3 years ago

tischi commented 3 years ago

@axtimwalde @constantinpape @joshmoore

Based on the latest posts of @glyg in https://github.com/ome/ngff/issues/28 I was wondering about the following.

Let's say we have, for example, a FLIM (fluorescence lifetime imaging) data set.

I think it could be useful to store it as a 5D data set with these dimensions:

x
y
z
t
c (intensity, lifetime)

In this case, my feeling is that the c dimension is qualitatively different from the other dimensions, because "moving along the c-axis" changes the unit of the output value, which is not the case for any of the other dimensions:

unit(data[0,0,0,0,0]) = grayValue
unit(data[0,0,0,0,1]) = nanoseconds
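To make the mismatch concrete, here is a minimal NumPy sketch (the axis sizes are made up, and the axis order follows the list above, with c last): a single dtype has to serve both "channels", even though their units differ.

```python
import numpy as np

# Hypothetical FLIM volume with axes (x, y, z, t, c), so c is the last
# index: c = 0 -> intensity (grey values), c = 1 -> lifetime (nanoseconds).
# A single dtype (float64 here) must cover both, although the two planes
# carry values with different units.
data = np.zeros((256, 256, 16, 1, 2), dtype="float64")

data[0, 0, 0, 0, 0] = 1000.0  # intensity, in grey values
data[0, 0, 0, 0, 1] = 2.5     # lifetime, in nanoseconds
```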

Do you have any thoughts on this? I mean, should we treat dimensions that change the unit of the output value differently from other dimensions?

axtimwalde commented 3 years ago

Good point! In my mind, c is not really a dimension here, and I would save this as two datasets. But it is also a good argument for supporting compound types. I had a conversation with @SabineEmbacher last week and we both agree that supporting compound types would (a) be useful, and (b) should be done with an annotation similar to N5 Compressions, such that compounds that can be extracted from byte streams can be registered at class-loading time.
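A purely hypothetical Python sketch of that registration idea; the names and byte layout are invented here and not part of any spec. The point is a name-to-decoder registry populated at import time, analogous to N5 registering its Compressions, so a compound value can be extracted from a byte stream:

```python
import struct

# Hypothetical registry: populated when the module is imported,
# analogous to N5 registering Compressions at class-loading time.
COMPOUND_TYPES = {}

def register_compound(name):
    def wrap(cls):
        COMPOUND_TYPES[name] = cls
        return cls
    return wrap

@register_compound("flim")  # invented name, for illustration only
class FlimValue:
    # uint16 intensity followed by float64 lifetime, little-endian
    FORMAT = "<Hd"
    SIZE = struct.calcsize(FORMAT)

    @classmethod
    def from_bytes(cls, buf, offset=0):
        intensity, lifetime = struct.unpack_from(cls.FORMAT, buf, offset)
        return intensity, lifetime
```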

tischi commented 3 years ago

In my mind, c is not really a dimension here, and I would save this as two datasets.

In my mind, these are two datasets as well, but there is the argument for having the ability to store all "channels" in one chunk, for loading efficiency. With the current specification, I think we would have to put them into the same dimension, wouldn't we? In other words, data from different datasets cannot be in the same chunk, right?! @joshmoore
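To make the locality argument concrete, here is a minimal zarr (v2) sketch; the shapes, chunking, and group layout are assumptions for illustration. With two sibling datasets, the intensity and lifetime for a given position always sit in chunks of two different arrays, so a reader needs at least two chunk reads per position:

```python
import zarr

root = zarr.group()  # in-memory store, for illustration

# Two sibling datasets: corresponding intensity/lifetime values for a
# given (z, y, x) live in chunks of two different arrays.
intensity = root.zeros("intensity", shape=(16, 256, 256),
                       chunks=(16, 64, 64), dtype="u2")
lifetime = root.zeros("lifetime", shape=(16, 256, 256),
                      chunks=(16, 64, 64), dtype="f8")

intensity[0, 0, 0] = 1000  # grey value
lifetime[0, 0, 0] = 2.5    # nanoseconds
```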

axtimwalde commented 3 years ago

Storing values along a dimensional axis means that they share a bunch of properties. Technically, they must be of the same type, and depending on how the spec is phrased, we may want to enforce that they also have the same dimensional type and unit. In your concrete example, I speculate that you would like to store intensity in an unsigned integer type (uint16?) and lifetime in a floating point type (float64?). This means they have to be two datasets, or you make a compromise, muddying the waters. I think compound types should be the thing here. I need to educate myself about how such data comes out of Python. There are a bunch of competing approaches in the Python world. Zarr dtype lists:

https://zarr.readthedocs.io/en/stable/spec/v2.html#data-type-encoding

Numpy Array interface:

https://numpy.org/doc/stable/reference/arrays.interface.html#arrays-interface

Numpy dtype:

https://numpy.org/doc/stable/reference/arrays.dtypes.html

which all seem to describe the same thing, but each with a different syntax. None of them is clear about how variable-length data is expressed. The pointer part is moderately obvious (|O or something), but I cannot see where the data itself lives.
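For what it's worth, here is a minimal sketch of what a fixed-size compound looks like on the Python side today (variable-length data set aside); zarr v2 accepts NumPy structured dtypes directly, and the shapes and chunking below are assumptions:

```python
import numpy as np
import zarr

# One structured ("compound") dtype pairing uint16 intensity with
# float64 lifetime, so both fields share every chunk.
dt = np.dtype([("intensity", "<u2"), ("lifetime", "<f8")])
print(dt.descr)  # array-interface style description of the same type

z = zarr.array(np.zeros((16, 256, 256), dtype=dt), chunks=(16, 64, 64))

voxel = np.array((1000, 2.5), dtype=dt)  # one voxel: both fields together
z[0, 0, 0] = voxel
print(z[0, 0, 0]["lifetime"])  # -> 2.5
```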

tischi commented 3 years ago

Technically, they must be of the same type, and depending on how the spec is phrased, we may want to enforce that they also have the same dimensional type and unit.

My gut feeling would be to enforce the same type and unit, at least for the next version of ome.zarr, and to tackle compound data types in later releases.

axtimwalde commented 3 years ago

I agree. So for now this would be two datasets.