ome / ngff

Next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
https://ngff.openmicroscopy.org

total coordinates #199

Open d-v-b opened 1 year ago

d-v-b commented 1 year ago

problem statement

Although we store images as N-dimensional arrays, it's important to remember that images are made of individual measurements / samples, and that in many imaging modalities each sample has its own coordinate metadata. I'm going to use the term "total coordinates" to describe coordinate metadata for each sample of an image; this stands in contrast to coordinate metadata that assigns a single coordinate value to an entire image. OME-NGFF does not support total coordinates. This omission means we are under-describing our images, which impairs our ability to properly analyze, share, and reproduce bioimaging datasets.

examples

I will illustrate this with a few examples:

Note that these are not mutually exclusive scenarios -- a point-scanning system with field curvature may use a fast z-scanner for rapid 3D imaging. If speed demands it, we can make the point-scanning bidirectional, and the z-scanning as well, which means that temporally adjacent samples / planes are no longer spatially adjacent. Encoding "total coordinates" for this image would require storing 3 spatial coordinates and one temporal coordinate for each individual sample.

matrix transformations don't work

I note that matrix transformations completely fail here, because the per-sample coordinates (field curvature in z, bidirectional scan timing in t) are not affine functions of the array indices, so no matrix can express them.

Unsatisfying alternatives include raster-scanning-specific metadata, field-curvature-specific metadata, etc., but this seems like a pretty ugly way to solve a general problem, and it's not easily extensible.

total coordinates in ome-ngff

Is there demand for storing this kind of coordinate in ome-ngff? If not, we can close this issue, with the caveat that we are under-representing our imaging data.

It will surprise no one that my proposed solution is to use the CF dimensions + coordinates model.

This has come up before:

bogovicj commented 1 year ago

Depending on how general what you have in mind is, I think much of this can be covered with the coordinates type of transformation, which I expect will be in v0.5 (and which is close to the CF coordinates model).

It allows one to explicitly (with an array) specify the coordinates at every sample of some other array (or some general coordinate system, but let's ignore that for now).

A pretty general example

Say we have a 2D array (size N x M) and want to encode 4 coordinate measurements at every sample:

We'd store an array of size 4 x N x M where that first dimension holds the [p,t,y,x] coordinates.
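For illustration, a minimal numpy sketch of that layout (the shapes and values here are invented, not prescribed by any spec):

```python
import numpy as np

N, M = 6, 8  # example image shape

# the image itself: one sample per (row, column) position
data = np.random.randint(0, 255, size=(N, M))

# coordinate array of size 4 x N x M: the first dimension holds the
# [p, t, y, x] coordinates of each sample of `data`
coords = np.empty((4, N, M))
coords[0] = 0.0                                   # p: e.g. a plane / phase value
coords[1] = np.arange(N * M).reshape(N, M) * 0.1  # t: per-sample acquisition time
coords[2], coords[3] = np.meshgrid(
    np.arange(N) * 0.2,  # y positions
    np.arange(M) * 0.1,  # x positions
    indexing="ij",
)
```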

Limitations

All coordinates need to be of the same type (usually floating point).

I've spec'd that all coordinates need to be stored in the same array, and I believe I recommend that the coordinate dimension be the fastest varying, since most use cases that I'm aware of benefit from that access pattern.

For example, we could not assign a string-valued coordinate to one dimension, which might be nice if we had a channel dimension, say.

One way we might address that limitation

(if we need to)

A nice way to generalize my spec could be to have the array that stores coordinates be a vector type (we've talked about this before). So for the above example, the coordinate array might instead be N x M and store 4-vectors. If we allow those vectors to be of mixed type, then it would be fine to store something like [0.5, 1.1, "DAPI", 2229] for a coordinate, and that could be nice.
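As a rough sketch of that idea (just an illustration using a numpy structured dtype; the field names are invented and this says nothing about how such vectors would actually be serialized):

```python
import numpy as np

# each sample gets a mixed-type 4-vector of coordinates
coord_dtype = np.dtype([
    ("y", np.float64),   # a float-valued spatial coordinate
    ("x", np.float64),   # another float-valued spatial coordinate
    ("channel", "U16"),  # a string-valued coordinate, e.g. "DAPI"
    ("t", np.int64),     # an integer-valued coordinate
])

N, M = 4, 5
coords = np.zeros((N, M), dtype=coord_dtype)
coords[0, 0] = (0.5, 1.1, "DAPI", 2229)  # the example coordinate from above
```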

If there are use cases for that, I'd be happy to talk through something like this after v0.5 is done.

d-v-b commented 1 year ago

Thanks John, that's super helpful! I missed / forgot about that part of your PR. I will continue the conversation about that particular proposal over in #138.

I'll keep this issue open for general discussion of the idea.

clbarnes commented 1 year ago

Are total coordinates even compatible with chunked storage? Presumably you have to hope that the coordinates are monotonically increasing in each dimension, but even then, you'd potentially have to go fishing to find the chunk any given coordinate is in, and probably store some information about average coordinate spacing to even start looking in the right place.

Also, is this very different to #178, in the limit?

d-v-b commented 1 year ago

> Are total coordinates even compatible with chunked storage? Presumably you have to hope that the coordinates are monotonically increasing in each dimension, but even then, you'd potentially have to go fishing to find the chunk any given coordinate is in, and probably store some information about average coordinate spacing to even start looking in the right place.

I don't see any intrinsic conflict between total coordinates and chunked storage. Xarray basically implements what I'm describing, and they have no issue using dask arrays or zarr arrays to store coordinate variables.
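For instance, here is a minimal sketch (assuming xarray and zarr are installed; the array contents and store path are arbitrary):

```python
import numpy as np
from xarray import DataArray

# a small 2D image with a per-sample (2D) time coordinate
image = DataArray(
    np.random.randint(0, 255, (4, 4)),
    dims=("y", "x"),
    coords={"t": (("y", "x"), np.arange(16).reshape(4, 4) * 0.1)},
)

# the coordinate variable is written as an ordinary zarr array alongside the data
image.to_dataset(name="image").to_zarr("example.zarr", mode="w")
```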

> Also, is this very different to https://github.com/ome/ngff/issues/178, in the limit?

Yes, it is very different. I'm proposing associating each sample of an image with additional data. So for a given image, you can't have more elements in a coordinate variable than samples in your image. But the points described in #178 are not associated with individual samples of an image; they are vectors in continuous space, and you could (in principle) have many more points in an image than you have discrete samples.

clbarnes commented 1 year ago

Don't xarray's coordinates still describe a grid (there may be another feature I'm not aware of)? Even if the spacing in either axis isn't regular, it's still made up of squares. If you had total coordinates (which I'm interpreting as "points aren't necessarily on a grid, although possibly not far off"), how would you decide which chunk to look for a particular point in, when indexing with world coordinates? Around the chunk edges you'd get some bleeding across, where a point voxel-ly in one chunk is world-ly in the next chunk over. Near the corners of a 2D chunk you might have to look in 4 chunks, 8 chunks for 3D, etc. That feels more like a spatial lookup problem, like that in the other issue.

d-v-b commented 1 year ago

Maybe it's easier to express with an example, which simulates a small 2D image from a raster-scanning imaging system (e.g., a confocal microscope, or an SEM):

import numpy as np
from xarray import DataArray

shape = (9,9)
sampling_interval = .1

# cool data that came from the instrument as 2D image
data = np.random.randint(0, 255, shape)

# an array of values simulating spherical distortion in z
z = np.hypot(*np.meshgrid(np.arange(-4,5), 
                               np.arange(-4,5)))

# an array of timepoints that simulates bidirectional scan timing
t = (np.ones(np.prod(shape)) * sampling_interval).cumsum().reshape(shape)
for y in range(shape[0]):
    if y % 2 == 1:
        t[y] = t[y][::-1]

xdata = DataArray(data, 
          dims=('x','y'),
          coords={'x': ('x', np.arange(shape[0]) * .2, {'units': 'nm'}),
                  'y': ('y', np.arange(shape[1]) * .1, {'units': 'nm'}),
                  'z': (('y', 'x'), z, {'units': 'nm'}),
                  't': (('y', 'x'), t, {'units': 'ms'})},
         attrs = {'foo': 'bar'})

The data itself is 2D, because that's what comes out of the instrument, which is why we have 2 dims. But! Each point of the data came from an event in real life (i.e., 4D space). To describe that process more completely we need some way to associate z values and t values with each one of our samples. We accomplish this by defining an array of z values and an array of t values, and adding them in as multidimensional coordinates.

Note that this is not a contrived example -- any sufficiently fast raster-scanning scope is described by this data structure. We just don't typically represent the data from these scopes very precisely.

As far as I can tell, the usefulness of this representation is orthogonal to chunking, or using a coordinate as an index (which of course you could do here, e.g. by using regular numpy operations on the coordinate arrays).
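For instance, continuing the example above (purely illustrative; the 2 ms threshold is arbitrary):

```python
# select the samples acquired in the first 2 ms via the 2D t coordinate
# (xarray broadcasts the (y, x) coordinate against the (x, y) data by name)
early = xdata.where(xdata.t < 2.0)

# or work directly on the coordinate values as a plain numpy array
mask = xdata.t.values < 2.0
```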

d-v-b commented 1 year ago

Maybe coining the term "total coordinates" was confusing, and I should have said "CF conventions / xarray data model", but I worry that this turns people off of the idea 😅

bogovicj commented 1 year ago

@clbarnes

> Don't xarray's coordinates still describe a grid

This is my understanding also, and I would like to know if I'm missing something.

@d-v-b

> I should have said "CF conventions / xarray data model"

My understanding of this proposal is that it is more general than xarray's model (though iirc CF has something more general). Where xarray is not as general as the coordinates transform I mention above is in the way @clbarnes described - it describes an axis-aligned grid. Every discrete slice of a multi-dim array must have the same coordinate value. In your example, every sample in data[0,:] would have x=0 right?

xarray's coordinate model is supported in my PR using a combination of coordinates and by_dimension.

I'm sure it's possible to flatten data and have a coordinate for every sample, but I'm not sure this is a typical or intended use.

d-v-b commented 1 year ago

@clbarnes

> Don't xarray's coordinates still describe a grid

I'm not sure if I understand this question. Maybe the term "grid" is overloaded here, because the image data itself is a grid by definition, and so any array with shape isomorphic to the image data can be said to "describe a grid". I chose the word "isomorphic" here to include the case of 1D coordinate variables like x and y in my example that are effectively a compressed representation of an ND array.

But even if the image data is stored as a grid, that doesn't mean it was actually acquired from a perfect grid. In fact, I think image data is rarely, if ever, acquired from a truly perfect grid in 4D. There is no constraint in xarray that requires the coordinate data represented by a coordinate variable to form a perfect grid.

@bogovicj

> Every discrete slice of a multi-dim array must have the same coordinate value. In your example, every sample in data[0,:] would have x=0 right?

Only because x was defined as a 1D coordinate variable. 2D coordinate variables like t and z do not have this property. Both t and z have a coordinate value for each sample in the data. Note that this is not a contrived scenario -- field curvature and sample acquisition times are real facts about fast raster scanning microscopy.
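To make that concrete with the example above (purely illustrative):

```python
# x is a 1D coordinate variable: one value shared by the whole slice data[0, :]
xdata.x.values[0]          # 0.0

# t is a 2D coordinate variable: a distinct value for each sample in that slice
xdata.t.isel(x=0).values   # nine different acquisition times
```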

> xarray's coordinate model is supported in my PR using a combination of coordinates and by_dimension.

Can you show an example of how you would express the t and z coordinate variables from my example with coordinates and by_dimension?

clbarnes commented 1 year ago

The angle I'm coming from is this:

In all of these cases, you can go from a world-space coordinate to a voxel coordinate and back easily; if you have a world coordinate you don't have to go fishing to find the voxel coordinate and therefore the chunk coordinate (except in the coordinate array, but that's low-dimensionality and so not much of a problem).

Point clouds are stored in a tree of whatever kind; looking up the data for a given world coordinate is considerably harder, and storage is much less efficient.

If I'm understanding you correctly, you're looking for something which sits in between those two extremes: something which is stored like an array, but can be loaded more like a point cloud (where the intensity of a voxel can be looked up by the somewhat-arbitrary coordinates associated with it). This feels like a worst of both worlds situation - you may be able to guess a region of array that any given point might lie in, given its world coordinates, but you don't get the spatial tree's lookup efficiency, or the array's predictability of where that point might be. If a voxel is on the corner of one chunk, the coordinates associated with it might mean that its world coordinates put it closer to the samples in the next chunk(s) over. So which chunk do you load in order to find the intensity, or do you just fish around until you find it?

It totally makes sense to store this data in order to feed into a registration step (which John is much more qualified to speak on, of course!) so that you can then get it into a more access-efficient format (e.g. an axis-aligned grid). But I'm struggling to see an efficient way to use it for lookups more generally.

d-v-b commented 1 year ago

> If I'm understanding you correctly, you're looking for something which sits in between those two extremes: something which is stored like an array, but can be loaded more like a point cloud (where the intensity of a voxel can be looked up by the somewhat-arbitrary coordinates associated with it). This feels like a worst of both worlds situation - you may be able to guess a region of array that any given point might lie in, given its world coordinates, but you don't get the spatial tree's lookup efficiency, or the array's predictability of where that point might be. If a voxel is on the corner of one chunk, the coordinates associated with it might mean that its world coordinates put it closer to the samples in the next chunk(s) over. So which chunk do you load in order to find the intensity, or do you just fish around until you find it?

Look at the example I gave here. That's what I'm looking for. There's nothing in that example that resembles a point cloud. There's a 2D image, with dimensions named y and x; dimension y is associated with a 1D array named y, and dimension x is associated with a 1D array named x. Both of these arrays contain regularly spaced values, but they didn't have to. Alone, these two coordinate variables are sufficient to associate each element of the 2D image with a pair of values that define a position in a 2D space; but as per the description of the example (raster scanning imaging data from real life), we need to specify the z and t values for each sample. This is accomplished via z and t arrays, which are both 2D, because the z and t values vary along the two dimensions of the data. This is a compact description of a single image produced by a raster-scanning microscope. There's nothing to do with point clouds or chunking in this example, so I'm not sure where those come in.

> If I'm understanding you correctly, you're looking for something which sits in between those two extremes: something which is stored like an array, but can be loaded more like a point cloud (where the intensity of a voxel can be looked up by the somewhat-arbitrary coordinates associated with it).

I don't think this portrayal is correct, and furthermore (going back to the xarray code example), the z and t coordinates of individual samples from a microscopy image are anything but arbitrary!

I ask you to consider the following question: given that raster-scanning images are 2D, but each sample of the image is acquired sequentially (i.e., at different times), and at different axial positions (because of field curvature), how would you represent the image data + x + y + z + t coordinates as a data structure? That's the central problem here. At this level of detail, it has nothing to do with point clouds or chunks or even storage.

bogovicj commented 1 year ago

> Can you show an example of how you would express the t and z coordinate variables from my example with coordinates and by_dimension?

Happy to. We'll need four zarr arrays (xCoords, yCoords, zCoords, and tCoords):

First we need to define coordinate systems:

```json
{
  "coordinateSystems": [
    {
      "name": "/my/data",
      "axes": [
        {"name": "x", "type": "array"},
        {"name": "y", "type": "array"}
      ]
    },
    {
      "name": "out",
      "axes": [
        {"name": "x", "type": "space", "unit": "nm"},
        {"name": "y", "type": "space", "unit": "nm"},
        {"name": "z", "type": "space", "unit": "nm"},
        {"name": "t", "type": "time", "unit": "ms"}
      ]
    }
  ]
}
```

Then, using only coordinates and byDimension, transform the array's coordinates to the 4D output coordinates:

{
    "coordinateTransformations": [ 
        {
            "type": "byDimension",
            "input": "/my/data",
            "output": "out",
            "transformations" : [
                { "type": "coordinates", "path": "xCoords", "input": ["x"], "output": ["x"]},
                { "type": "coordinates", "path": "yCoords", "input": ["y"], "output": ["y"]},
                { "type": "coordinates", "path": "zCoords", "input": ["y","x"], "output": ["z"]},
                { "type": "coordinates", "path": "tCoords", "input": ["y","x"], "output": ["t"]},
            ]
        }
    ]
}

As you'd expect, it mirrors the xarray code pretty closely. For example, the line
`{ "type": "coordinates", "path": "zCoords", "input": ["y","x"], "output": ["z"]}` communicates literally the same information as `'z': (('y', 'x'), z, {'units': 'nm'})`.

For this example, I'd prefer using a scale transformation for x and y like this:

{
    "coordinateTransformations": [ 
        {
            "type": "byDimension",
            "input": "/my/data",
            "output": "out",
            "transformations" : [
                { "type": "scale", "scale": [0.2, 0.1], "input": ["x", "y"], "output": ["x", "y"]}, 
                { "type": "coordinates", "path": "zCoords", "input": ["y","x"], "output": ["z"]},
                { "type": "coordinates", "path": "tCoords", "input": ["y","x"], "output": ["t"]},
            ]
        }
    ]
}
d-v-b commented 1 year ago

Thanks @bogovicj, that example is super helpful!

@thewtex how does this look to you re: xarray compatibility?

clbarnes commented 1 year ago

I think we've been talking at cross purposes a bit - I do understand the information you're trying to express, and that is probably the right way to do it (a separate 2(+channel)D array). I guess I was getting ahead of myself and thinking about how such data would be used, but that is out of scope.