google / neuroglancer

WebGL-based viewer for volumetric data

What are correct metadata attributes when adding a zarr array #333

Open unidesigner opened 3 years ago

unidesigner commented 3 years ago

I have an image layer with prechunked data loaded with x,y,z dimensions at 4,4,40 nm voxel resolution. I want to add a zarr volume which is stored in z,y,x orientation with 40,8,8 nm voxel resolution. I get the following source info after adding it:

(screenshot of the Neuroglancer source info panel, 2021-07-22)

where the zarr array has the following .zattrs

{
    "_ARRAY_DIMENSIONS": ["z", "y", "x"],
    "offset": [
        0,
        0,
        0
    ],
    "resolution": [
        40,
        8,
        8
    ],
    "scale": 255
}

.zarray info

{
    "chunks": [
        15,
        304,
        304
    ],
    "compressor": {
        "id": "gzip",
        "level": 5
    },
    "dtype": "|u1",
    "fill_value": 0,
    "filters": null,
    "order": "C",
    "shape": [
        7063,
        67072,
        124416
    ],
    "zarr_format": 2
}

What is the correct way to do this? Perhaps I can add the resolution and bounds information directly to the metadata files.

jbms commented 3 years ago

As you found, specifying _ARRAY_DIMENSIONS in the .zattrs file works. That provides compatibility with xarray. However, neuroglancer currently does not support a way to specify the units or a non-zero offset for a zarr array, as I was not aware of any standard representation for that information, and was hesitant to invent one.
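
For reference, a minimal sketch of writing that attribute with zarr-python (the path is just a placeholder):

import zarr

# Open an existing zarr array and record xarray-style dimension names in .zattrs.
arr = zarr.open("example.zarr", mode="r+")
arr.attrs["_ARRAY_DIMENSIONS"] = ["z", "y", "x"]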

unidesigner commented 3 years ago

I am also not aware of any standard representation of that information, but I'd like to propose one. I don't think adding specific fields/attributes would break readers in other tools. Why not use _ARRAY_OFFSET and _ARRAY_UNITS? For the units one would also have to specify the length unit in some way, e.g. as [(8, 'nm'), (8, 'nm'), (40, 'nm')]. Alternatively, the numeric scales could go in _ARRAY_SCALES and the unit strings in _ARRAY_UNITS as ['nm', 'nm', 'nm'], which would probably be more consistent. What do you think?
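
As a concrete sketch only (these attribute names are the proposal above, nothing reads them today), the metadata for the array in question could be written like this with zarr-python:

import zarr

arr = zarr.open("example.zarr", mode="r+")  # placeholder path
# Hypothetical attributes following the proposal; offset in voxels, one scale and unit per dimension.
arr.attrs.update({
    "_ARRAY_DIMENSIONS": ["z", "y", "x"],
    "_ARRAY_OFFSET": [0, 0, 0],
    "_ARRAY_SCALES": [40, 8, 8],
    "_ARRAY_UNITS": ["nm", "nm", "nm"],
})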

We could double check with the zarr community. I can try to submit a pull request if we reach a consensus.

jbms commented 3 years ago

I think it would indeed be helpful to check with the zarr community to try to reach some consensus on these attributes.

Note that there is the additional issue of non-scalar and multi-field dtypes, such as [["a", "<u2", [4, 5]], ["b", "<i4", [3]]].

Neuroglancer currently doesn't support that but in general it might be desirable to specify names, units, and offsets for those inner dimensions as well. Or maybe inner dimensions are sufficiently rare that it doesn't matter.

Within Neuroglancer offsets would be fairly simple to support, but for a library like tensorstore, should the offset affect indexing operations as well, or solely be used for visualization? If it affects indexing, then it would probably belong in .zarray rather than in .zattrs.

unidesigner commented 3 years ago

Okay, let us post an issue over at the zarr community once we come up with a reasonable proposal. I am sure they have thought about it as well, so I don't expect it to be considered for their core specs; it might just be an OK for root field names that do not interfere with anything else.

The relevant docs in zarr for structured data types are here. Do you have a particular dataset in mind where you have come across this? How would you imagine this being supported in Neuroglancer? If subfields are multidimensional, such as z in [["x", "<f4"], ["y", "<f4"], ["z", "<f4", [2, 2]]], I assume one would need to specify in the render code how to map e.g. the [2, 2] array to an emitted color?
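
For concreteness, a small sketch of such a structured dtype in zarr-python (shape, chunking and field names are made up), relying on zarr's use of the numpy data model:

import numpy as np
import zarr

# Structured dtype with a multidimensional sub-field, as in the example above.
dtype = np.dtype([("x", "<f4"), ("y", "<f4"), ("z", "<f4", (2, 2))])
data = np.zeros(1000, dtype=dtype)
arr = zarr.array(data, chunks=100)
print(arr[0]["z"].shape)  # (2, 2) sub-array per element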

Numpy has a good way to specify time units for datetime64 dtypes, e.g. "<M8[ns]". So we could use this approach as well for standardized length units (nm, m, ...)?
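
For comparison, this is how numpy bakes the unit into the dtype string itself (shown only as an analogy; numpy has no length-unit dtypes):

import numpy as np

dt = np.dtype("<M8[ns]")       # datetime64 with nanosecond resolution
print(dt)                      # datetime64[ns]
print(np.datetime_data(dt))    # ('ns', 1): the unit is part of the dtype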

Hmm, regarding the question of supporting offsets for indexing operations in libraries like tensorstore - I think it might be quite a dogmatic topic. I would tend to keep the offset only for visualization use cases for now, and keep the assumption of always having zero-offset indexing operations in tensorstore. So it would go into .zattrs. Btw, the funlib.geometry package implements a Roi class for physical indexing operations which also includes offset information.

Can you make a proposal of how you would specify the information we just discussed in .zattrs to make it optimally compatible with Neuroglancer?

jbms commented 3 years ago

As far as structured data types go, no, I haven't seen them used much myself, and in general the interleaved storage they result in is probably in most cases less desirable than columnar storage. But I would suppose that if they are to be officially supported by zarr then the issue should be addressed, as I imagine there are some users of structured dtypes.

Neuroglancer currently supports "m", "s", "Hz" and "rad/s" (along with all of the SI prefixes) as units. Mostly I think just "m" (for meters) and "s" (for seconds) are used. Possibly in the future it could support more, e.g. knowing that "meters" and "m" are the same thing. Units are tricky, though; on the one hand it is pretty important for neuroglancer to support SI prefixes at least, so that it can display "nm" and "um" at appropriate relative scales and show scale bars using SI prefixes, but on the other hand it would be nice to allow arbitrary unit identifiers to support varied use cases. I have taken a look at the units syntax used by the udunits2 library (https://www.unidata.ucar.edu/software/udunits/udunits-2.1.24/udunits2lib.html#Syntax), and it seems like a reasonable standard. However, udunits2 does depend on a units database; while users can define their own custom units, that is not really recommended by the udunits2 author.

I would suggest that zarr use a representation like:

"units": [[450, "nm"], [450, "nm"], [1, "um"], [1, ""], null]

The units for each dimension would be specified by either null, to indicate unspecified unit, or a two-element array specifying the multiplier and the base unit. The syntax for base units could be left unspecified, but with udunits2 syntax suggested. A dimension-less unit would be indicated by [1, ""]. If the units aren't specified, it would be equivalent to specifying [null, null, null, null, null].
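
A sketch of how that could be written into .zattrs with zarr-python (the "units" attribute here is only this proposal, not an existing convention; the path and dimension names are made up, and None becomes JSON null):

import zarr

arr = zarr.open("example.zarr", mode="r+")  # placeholder path
arr.attrs["_ARRAY_DIMENSIONS"] = ["x", "y", "z", "c", "t"]
# Per dimension: [multiplier, base unit], [1, ""] for dimensionless, or None for unspecified.
arr.attrs["units"] = [[450, "nm"], [450, "nm"], [1, "um"], [1, ""], None]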

As far as offsets: note that tensorstore itself supports non-zero origins, and for the neuroglancer precomputed format there is a non-zero origin via the "voxel_offset" attribute. I do think it might be somewhat confusing to have an integer offset vector for visualization but then not respect it when accessing the array programmatically; however, I can see how that would be problematic for zarr-python to support, and also there would be the backward compatibility issue that older versions of zarr would ignore the offset attribute even if newer versions respected it. If the offset is purely for visualization purposes, it might instead make more sense to allow non-integer offsets or a full affine transform matrix.

Currently zarr seems designed to pretty closely follow the numpy array data model, and I can see that there may be hesitancy to officially support non-zero origins and units as they are not part of the numpy array data model.

jbms commented 3 years ago

Note: An alternative could be to use a string to encode both the multiplier and the unit, e.g. "nm" or "4nm" or "4" (for dimensionless unit). That would be more consistent with the udunits2 syntax, but would require slightly more complicated parsing.
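
A minimal sketch of parsing that combined string form (a hand-rolled split, not udunits2 itself; a real implementation would defer to a proper units parser):

import re

def parse_unit(s):
    """Split strings like '4nm', 'nm' or '4' into (multiplier, base_unit)."""
    m = re.fullmatch(r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)?\s*(.*)", s.strip())
    multiplier = float(m.group(1)) if m.group(1) else 1.0
    return multiplier, m.group(2)

print(parse_unit("4nm"))  # (4.0, 'nm')
print(parse_unit("nm"))   # (1.0, 'nm')
print(parse_unit("4"))    # (4.0, '')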

d-v-b commented 3 years ago

I suspect spatial metadata will be ruled out of scope for zarr-python. A related question came up here: https://github.com/zarr-developers/zarr-specs/issues/50, and the decision was made to punt the issue over to a higher abstraction level (the nascent OME-Zarr specification).

For my own purposes, I have been happy with expressing spatial metadata in Zarr / N5 containers as follows:

{
    "transform": {
        "axes": [
            "z",
            "y",
            "x"
        ],
        "scale": [
            5.24,
            4.0,
            4.0
        ],
        "translate": [
            0.0,
            0.0,
            0.0
        ],
        "units": [
            "nm",
            "nm",
            "nm"
        ]
    }
}

...but I am not pushing this as a standard, because I'm hoping something more encompassing comes out of community developments to standardize this kind of stuff: see https://github.com/ome/ngff/issues/28

jbms commented 3 years ago

Adding support for OME/ngff metadata to neuroglancer would also be reasonable, I think. I think it is a bit unfortunate that ome/ngff is rather narrowly focused on bio imaging, e.g. in that xyztc is specifically baked in, though it seems that it is moving in the direction of being more general.

d-v-b commented 3 years ago

Yes, we are making a constant effort to un-bake the magic dimension names from the spec :)

unidesigner commented 2 years ago

Thanks @d-v-b for pointing out the discussion around ome/ngff. There seems to be a lot going on, and it is very nice that a lot of tools already support ngff or intend to. I think it would make sense to contribute some of the discussion/proposals here towards augmenting the proposal for axes labeling in the ngff specification, and then go about adding support for OME/ngff to Neuroglancer, instead of another special-purpose zarr metadata spec that is not supported by a wider community.

A few points for discussion may be:

  1. specification/support of structured data types
  2. overall structure of the JSON: more like the unit proposal from @jbms, or @d-v-b's proposal with separate scale and units fields holding arrays.
  3. the proposal by @jbms about using '' and null for dimensionless and unspecified units
  4. what format to use for units, e.g. using udunits2 to specify the unit strings
  5. I saw that there is a proposal for adding a spec for sequences of spatial transforms. And @jbms made the point about possibly supporting affines for visualization purposes. I'd find it super nice and useful to have a way to specify an affine for a dataset for visualization purposes, i.e. the affine is applied when rendering the image data in Neuroglancer. I was wondering whether the planned transform specification could be the suitable place to store/look up this affine, or whether it would be more closely related to the translate field you have above. There is also the problem, for arrays with non-spatial dimensions, of how to select the relevant axes to apply the affine to. I think there was some discussion about it in the ngff issue, but I haven't seen a conclusive solution yet.
  6. how a translate field (in world units) would impact indexing operations e.g. in tensorstore. I haven't checked in detail, but is the idea to support indexing operations in world space/units, vs. the index space of the array data itself?

jbms commented 2 years ago

Thanks for your thoughts @unidesigner.

Re 1: What do you mean by "structured data types"? Numpy already supports a form of structured data type (https://numpy.org/doc/stable/user/basics.rec.html), which is available in zarr since zarr uses the numpy data model. However, I don't think it is particularly widely used, and columnar rather than the existing interleaved ("row") storage would probably be more suitable for most purposes. Currently Neuroglancer does not support these data types, though. Is that what you are asking about?

Re 5: For specifying affine transforms, I think the simplest solution is to just specify a full affine transform matrix, with N rows and N+1 columns, where N is the total number of dimensions. You can just use a block-diagonal or permuted block-diagonal matrix if some of the dimensions don't "participate" in the transform. That is the approach used by Neuroglancer, at least in its UI and state representation. Neuroglancer internally imposes some restrictions on affine transforms that are supported, basically that the 3 display dimensions must be independent of the non-display dimensions, and shows an error message if an affine transform is specified that does not satisfy these restrictions.
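
For illustration, a sketch of such a matrix for a 4-d (c, z, y, x) array in which the channel dimension does not participate (the numbers and the apply_affine helper are just for illustration):

import numpy as np

# N rows and N+1 columns; the last column is the translation.
affine = np.array([
    [1.0, 0.0,  0.00, 0.00,   0.0],  # c -> c (identity, channel not transformed)
    [0.0, 1.0,  0.00, 0.00, 100.0],  # z -> z + 100
    [0.0, 0.0,  0.98, 0.05, -12.0],  # y mixes y and x (small in-plane rotation)
    [0.0, 0.0, -0.05, 0.98,  30.0],  # x mixes y and x
])

def apply_affine(matrix, point):
    """Map a (c, z, y, x) coordinate through the affine transform."""
    return matrix @ np.append(np.asarray(point, dtype=float), 1.0)

print(apply_affine(affine, (0, 5, 10, 20)))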

Re 6:

I think we can distinguish a few different cases:

a. A non-zero integer origin. This commonly comes up when working with EM volumes. For example, we may produce an alignment with a given coordinate space, and then work with rectangular subvolumes from this space. Here it is natural to say our subvolume is an array with a non-zero origin, and index it based on the original coordinate space rather than the shifted coordinate space. This integrates quite well with normal zero-origin array indexing semantics.

b. Arbitrary affine transforms and/or non-integer translations. Here we can no longer use integer indexing, and we run into issues of rounding, interpolation, etc. I don't think it makes sense to try to handle this in the same way as regular indexing, and I consider this out of scope for tensorstore.

c. Multi-resolution data, where the resolutions are related by integer or (likely small-denominator) rational factors and offsets. This is somewhere between (a) and (b). Integer indexing still makes sense and I think there are a lot of common use cases for this. However, we do have to deal with issues of rounding. I have been looking into supporting this in tensorstore, but it is not yet ready.
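
As a sketch of case (a), assuming tensorstore's dimension-expression API (ts.d with translate_to), an in-memory cutout can be given the origin it has in the aligned volume:

import numpy as np
import tensorstore as ts

# A 100^3 cutout whose corner sits at (2000, 3000, 4000) in the aligned volume.
cutout = ts.array(np.zeros((100, 100, 100), dtype=np.uint8))
cutout = cutout[ts.d[0, 1, 2].translate_to[2000, 3000, 4000]]

print(cutout.domain)  # origin is now [2000, 3000, 4000]
# Indexing uses the original coordinate space rather than the shifted one.
patch = cutout[2000:2010, 3000:3010, 4000:4010].read().result()
print(patch.shape)    # (10, 10, 10)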

unidesigner commented 2 years ago

Thanks for the response, @jbms - I will cross-link this issue with the relevant ome-zarr issue 35 with the hope of cross-pollination.

Re 1: Yes, this is exactly what I meant. I also think that it is not widely used, and using alternative, specialized columnar storage options makes more sense. I was just thinking of how the spec for a "simple" example, e.g. storing a point cloud in zarr using a 2d array (points vs. xyz-coordinates), would look. The axes definition we have been discussing so far is specifically related to regularly-spaced, homogeneous arrays. For the point cloud example it would require some way to specify metadata (labels, dtype etc.) for the columns, which does not fit this model very well. E.g. you could specify the axes labels as points and coordinates, but other fields like units do not make much sense. So this is probably out of scope for the ome-ngff spec, but I saw some discussion of supporting ROIs. In any case, if NG were to support a columnar data format, which one would it be? If I understand correctly, for your annotation collections you basically specified your own "columnar" serialization format, and I was thinking that supporting a more widely used format (where IO libraries exist in different languages), like Apache Avro, might be useful. I guess there are limitations to existing formats for your purposes/use cases, so I'm curious to understand which ones these were.

Re 2: Apparently the same question already came up here. Not sure how to decide this, but personally I like the approach taken by @d-v-b or the precomputed info file a bit more.

Re 4: I looked at the udunits2 package, which seems quite comprehensive. I found another concise and comprehensive overview of unit labels in the Python pint library.

Re 5: Yes, I like the approach of specifying a full affine transform including non-spatial dimensions as well. And instead of storing this affine serialized as a string in the metadata, a reference to a zarr dataset containing the affine may be useful. But the answer to the question of how to choose the display dimensions is still not clear to me. I don't think implicit matching based on the name/label (x,y,z) is generic enough, and one should be able to specify the mapping more explicitly. This also depends on the specific visualization tool, but perhaps it could be done in a generic way? The tool also needs to provide a way to define how the content of the non-spatial/non-display axes is rendered.

In the NG Python bindings, I figured that I had to specify dimensions = neuroglancer.CoordinateSpace(names=['c^', 'z', 'y', 'x'], units=['', 'nm', 'nm', 'nm'], scales=[1, 40, 8, 8]) to do the right thing. The channel dimension is of length 3, so I could define a shader using this information. I feel like there could be a more principled approach to doing this.
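
For reference, roughly how that fits together end-to-end with the Python bindings (a sketch; the random data and layer name are placeholders):

import numpy as np
import neuroglancer

viewer = neuroglancer.Viewer()

# 3-channel uint8 volume stored as (c, z, y, x).
data = np.random.randint(0, 255, size=(3, 64, 128, 128), dtype=np.uint8)
dimensions = neuroglancer.CoordinateSpace(
    names=['c^', 'z', 'y', 'x'],
    units=['', 'nm', 'nm', 'nm'],
    scales=[1, 40, 8, 8])

with viewer.txn() as s:
    s.layers['image'] = neuroglancer.ImageLayer(
        source=neuroglancer.LocalVolume(data=data, dimensions=dimensions))

print(viewer)  # prints the viewer URL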

Re 6: Yes, thanks for clarifying the different cases.

a. While I understand the convenience, I must say that this confused me initially, as I always thought that index/array-space and world-space are two entirely different things, and that the connection is only via a spatial transform (scaling, translation, affine). The semantics you describe, implemented with the integer voxel_offset, fall somewhere in between. For instance, @d-v-b's non-integer translate is in physical units, whereas the voxel_offset is in dimensionless "index" units. So a natural question arises: how would they work together, e.g. when you specify both voxel_offset AND a non-integer translate (or an affine in the more generic case)?

b. Yes, rounding, interpolation etc. can become quite tricky, and I understand that this is out of scope for tensorstore. However, is it correct that you already implemented this in Neuroglancer (e.g. arbitrary cutting-plane angles for visualization require it, right)? Do I understand correctly that an affine (with the restrictions you outlined) is then in principle already supported by Neuroglancer, and it's just a matter of setting the affine from the metadata if it were specified in a compatible way?

c. Can you briefly clarify what you mean by the common use cases for this? There seems to be quite a bit of parallel work regarding multi-scale data, e.g. in xarray. I hope the ome-zarr spec will be compatible with the precomputed format multi-scale spec in order to make it work (almost) out-of-the-box with Neuroglancer.

jbms commented 2 years ago

Re Re 1: To be clear, by "columnar" data storage (and I believe this is consistent with standard database terminology, https://en.wikipedia.org/wiki/Column-oriented_DBMS), in the context of dense array storage, I mean that, if for example we have a regular x-y grid of points, and we want to store two different variables, like "temperature" and "pressure", or "red" and "green" from a microscope, then we would store chunks containing just temperature, chunks containing just pressure, chunks containing just red, and chunks containing just green. That means when compressing we just have a sequence of "temperature" values, which will likely compress better than interleaved temperature and pressure values, and if we only need to read "temperature", then we don't have to read "pressure" as well, so we only need to process half as much data. We can already achieve this by storing each variable as a separate zarr array, where all of the arrays have the same dimensions. Potentially, though, there could be some specification by which we could indicate more explicitly that (some of) the dimensions are common between these different arrays. In contrast, "row" storage would be achieved by using a single zarr array with a structured dtype like [["temperature", "<f4"], ["pressure", "<f4"]], or by adding an extra dimension of size 2 and chunk size 2. For a point cloud, where we have a sequence of points each with "x" and "y" coordinates, and perhaps some other variables like "color", column storage would mean we store chunks containing 1-d arrays of "x" values, chunks containing 1-d arrays of "y" values, and chunks containing 1-d arrays of "color" values. "Row" storage would mean storing chunks that contain "x", "y", and "color" values together.
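
A small sketch of the two layouts in zarr-python (variable names, shapes, and chunking are made up):

import numpy as np
import zarr

root = zarr.group()  # in-memory group

# Columnar: one array per variable, all sharing the same grid dimensions.
root.zeros("temperature", shape=(512, 512), chunks=(128, 128), dtype="<f4")
root.zeros("pressure", shape=(512, 512), chunks=(128, 128), dtype="<f4")

# Row-oriented: a single array with a structured dtype, values interleaved per element.
row_dtype = np.dtype([("temperature", "<f4"), ("pressure", "<f4")])
root.array("combined", np.zeros((512, 512), dtype=row_dtype), chunks=(128, 128))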

It is true that for point cloud storage, "units" as we have been discussing is not a useful attribute. Instead, you may wish to associate units with the data type itself, e.g. you would say that the data type of "x" is "float32" and specifies a value in units of "4nm", while the data type of "y" is float32 and specifies values in units of "6nm". That would have to be specified with a different attribute than the "units" attribute we were discussing. xarray has the concept of a "coordinate array", where you say that the coordinates along one dimension correspond to values of another 1-d array. However, typically there is just one coordinate array. For the point cloud data we would have both the "x" coordinate array and the "y" coordinate array associated with the "color" array.

The neuroglancer precomputed annotation format is actually a "row" format, not a "columnar" format, since we store multiple variables together. As far as annotation storage, for small amounts of data the format does not much matter, since we can afford to load it all in memory, but for large amounts of data, to be able to display it interactively, it needs to be indexed suitably to allow the viewer to retrieve just the relevant portion needed for the current view. For that reason the precomputed annotation format indexes the data using a multi-scale spatial index as well as "relationship" indexes. There is a distinction between the high-level overall organization of the data, which determines which access patterns may be supported efficiently, and the lower-level encoding of individual chunks of the data, where each chunk will always be read in its entirety. For the lower-level encoding, it is perhaps a bit unfortunate that the neuroglancer precomputed annotation format has its own custom simple binary format --- using an established format like avro, apache arrow, etc. would probably have been a better choice, and perhaps could be done in a future version. For the higher-level organization, we are getting into the realm of databases rather than pure data streams. Of course there are a multitude of existing general-purpose and specialized database formats. There are a few useful desirable properties:

Re Re 5: Currently there is no way to specify the "display dimensions" for Neuroglancer in the data format itself, except that Neuroglancer defaults to the first 3 dimensions. Instead you have to specify that separately in the Neuroglancer state. I think that should logically be separate from the data itself, as you will in general have many different views of the same data. Potentially there could be some standard for specifying data "views", and then associate some default views with the data, but that may quickly get into a lot of very application-specific details and the benefit may be small.

Re Re 6:

a. In general there may be several (or many) different relevant coordinate spaces, and we could assign each a unique identifier. There is the coordinate space in which the data is stored, and then there may be any number of other coordinate spaces to which we may wish to map the data. I think in general it is desirable to minimize the number of coordinate spaces that we must deal with, in order to reduce complexity and mental overhead. There are two issues you raise:

i. Regarding the coordinate space in which the data is stored, we can assume we will store the data as a dense array with integer indexing, but there is the question of whether it must have a zero origin, or may have a non-zero integer origin. Perhaps in the context of numpy it seems natural that the array must have a zero origin. To me it seems natural to allow a non-zero origin, as supported in tensorstore, because then when working with rectangular cutouts, as is common with EM data, we do not need to introduce a separate coordinate space for each cutout.

ii. If we have both a non-zero origin (i.e. "voxel_offset"), and also an affine transform into some other coordinate space, I think we can just say the "voxel_offset" specifies the origin of the "storage" coordinate space for this array, while the affine transform indicates the correspondence to some other, presumably canonical coordinate space. I suppose in general you could support a list of transforms, with named "destination" coordinate spaces, but as with the issue of "display dimensions", we quickly get into application-specific issues, and the question of whether it makes sense to store these transforms with the data, or separately. I would guess that in many cases the data is first acquired and the transforms to other coordinate spaces may not be known. Only later is the data aligned to another coordinate space. Additionally, we also may quickly find that affine is not sufficient to align the data, and we may require other more complicated non-linear transforms.

b. Yes, Neuroglancer already supports affine transforms, which you can configure currently via the "Source" tab for a layer. Additionally, Neuroglancer supports the "nifti" format, which can specify an affine transform into a "canonical" coordinate space, and Neuroglancer will use that transform if it is specified. It would be relatively easy to add similar support for other data sources. There might need to be some UI improvements to better support that, though, as currently those "inherent" transforms are used, but are not explicitly shown to the user.

c. I think a common use case would be a convolutional (or other) model that receives as input data at multiple scales. When executing such a model we need to retrieve suitably-sized and centered bounding boxes from each of the scales. Supporting additional multi-scale metadata formats should not be a problem for Neuroglancer --- the internal model used in Neuroglancer currently is that there is a separate affine transform for each scale to a common coordinate space used by the data source. The user can then optionally apply an additional affine transform on top of that for the entire data source. However, this might not be a suitable abstraction for other use cases, like executing multi-resolution models on the data.

unidesigner commented 2 years ago

Thanks for your very detailed and informative answer, @jbms. I think this issue and the answers in it contain a lot of useful information for any ome-zarr standardization efforts. I'd like to reply with a few more points/questions to your answer, but it will take me a bit of time.

tathey1 commented 1 year ago

Is the _ARRAY_DIMENSIONS attribute still supported by neuroglancer? I can't even get the axes names to work like OP did. I tried to follow the suggested xarray example, but I'm still getting the axis names 'd0','d1','d2' rather than the desired names 'x','y','z' in the viewer:

e.g.

Make the data: make_zarr

Serve with cors_webserver.py, then go to the demo-appspot viewer.