Also, if a dataset has more than a pyramid (i.e. IMS + MxIF) we could create different groups which house these data within the zarr store
Is this in the example below? Would there be pyramid-ims and pyramid-mxif, or do you mean something else?
max_level # number of total pyramid levels (zero indexed)
So if there are 4 levels (0, 1, 2, 3), max_level == 3?
channel_names
Instead of just making this an array of strings, I'd suggest one extra level, with name as the one required field. In the example right now, we know that the round is another piece of metadata that matters, and there may be others:
channels: [
  {
    name: 'DAPI',
    round: 1,
    timestamp: '2001-01-01T00:00:00',
    ...
  }
]
dimension_names
This would be all dimensions, not just the last two? Here, I'd think a string array would be sufficient?
// Throw an error if cnames don't match up with num dims
I think you mean the number of channels should be the product of all but the last two dimensions? So 4 timepoints x 3 colors x 256 x 256 would just be 4 * 3 = 12?
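For concreteness, a tiny Python sketch of that check (the shape values follow the example in the question; this is just one guess at how the validation might look):

from math import prod

shape = (4, 3, 256, 256)       # 4 timepoints x 3 colors x 256 x 256
assert prod(shape[:-2]) == 12  # so cnames would need 12 entries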
// We'll throw an error if the last two aren't y, x
Does it need to be that strict? If someone had "south" and "east", is that forbidden?
I think this is good. Instead of flat lists of channels, I could imagine extra nested levels if the data is more than 3D... but I'm not sure this would make life easier.
max_level # number of total pyramid levels (zero indexed)
So if there are 4 levels (0, 1, 2, 3), max_level == 3?
max_level would equal 4 in this case. It corresponds to the minZoom required by deck.gl: minZoom === -max_level.
channel_names
Instead of just making this an array of strings, I'd suggest one extra level, with name as the one required field.
I agree with this. A flat array doesn't quite support the functionality we need. That said, depending on the modality, OME-TIFFs might store this differently. For some datasets, the channels might be the same across all other dimensions (i.e. z or time), but others might stain with different antibodies at different time points. We need to keep track of these labels.
For example, the CyCIF dataset I've been looking at has shape == (9, 4, 12000, 12000) and dimension_names == (stain_round, channel, y, x). But the metadata for the TIFF has len(channels) == num_images == 36. Here the information about rounds is lost, but perhaps it's up to the user to chunk this list (since the number of rounds and the number of channels per round are known).
Ideally we want an object such that, for each tiled image (an (x, y) slice at a particular stain_round and channel), we can index into the zarr array, or, given slice indices, look up the names of the modalities. For datasets where the number of images is small (< 100), using JSON makes a lot of sense. Much larger than that, however, and this could be another compelling use case for DataFrames with Apache Arrow. Each row describes the image at a particular slice.
┌─────────┬─────────┬────────────┬───────────────┐
│  time   │ channel │ time_index │ channel_index │
├─────────┼─────────┼────────────┼───────────────┤
│ round 1 │ DAPI    │ 0          │ 0             │
│ round 1 │ EMPTY   │ 0          │ 1             │
│ round 2 │ EMPTY   │ 1          │ 2             │
│ round 2 │ MART1   │ 1          │ 0             │
│ round 3 │ CD163   │ 2          │ 2             │
│ round 4 │ CD38    │ 3          │ 3             │
└─────────┴─────────┴────────────┴───────────────┘
This would make data-binding straightforward. Use case: in the UI, a user wants to examine all 'DAPI' stains. We expose the rows of the table above (minus the index information) and allow the user to filter this list. When a selection is made, we render the image at arr[time_index, channel_index, :, :]. We can also provide sliders for moving along these dimensions of the data, and look up the corresponding time and channel labels to show the user.
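A rough Python sketch of that lookup, using a plain list of dicts in place of an Arrow table (the path data.zarr/00 and the label values are placeholders):

import zarr

arr = zarr.open("data.zarr/00", mode="r")  # shape: (time, channel, y, x)

table = [
    {"time": "round 1", "channel": "DAPI",  "time_index": 0, "channel_index": 0},
    {"time": "round 1", "channel": "EMPTY", "time_index": 0, "channel_index": 1},
    # ... one row per (time, channel) image
]

# UI selection: "show all DAPI stains"
selected = [row for row in table if row["channel"] == "DAPI"]
images = [arr[row["time_index"], row["channel_index"], :, :] for row in selected]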
// We'll throw an error if the last two aren't y, x
Does it need to be that strict? If someone had "south" and "east", is that forbidden?
It doesn't need to be this strict. We just want to ensure that the image data is row-major, with the last two dimensions being the y and x coordinates in deck.gl.
The metadata discussion above might be somewhat outside the scope of this issue, but I think this is a good place to have it.
After talking to @ngehlenborg, I think storing the highest resolution array outside of the pyramid makes a lot of sense:
.
└── mxif_data.zarr/
├── .zgroup
├── base
│ ├── .zarray
│ ├── .zattrs
│ ├── 0.0.0
│ └── ...etc
└── pyramid_levels/
├── .zgroup
├── 01/
│ ├── .zarray
│ ├── 0.0.0
│ └── ...etc
└── 02/
├── .zarray
├── 0.0.0
└── ...etc
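A minimal sketch (zarr-python v2) of writing that layout; the array size, chunking, and the naive striding used as a stand-in for real downsampling are all placeholders:

import numpy as np
import zarr

img = np.random.randint(0, 2**16, size=(4, 4096, 4096), dtype="uint16")

root = zarr.open_group("mxif_data.zarr", mode="w")

# The highest-resolution array sits outside the pyramid group.
root.create_dataset("base", data=img, chunks=(1, 512, 512))

# Downsampled levels live under pyramid_levels/01, pyramid_levels/02, ...
levels = root.create_group("pyramid_levels")
for i in range(1, 3):
    factor = 2 ** i
    levels.create_dataset(f"{i:02d}", data=img[:, ::factor, ::factor], chunks=(1, 512, 512))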
For a particular URL, we could check raw/.zattrs for the max_levels key. If max_levels is present, then we know that the image is a pyramid and we can create connections to the downsampled child arrays. If not, then we know that the image isn't tiled.
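Something like this sketch (zarr-python v2), where the base path and the max_levels key name simply follow the layout and text above:

import zarr

grp = zarr.open_group("mxif_data.zarr", mode="r")
attrs = dict(grp["base"].attrs)

if "max_levels" in attrs:
    # Pyramid: establish one connection per downsampled level.
    levels = [grp[f"pyramid_levels/{i:02d}"] for i in range(1, attrs["max_levels"] + 1)]
else:
    # No key: treat the image as a single, untiled array.
    levels = [grp["base"]]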
I think this format makes sense and leaves open adding additional data, like segmentations, in the store underneath mxif_data.zarr. I think having a metadata tag about the relationship of physical coordinates to pixel coordinates is essential (1 pixel = 5 physical distance units, with an attribute for both the scale and unit). I'm not sure if the vitessce-image-viewer has incorporated a scale bar, but it would need this information to create one.
Thanks for the feedback @NHPatterson !
I think having a metadata tag about the relationship of physical coordinates to pixel coordinates is essential (1 pixel = 5 physical distance units, with an attribute for both the scale and unit).
Agreed. Should we add this as a required field in Viv? The current fields are required in the sense that nothing will render if they're not provided, but we could have a very pesky error message saying "field not provided, cannot create scale bar". I've been adding this metadata to the .zarr stores I've been creating, and we will certainly parse this for use in the scale bar. I'm thinking for metadata we should have something like:
// .zattrs
{
  "max_level": 4,
  "dimension_names": [
    "T",
    "C",
    "Y",
    "X"
  ],
  "cnames": [
    "Channel 1",
    "Channel 2",
    "Channel 3",
    "Channel 4"
  ],
  "mc": true,
  "rgb": false,
  "samples_per_pixel": 1,
  "size_c": 48,
  "size_t": 1,
  "size_x": 12291,
  "size_y": 12336,
  "size_z": 1,
  "x_scale": 0.324999988079,
  "y_scale": 0.324999988079
}
All we really need to make a guess about how to render from here is the "max_level" key. We could make all other fields optional, and then in the UI communicate that there are fields missing, but a user would still be able to look at their data using sliders like in napari. If they forgot to include a piece of metadata, we could ask that they write it to the zarr store or provide an additional object containing this information. That way we can always upgrade someone's use of vitessce, but not block out anyone who hasn't quite gotten their format right yet.
You could default to 1 for scale and pixel for unit within viv. It shouldn't be a required field, but certainly for any biological imaging data, knowing the image scale is important. Perhaps as an editable field in an image metadata pop-up? It may seem far off, but having an 'info' button next to the image plane that one can click to get all of this embedded information would be useful, with some of the attributes being modifiable... looks like your edit beat me to the punch.
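That fallback, sketched in Python for illustration (viv itself is JavaScript, and the unit key is hypothetical rather than part of the example .zattrs above):

attrs = {"max_level": 4}                 # no physical-scale metadata present
scale = attrs.get("x_scale", 1)          # fall back to a scale of 1
unit = attrs.get("unit", "pixel")        # fall back to 'pixel'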
Ah that would be great. Allow the option to add fields in a form-like entry or cut and paste JSON.
We should also version this schema in the .zattrs of the pyramid base.
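For example, something like the following (the key name schema_version is just an assumption):

import zarr

base = zarr.open("mxif_data.zarr/base", mode="a")
base.attrs["schema_version"] = "0.1.0"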
@manzt: can this be closed?
Yes, we're doing our best to follow what is being decided by the OME community so we don't roll our own solution for zarr. https://github.com/ome/omero-ms-zarr/blob/master/spec.md
Description
To my knowledge, it is quite open-ended how people create zarr array pyramids. I've currently used the following schema, where each array is found in the pyramid group. But others may store their pyramids in different zarr arrays altogether.
I don't think much focus has gone into standardizing this because many people tiling with zarr are using napari, and that library affords flexibility by requiring users to load dask arrays into a list first. For each example above, we could load the same thing into napari with:
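Roughly, assuming dask and arrays stored at pyramid/00, pyramid/01, ... (the store name data.zarr and the component paths are placeholders):

import dask.array as da
import napari

# Load each pyramid level lazily as a dask array.
pyramid = [da.from_zarr("data.zarr", component=f"pyramid/{i:02d}") for i in range(3)]

viewer = napari.Viewer()
viewer.add_image(pyramid)  # napari treats a list of arrays as a multiscale image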
As long as viewer.add_image gets a list of ndarrays, napari knows what to do. We likely can't be as flexible as this, so I would like to iron out some type of standard for creating these arrays. Ultimately in viv we have a very similar pattern because connections for the zarr loader are just an array of zarr objects.
Proposal
In viv, we should make the nested format from above the default. This way we can create pyramids and keep them all in the same named directory. Also, if a dataset has more than a pyramid (i.e. IMS + MxIF), we could create different groups which house these data within the zarr store, keeping together data modalities which will be visualized together.
Metadata
The metadata for the pyramid should be contained in the .zattrs of the 00/ array. This is a JSON file and should have the required fields:
max_level # number of total pyramid levels (zero indexed)
channel_names
dimension_names # last two dimensions should be y, x but might have time, channel, etc...
Providing the max_level will let us know how many connections to establish for the viewer. We should determine imageHeight, imageWidth, and tileSize all from the zarr array chunks and shape data.
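A quick sketch of deriving those values from the base array's shape and chunks (Python for illustration; viv itself is JavaScript, and data.zarr/pyramid/00 is a placeholder path):

import zarr

base = zarr.open("data.zarr/pyramid/00", mode="r")

image_height, image_width = base.shape[-2:]  # last two dimensions are y, x
tile_size = base.chunks[-1]                  # assuming square spatial chunks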
Additional flexibility
In case any of the metadata are missing or someone has a more bespoke zarr schema, we should allow a config object to manually set these parameters: