ome / ome-zarr-py

Implementation of next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
https://pypi.org/project/ome-zarr

Add class annotations and/or other metadata properties to labels #60

Open DragaDoncila opened 3 years ago

DragaDoncila commented 3 years ago

Currently the labels spec supports the declaration of a label-value and its associated color.

Label values commonly carry other associated information, most obviously the class name. napari also supports display of label properties, so this would be a nice additional feature for the reader plugin.

I think the critical requirements for these properties should be:

There are three ways I can see the spec supporting these additional properties:

  1. An arbitrary number of lists, each corresponding to a property and of maximum length n for a label image containing n label values. The index in the list corresponds to the integer label-value, e.g.
    "image-label": {
        "version": "0.1",
        "colors": [
            {
                "label-value": 1,
                "rgba": [
                    255,
                    100,
                    100,
                    255
                ]
            },
            {
                "label-value": 2,
                "rgba": [
                    0,
                    40,
                    200,
                    255
                ]
            },
            {
                "label-value": 3,
                "rgba": [
                    148,
                    50,
                    165,
                    255
                ]
            }
        ],
        "properties": [
            {
                "class": [
                    "Urban",
                    "Water",
                    "Agriculture"
                ],
                "area_m2": [
                    "400",
                    "1532",
                    "590"
                ]
            }
        ]
    }

I think this is the least explicit option, and less intuitive than the next two approaches.
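A minimal sketch of how a reader might expand the Option 1 layout into per-label records. It assumes that position i in each property list belongs to the i-th entry of "colors"; the issue text does not pin down that pairing, and `expand_option1` is a hypothetical helper, not part of ome-zarr-py.

```python
import json

# Option 1 metadata, condensed from the example above.
image_label = json.loads("""
{
    "version": "0.1",
    "colors": [
        {"label-value": 1, "rgba": [255, 100, 100, 255]},
        {"label-value": 2, "rgba": [0, 40, 200, 255]},
        {"label-value": 3, "rgba": [148, 50, 165, 255]}
    ],
    "properties": [
        {
            "class": ["Urban", "Water", "Agriculture"],
            "area_m2": ["400", "1532", "590"]
        }
    ]
}
""")


def expand_option1(meta):
    """Return {label-value: {property: value}} from index-aligned lists.

    Assumption: position i in each list belongs to the i-th "colors" entry.
    """
    label_values = [c["label-value"] for c in meta["colors"]]
    records = {lv: {} for lv in label_values}
    for group in meta["properties"]:
        for name, values in group.items():
            # zip stops at the shorter sequence, so lists shorter than n are tolerated.
            for lv, value in zip(label_values, values):
                records[lv][name] = value
    return records


records = expand_option1(image_label)
print(records[2])
```

The implicit pairing between list position and label is exactly what makes this layout less explicit: nothing in the lists themselves says which label a value belongs to.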

  2. Declare another group similar to colors, where each label-value has its own associated properties:

    {
    "multiscales": [
        {
            "datasets": [
                {
                    "path": "0"
                },
                {
                    "path": "1"
                },
                {
                    "path": "2"
                },
                {
                    "path": "3"
                }
            ],
            "version": "0.1"
        }
    ],
    "image-label": {
        "version": "0.1",
        "colors": [
               ...
        ],
        "properties": [
            {
                "label-value": 1,
                "class": "Urban",
                "area_m2": "400"
            },
            {
                "label-value": 2,
                "class": "Water",
                "area_m2": "1532"
            },
            {
                "label-value": 3,
                "class": "Agriculture",
                "area_m2": "590"
            }
        ]
    }
    }

    This is explicit, but has the disadvantage of duplicating the label-value definitions.
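As a rough sketch of reading the Option 2 layout, the two lists can be indexed by label-value and joined. The color values here are reused from the Option 1 example purely for illustration (the Option 2 example elides them), and `index_by_label` is a hypothetical helper.

```python
# Option 2 layout: "colors" and "properties" are separate lists keyed by label-value.
image_label = {
    "version": "0.1",
    "colors": [
        {"label-value": 1, "rgba": [255, 100, 100, 255]},
        {"label-value": 2, "rgba": [0, 40, 200, 255]},
    ],
    "properties": [
        {"label-value": 1, "class": "Urban", "area_m2": "400"},
        {"label-value": 2, "class": "Water", "area_m2": "1532"},
        {"label-value": 3, "class": "Agriculture", "area_m2": "590"},
    ],
}


def index_by_label(entries):
    """Map label-value -> remaining fields, rejecting duplicate label-values."""
    indexed = {}
    for entry in entries:
        fields = dict(entry)  # copy so the input is not mutated
        label = fields.pop("label-value")
        if label in indexed:
            raise ValueError(f"duplicate label-value {label}")
        indexed[label] = fields
    return indexed


colors = index_by_label(image_label["colors"])
properties = index_by_label(image_label["properties"])

# A label may appear in one list but not the other, so join on the union.
for label in sorted(colors.keys() | properties.keys()):
    rgba = colors.get(label, {}).get("rgba")
    print(label, rgba, properties.get(label, {}))
```

The duplication of label-value shows up here as two lookups per label, but the spec-defined and user-defined fields never share a namespace.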

  3. Make color another property, e.g.

        "properties": [
            {
                "label-value": 1,
                "rgba": [
                    255,
                    100,
                    100,
                    255
                ],
                "class": "Urban",
                "area_m2": "400"
            },
            {
                "label-value": 2,
                "rgba": [
                    0,
                    40,
                    200,
                    255
                ],
                "class": "Water",
                "area_m2": "1532"
            },
            {
                "label-value": 3,
                "rgba": [
                    148,
                    50,
                    165,
                    255
                ],
                "class": "Agriculture",
                "area_m2": "590"
            }
        ]

This doesn't duplicate label-values, and has the benefit of keeping all properties associated with a particular label-value in one spot.
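With Option 3 a reader has to know which keys are spec-defined in order to separate the color from everything else. A small sketch, where `SPEC_KEYS` and `split_entry` are illustrative names rather than anything defined by the spec or by ome-zarr-py:

```python
# Keys this sketch treats as spec-defined; everything else is treated as user data.
SPEC_KEYS = {"label-value", "rgba"}

properties = [
    {"label-value": 1, "rgba": [255, 100, 100, 255], "class": "Urban", "area_m2": "400"},
    {"label-value": 3, "rgba": [148, 50, 165, 255], "class": "Agriculture", "area_m2": "590"},
]


def split_entry(entry):
    """Return (label, rgba, user_properties) for one Option 3 entry."""
    user = {k: v for k, v in entry.items() if k not in SPEC_KEYS}
    return entry["label-value"], entry.get("rgba"), user


for entry in properties:
    label, rgba, user = split_entry(entry)
    print(label, rgba, user)
```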

On the implementation side, I think the differences in parsing the properties are negligible.

I'd love to hear what other people think are appropriate ways to represent the properties in the label metadata, or what they think the best option is.

imagesc-bot commented 3 years ago

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/multi-scale-image-labels-v0-1/43483/3

joshmoore commented 3 years ago

Hi @DragaDoncila. Sorry for the slow response. Took some time to get caught up after the call. :wink:

Having this conversation kicked off is great! And I certainly like what you're proposing with 3, but even though the v0.1 proposal hasn't really been officially released, there are a number of repositories that are already implementing it.

A few options I can imagine are:

I should add that I think another, similar breaking change may come when tabular data is supported, in which case we may move some of this metadata into arrays to deal with very large numbers of labels.

manics commented 3 years ago

Option 3 looks the cleanest, but a big disadvantage is that future additions to the spec may use property names that clash with user-defined ones, unless there is some way to indicate reserved names. In this respect Option 2 seems better despite the duplication of label-value, as all user-defined properties can go under properties without worrying about future conflicts.
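To make the clash concrete, here is a sketch under invented assumptions: both reserved-name sets are hypothetical, and "class" is used only as an example of a name a later spec version might claim.

```python
# Hypothetical reserved-name sets; neither is taken from a published spec version.
RESERVED_NOW = {"label-value", "rgba"}
RESERVED_LATER = RESERVED_NOW | {"class"}  # imagine the spec later defines "class"

entries = [
    {"label-value": 1, "rgba": [255, 100, 100, 255], "class": "Urban", "area_m2": "400"},
    {"label-value": 2, "rgba": [0, 40, 200, 255], "area_m2": "1532"},
]


def user_keys(entry, reserved):
    """Keys a reader would treat as user-defined under a given reserved set."""
    return set(entry) - reserved


def clashes(entries, reserved_old, reserved_new):
    """Per label, keys that were user-defined before but are reserved now."""
    newly_reserved = reserved_new - reserved_old
    result = {}
    for entry in entries:
        hit = user_keys(entry, reserved_old) & newly_reserved
        if hit:
            result[entry["label-value"]] = sorted(hit)
    return result


print(clashes(entries, RESERVED_NOW, RESERVED_LATER))
```

Files written under the older reserved set would silently change meaning for label 1, which is the risk a separate namespace for user properties avoids.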

manics commented 3 years ago

Option 4 could be a variant of 3 where the user properties are under a dedicated subkey (I can't think of a good name so I've called it extra-properties in the example):

        "properties": [
            {
                "label-value": 1,
                "rgba": [
                    255,
                    100,
                    100,
                    255
                ],
                "extra-properties": {
                  "class": "Urban",
                  "area_m2": "400",
                  "other": [1, 2, 3, 4]
                }
            },

will-moore commented 3 years ago

I think I prefer Option 2! I don't see a big problem with duplication of the label-value key, and it also makes spec-defined attributes (e.g. colors) easy to distinguish from custom properties. No naming conflicts, and without as much nesting as Option 4.

DragaDoncila commented 3 years ago

Hi everyone,

Sorry for the late response - I've been finishing my honours thesis over the last few days so it's been packed.

Thanks for all the input! Having read through the suggestions here, I think @manics' concern about clashes with future reserved names is the biggest disadvantage of Option 3. The extra-properties or user-properties subkey would definitely solve this issue but seems less elegant.

Despite initially thinking Option 3 was the way to go, I now actually think I agree with @will-moore that Option 2 seems preferable, as it fully separates spec properties and user defined properties.

@joshmoore how does that mesh with your longer term view of tabular metadata?

manics commented 3 years ago

I don't think we necessarily need to constrain ourselves to the equivalent of tabular metadata for the properties key, instead we could say it's an array of JSON values. These could be flat key-value dictionaries or arrays if the intention is to convert them to a table, but nested dictionaries could also be allowed.
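One way nested dictionaries could still be turned into a table is to flatten them into dotted column names. This is only a sketch: the dotted naming, the `flatten` helper, and the nested "stats" and "bbox" fields are assumptions made for illustration.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts into {"a.b": leaf}; lists and scalars are kept as leaves."""
    if not isinstance(value, dict):
        return {prefix: value}
    flat = {}
    for key, child in value.items():
        path = f"{prefix}.{key}" if prefix else key
        flat.update(flatten(child, path))
    return flat


# Hypothetical entry mixing flat and nested user properties.
entry = {
    "label-value": 1,
    "class": "Urban",
    "stats": {"area_m2": "400", "bbox": [0, 0, 10, 20]},
}

print(flatten(entry))
```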

imagesc-bot commented 3 years ago

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/multi-scale-image-labels-v0-1/43483/9

joshmoore commented 3 years ago

> I don't think we necessarily need to constrain ourselves to the equivalent of tabular metadata for the properties key,

I was thinking about the reverse. I can see having the JSON keys be "deeper", but what does one do when one wants to add gigabytes of tabular data? It's not required to solve that now, but it will come up eventually.

For what it's worth, https://www.w3.org/TR/csv2json/ has some examples. Looks like the method there is a top-level object per row.
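For a feel of the one-object-per-row shape, the sketch below reads a small CSV with Python's standard `csv` module. It illustrates the resulting structure only and is not an implementation of the W3C conversion algorithm; the CSV content is made up from the class and area values used earlier in this issue.

```python
import csv
import io
import json

csv_text = """label-value,class,area_m2
1,Urban,400
2,Water,1532
3,Agriculture,590
"""

# Each CSV row becomes one JSON object whose keys are the column headers.
rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
for row in rows:
    # CSV cells are strings; the label-value is an integer in the labels metadata.
    row["label-value"] = int(row["label-value"])

print(json.dumps(rows, indent=2))
```

This is the same shape as the Option 2 "properties" list, which suggests the list layout and a future tabular representation could be converted into one another.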

All that being said, I can definitely still see option 2 as a first non-breaking change that we iterate on.

cc: @manzt

tischi commented 3 years ago

Hello, we were also thinking about image regions where objects overlap, see discussion here: https://forum.image.sc/t/multi-scale-image-labels-v0-1/43483/7

I am not sure, but maybe this could be tackled by something like:

"properties": [
            {
                "label-value": 1,
                "associated-label-values": [3],
                "class": "Urban",
                "area_m2": "400"
            },
            {
                "label-value": 2,
                "associated-label-values": [3],
                "class": "Water",
                "area_m2": "1532"
            },
            {
                "label-value": 3,
                "child-labels": [2, 1]
            }
]

This would mean that label 3 is a region where labels 1 and 2 overlap. It also means that the image region that semantically corresponds to label 1 is actually bigger, namely the union of the regions covered by label 1 and 3.

The "associated-label-values" is redundant with the "child-labels" and maybe should be removed. I added it here because, in practice, it could be good to see at one glance that label 1 alone does not fully cover "object 1" but only when combined with the image region covered by label 3.
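A sketch of how a reader could use these fields, assuming the semantics described above: a label listed in another entry's "child-labels" also owns that entry's pixels. Both helper functions are hypothetical.

```python
properties = [
    {"label-value": 1, "associated-label-values": [3], "class": "Urban", "area_m2": "400"},
    {"label-value": 2, "associated-label-values": [3], "class": "Water", "area_m2": "1532"},
    {"label-value": 3, "child-labels": [2, 1]},
]


def object_pixel_values(properties, label):
    """Label values whose pixels together make up the object `label`."""
    values = {label}
    for entry in properties:
        # An overlap region belongs to every object named in its "child-labels".
        if label in entry.get("child-labels", []):
            values.add(entry["label-value"])
    return values


def redundant_fields_agree(properties):
    """Check that "associated-label-values" matches what "child-labels" implies."""
    for entry in properties:
        if "associated-label-values" not in entry:
            continue
        own = entry["label-value"]
        derived = object_pixel_values(properties, own) - {own}
        if set(entry["associated-label-values"]) != derived:
            return False
    return True


print(sorted(object_pixel_values(properties, 1)))
print(redundant_fields_agree(properties))
```

Because "associated-label-values" can be derived from "child-labels", keeping both means a writer has to keep them in sync; a check like `redundant_fields_agree` is the price of the at-a-glance readability.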

@constantinpape, do you maybe have comments or suggestions?

constantinpape commented 3 years ago

@tischi yes, I think this could be a good solution for overlapping labels.

I think this opens up a few more questions that are maybe also relevant for the overall discussion of the label properties:

tischi commented 3 years ago

> If a field is present in one properties, does it need to be in all the others? E.g. do we need class in all elements in the property list?

I would say that if we go for the list-based approach above, we should not require a field to be present for all labels. If the storage layout were more table-based then, I guess, we would have to.

I think the list-based approach above is nice, as it provides a lot of flexibility in terms of different labels having more or less information attached to them.
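If fields are optional per label, a consumer that wants a table has to pad the gaps itself. A minimal sketch, where `to_columns` and the choice of `None` as the fill value are assumptions:

```python
def to_columns(entries, missing=None):
    """Turn a list of per-label dicts into {column: [values...]}, padding gaps."""
    columns = []
    for entry in entries:
        for key in entry:
            if key not in columns:
                columns.append(key)  # keep first-seen column order
    return {key: [entry.get(key, missing) for entry in entries] for key in columns}


# Labels 2 and 3 each omit one of the fields that label 1 has.
entries = [
    {"label-value": 1, "class": "Urban", "area_m2": "400"},
    {"label-value": 2, "class": "Water"},
    {"label-value": 3, "area_m2": "590"},
]

table = to_columns(entries)
print(table)
```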

The disadvantage that I see compared with a table-based approach is that it will require more storage space, and could thus be quite slow to download and parse in order to, e.g., build a table from it.

Thus for use cases with millions of labels I am a bit worried about performance.
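The storage overhead can be estimated by serializing the same synthetic data both ways. The data below is invented for the comparison, and the exact ratio will depend on key names, value sizes, and any compression applied, so treat it as a rough illustration only.

```python
import json

n = 1000  # number of synthetic labels; real datasets may have millions
rows = [{"label-value": i, "class": "Urban", "area_m2": str(i)} for i in range(1, n + 1)]

# Same data, column-oriented: each key is written once instead of once per label.
columns = {
    "label-value": [r["label-value"] for r in rows],
    "class": [r["class"] for r in rows],
    "area_m2": [r["area_m2"] for r in rows],
}

compact = {"separators": (",", ":")}  # drop optional whitespace for a fair comparison
row_bytes = len(json.dumps(rows, **compact).encode("utf-8"))
column_bytes = len(json.dumps(columns, **compact).encode("utf-8"))
print(row_bytes, column_bytes)
```

The list-of-objects form repeats every key for every label, so its size grows with the number of labels times the total key length, on top of the values themselves.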

manics commented 3 years ago

I think we'll want both options: JSON-style nested dictionaries for arbitrary properties, and support for tabular data. In the short term JSON dictionaries are relatively easy to add to the spec, so it makes sense to start there.

joshmoore commented 3 years ago

Whew. Ok. So it sounds like we have some points for future discussion, but generally a consensus that we could start building, no? @DragaDoncila, have you already started on a branch anywhere? If not but were looking to start, do you think you have everything you need for a first pass?

DragaDoncila commented 3 years ago

@joshmoore I've started a branch, which has Option 1 already implemented. From what I read here, Option 2 is the consensus to start with, before we move on to adding support for tabular data. I think I have everything I need for a first pass, so I'll put up a WIP PR by Monday afternoon, if that timeline is okay.

joshmoore commented 3 years ago

Sounds amazing. Thanks, @DragaDoncila !