ome / ngff

Next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
https://ngff.openmicroscopy.org

Managing image segmentation data (mutability of ome.zarr) #42

Open tischi opened 3 years ago

tischi commented 3 years ago

@joshmoore @constantinpape

We (ping @cgirardot) have been thinking a bit about data management with ome.ngff and have a question/concern.

Let's say you start with an ome.zarr container that only contains the raw data and then you compute a segmentation (label mask image).

If you add this label mask image into the original ome.zarr, you sort of mutate its identity, because its content is changing, which may not be ideal from a data management perspective.

If you instead were to create a new ome.zarr containing both the raw data and the segmentation, you would have to copy the raw data, which may be prohibitive.

So we were wondering whether the idea is to create a new ome.zarr container that contains only the label mask data and a link to the raw data, such that viewers would still open it as if it contained both the raw and segmentation data.
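To make the "labels plus link" idea concrete, here is a minimal sketch using only the Python standard library. It assumes Zarr v2 style `.zattrs` JSON files for group attributes; the `source_image` key, the container names, and the `resolve_source` helper are hypothetical and not part of any NGFF specification.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical attribute key; NOT defined by the NGFF specification.
SOURCE_KEY = "source_image"


def write_attrs(group: Path, attrs: dict) -> None:
    """Create a group directory and store its attributes as .zattrs JSON."""
    group.mkdir(parents=True, exist_ok=True)
    (group / ".zattrs").write_text(json.dumps(attrs, indent=2))


def resolve_source(labels_container: Path) -> Path:
    """Follow the relative link from a labels-only container to its raw data."""
    attrs = json.loads((labels_container / ".zattrs").read_text())
    return (labels_container / attrs[SOURCE_KEY]).resolve()


root = Path(tempfile.mkdtemp())
raw = root / "raw.zarr"
labels = root / "segmentation.zarr"

# The raw container is written once and never touched again.
write_attrs(raw, {"name": "raw"})
raw_attrs_before = (raw / ".zattrs").read_text()

# The new container holds only the labels and a relative link to the raw data.
write_attrs(labels, {"name": "segmentation", SOURCE_KEY: "../raw.zarr"})

resolved = resolve_source(labels)
print(resolved.name)  # raw.zarr
```

In this sketch a viewer would open `segmentation.zarr`, follow the link, and display both layers, while nothing is written into `raw.zarr` when the segmentation is added.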

Any thoughts on this?

joshmoore commented 3 years ago

#13 should enable that. However, from my point of view, there will still be mutation use cases as well, so I would hope we could define an "internal identity" so that the community would feel comfortable adding data after the fact.

tischi commented 3 years ago

@joshmoore Could you elaborate on the idea of an "internal identity"? Do you already have a vision of how that could work in practice? Let's say I have an ome.zarr with only raw data; let's call this image A (only raw). Then I add a label mask to this ome.zarr; let's call this image B (raw and labels). From a data management point of view: would image A still exist, or does it disappear during the creation of image B? I think we would be in good shape if we could come up with a solution in which image A does in fact still exist, because provenance-wise A is the origin of B and it is good to keep track of this. It is also good to be able to go back to A in case one needs to recompute B.

joshmoore commented 3 years ago

> Could you elaborate on the idea of an "internal identity"?

my_experiment.zarr/
├── analysis
│   └── segmentation
└── image_data

One of the keys of linked data is the ability to reference entities by name. So here I would think image_data and its graph of data & metadata would have an identifier (e.g. urn:uuid:d9dfa7ca-a7ee-11eb-a679-5f0cec9f8212). The segmentation would have one as well. The segmentation would talk about the image_data (likely not the other way around), forming a graph. You could also refer to either of them externally and assert that they are independent, e.g. for defining a DOI. To some extent, this isn't much different from having:

my_experiment_image_data.zarr
my_experiment_segmentation.zarr

and having metadata at yet another level that ties them together, except it would provide a consistent framework for doing so in one fileset if you wanted.
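A rough sketch of how such identifiers could tie the two groups together inside one fileset, using only the Python standard library. It assumes Zarr v2 style `.zattrs` JSON files; the `id` and `derived_from` keys and the `index_by_id` helper are hypothetical and not taken from the NGFF specification.

```python
import json
import tempfile
import uuid
from pathlib import Path


def write_attrs(group: Path, attrs: dict) -> None:
    """Create a group directory and store its attributes as .zattrs JSON."""
    group.mkdir(parents=True, exist_ok=True)
    (group / ".zattrs").write_text(json.dumps(attrs, indent=2))


def index_by_id(fileset: Path) -> dict:
    """Map every identifier found in the fileset to the group that carries it."""
    index = {}
    for attrs_file in fileset.rglob(".zattrs"):
        attrs = json.loads(attrs_file.read_text())
        if "id" in attrs:  # "id" is a hypothetical key
            index[attrs["id"]] = attrs_file.parent
    return index


root = Path(tempfile.mkdtemp()) / "my_experiment.zarr"
image_id = f"urn:uuid:{uuid.uuid4()}"
seg_id = f"urn:uuid:{uuid.uuid4()}"

# image_data knows nothing about the segmentation.
write_attrs(root / "image_data", {"id": image_id})
# The segmentation points at image_data by identifier, not by path.
write_attrs(
    root / "analysis" / "segmentation",
    {"id": seg_id, "derived_from": image_id},
)

index = index_by_id(root)
seg_attrs = json.loads((index[seg_id] / ".zattrs").read_text())
source_dir = index[seg_attrs["derived_from"]]
print(source_dir.relative_to(root))  # image_data
```

Because the reference is by identifier rather than by path, the same lookup would work if `image_data` and `segmentation` lived in two separate `.zarr` containers, provided the index were built over both.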