ome / ngff

Next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
https://ngff.openmicroscopy.org

RFC: Zarr v3 #227

Closed normanrz closed 2 months ago

normanrz commented 4 months ago

This is an RFC proposal for adopting Zarr v3 as the new storage format for OME-Zarr.

It is a followup to the discussions in #206 and on image.sc.

Briefly, this proposal aims to adopt Zarr v3 as the new format for the next version of OME-Zarr. That unlocks the new features of Zarr v3 including sharding. Zarr v2 would not be allowed anymore (only through older versions of OME-Zarr). Additionally, there are some small changes to the OME-Zarr metadata to improve namespacing and versioning.

This RFC is currently in draft status with the goal of clarifying questions before the full review. Additional endorsements are also welcome.

Check this link for a review: https://ngff--227.org.readthedocs.build/rfc/2/index.html

imagesc-bot commented 4 months ago

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/adopt-zarr-v3-in-ome-zarr/84786/2

kevinyamauchi commented 4 months ago

I think moving to zarr v3 is a great step. I would be happy to "endorse" this RFC.

Thanks, @normanrz !

clbarnes commented 4 months ago

Does the sharding codec need to be detailed here or can we just name-drop it as an advantage of zarr v3 and then link to zarr's information about it?

There are a couple of things which could change in NGFF to make it more zarr-y. They're not blockers to adoption, just something to reduce downstream users' headaches, which IMO will never be addressed if not now.

One is the case convention: zarr uses snake_case, NGFF uses camelCase.

The other is how enums are tagged: zarr uses adjacently tagged ({"name": "something", "config": {"somekey": ..., "anotherkey": ...}}) where NGFF uses internally tagged ({"name": "something", "somekey": ..., "anotherkey": ...}).
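To make the two differences concrete, here is a minimal sketch of what converting between the conventions could look like. The helper names (`camel_to_snake`, `to_adjacent`, `to_internal`) are hypothetical and not part of any Zarr or NGFF library. The `config` key follows the spelling used in the comment above; the Zarr v3 spec itself spells it `configuration`, so the key is left as a parameter.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a camelCase identifier to snake_case."""
    # Insert "_" before every capital letter except a leading one, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_adjacent(obj: dict, tag_key: str = "name", config_key: str = "config") -> dict:
    """Internally tagged -> adjacently tagged.

    {"name": "x", "a": 1}  becomes  {"name": "x", "config": {"a": 1}}
    """
    rest = {k: v for k, v in obj.items() if k != tag_key}
    return {tag_key: obj[tag_key], config_key: rest}


def to_internal(obj: dict, tag_key: str = "name", config_key: str = "config") -> dict:
    """Adjacently tagged -> internally tagged (inverse of to_adjacent)."""
    out = {tag_key: obj[tag_key]}
    out.update(obj.get(config_key, {}))
    return out
```

One practical argument for the adjacent form is that a parameter can never collide with the tag key, because the parameters live in their own object.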

d-v-b commented 4 months ago

One is the case convention: zarr uses snake_case, NGFF uses camelCase.

The other is how enums are tagged: zarr uses adjacently tagged ({"name": "something", "config": {"somekey": ..., "anotherkey": ...}}) where NGFF uses internally tagged ({"name": "something", "somekey": ..., "anotherkey": ...}).

These are both very good points -- on the latter, you might want to weigh in over at https://github.com/ome/ngff/pull/138, where a lot of new enums are being minted.

normanrz commented 4 months ago

Does the sharding codec need to be detailed here or can we just name-drop it as an advantage of zarr v3 and then link to zarr's information about it?

I think sharding is a major motivation to move to v3. That is why it gets so much space in my proposal. It is intended as background information and, in the end, has no direct implications on the OME-Zarr spec.

There are a couple of things which could change in NGFF to make it more zarr-y. They're not blockers to adoption, just something to reduce downstream users' headaches, which IMO will never be addressed if not now.

One is the case convention: zarr uses snake_case, NGFF uses camelCase.

The other is how enums are tagged: zarr uses adjacently tagged ({"name": "something", "config": {"somekey": ..., "anotherkey": ...}}) where NGFF uses internally tagged ({"name": "something", "somekey": ..., "anotherkey": ...}).

Pinging @joshmoore for advice on whether that should be a separate RFC. I think it is possible to bundle multiple RFCs into one new version of the spec.

bogovicj commented 4 months ago

Thanks @normanrz ! I'm also happy to endorse this RFC!

matthewh-ebi commented 4 months ago

Also happy to endorse, this would be very beneficial for our use cases, huge thanks @normanrz !

tischi commented 4 months ago

Thanks @normanrz, I am also happy to endorse this RFC.

jluethi commented 4 months ago

Thanks a lot @normanrz , I am also happy to endorse this RFC! It will be great to get sharding for OME-Zarrs.

constantinpape commented 4 months ago

I also endorse this RFC!

will-moore commented 4 months ago

I'm happy to endorse this RFC! 👍

jni commented 4 months ago

Obviously I'm all in favour of supporting v3, but:

Zarr v2 would not be allowed anymore (only through older versions of OME-Zarr).

What is the motivation for this? Why should we couple ome-zarr and zarr so tightly? If someone has ome-zarr v0.5 (or whatever) metadata at the root of a v2 zarr folder, why should that be forbidden?

jni commented 4 months ago

Gaaaah, don't comment before reading the article! 😅 I just read:

The metadata of Zarr v3 arrays are not backwards compatible with Zarr v2.

which explains it. However, it does still seem lightweight to support both on the ome side?

normanrz commented 4 months ago

However, it does still seem lightweight to support both on the ome side?

While it is easy to support both versions in the OME spec document, I'm concerned with the complexity burden for implementations. I'd rather not add one dimension to the compatibility matrix. OME-Zarr implementations that build upon libraries probably have good support for v2 and v3 at the moment. However, this might change in the future (anyone remember what happened to Zarr v1?). For example, in zarr-python, we are working on a refactoring that is v3-first. It is not unlikely that in the future v2 support will become deprecated. Also, there are implementations that roll their own Zarr stack that would need to add support for v2 and v3.

From an OME-Zarr user perspective, the hard cut also makes things simpler: ≤ 0.5 => Zarr v2; > 0.5 => Zarr v3 (or whatever the version number will be). If users wish to upgrade their data from one OME-Zarr version to another, it would be easy to also migrate the core Zarr metadata to v3. This is a fairly cheap operation, because only JSON files are touched. Zarr v2 and v3 metadata could even live side-by-side in the same hierarchy. There are functions available that can migrate the metadata automatically (e.g. in zarrita and soon zarr-python).
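As a rough illustration of why the migration only touches JSON, the sketch below builds a Zarr v3 `zarr.json` document from the contents of a v2 `.zarray`. It is not the zarrita or zarr-python migration function; `zarray_to_v3` is a hypothetical helper that deliberately handles only the simplest case (uncompressed, C-order, little-endian multi-byte numeric types) and raises for anything else. It uses the `v2` chunk key encoding so that existing chunk files keep their names.

```python
from typing import Optional

# Assumption: only a handful of little-endian NumPy dtype strings are handled.
_V2_TO_V3_DTYPE = {
    "<i2": "int16", "<i4": "int32", "<i8": "int64",
    "<u2": "uint16", "<u4": "uint32", "<u8": "uint64",
    "<f4": "float32", "<f8": "float64",
}


def zarray_to_v3(zarray: dict, attrs: Optional[dict] = None) -> dict:
    """Build Zarr v3 array metadata from v2 `.zarray` (and `.zattrs`) contents."""
    if zarray.get("compressor") is not None or zarray.get("filters"):
        raise NotImplementedError("compressors/filters are out of scope for this sketch")
    if zarray.get("order", "C") != "C":
        raise NotImplementedError("F-order arrays would need a transpose codec")
    dtype = zarray["dtype"]
    if dtype not in _V2_TO_V3_DTYPE:
        raise NotImplementedError(f"dtype {dtype!r} is not covered by this sketch")

    fill_value = zarray.get("fill_value")
    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": list(zarray["shape"]),
        "data_type": _V2_TO_V3_DTYPE[dtype],
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": list(zarray["chunks"])},
        },
        # Keep the v2 chunk file naming so no chunk data has to move.
        "chunk_key_encoding": {
            "name": "v2",
            "configuration": {"separator": zarray.get("dimension_separator", ".")},
        },
        # Assumption: a missing v2 fill value is replaced by 0.
        "fill_value": 0 if fill_value is None else fill_value,
        "codecs": [{"name": "bytes", "configuration": {"endian": "little"}}],
        "attributes": dict(attrs or {}),
    }
```

Writing the result as `zarr.json` next to the existing `.zarray` is what makes the side-by-side layout possible: the chunk files are shared and only the metadata documents differ.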

jni commented 4 months ago

Sure, I guess it is indeed easier as a user to know if you have an ome-zarr v0.5 file that all readers would support it, rather than have to understand whether your reader supports zarr v2. It would indeed be easy to get into a situation like "does this USB cable support data transfer and at what speed?" 😅

Anyway, please take my question as more for my own information rather than as a blocker: I too am happy to endorse this plan. 😊

joshmoore commented 4 months ago

Pinging @joshmoore for advice on whether that should be a separate RFC. I think it is possible to bundle multiple RFCs into one new version of the spec.

The current plan is definitely to collect multiple RFCs into a single spec version, but that being said, I personally don't think sharding needs a separate RFC. With the move to v3, we have the chance to decouple the NGFF specifics from the specifics of the backend (looking at you, dimension separator), so I agree with @clbarnes that this should be more about referencing an existing spec (ZEP1 & ZEP2), but we do need to make users & implementers aware of the trade-offs that a given backend provides them. It's going to be a fine balance.

However, it does still seem lightweight to support both on the ome side?

While it is easy to support both versions in the OME spec document,

I still need to go through the text, but a :+1: for including any explanations you give here in the main text if they are not already there.

joshmoore commented 4 months ago

... so I agree with @clbarnes that should be more referencing an existing spec (ZEP1 & ZEP2)...

@normanrz gently pointed out that I had misunderstood his question. The point was whether or not this RFC should include issues about camelCasing, etc. I would think not. Judging simply by the number of endorsers this has already received, adding more issues, especially ones that are prone to bikeshedding, can only make things less clear. (And for such topics, it likely makes sense to do some consensus building outside of the RFC before bringing it for review (D2 "gather support") but that's beyond the scope of this PR. :smile:)

normanrz commented 4 months ago

Thanks for the endorsements and feedback! I moved the proposal to draft state D3, which is intended to clarify questions before the review. Any feedback and questions are appreciated. More endorsements are also very welcome!

Based on the feedback, I moved the sections about sharding to "Background" (because it is not really part of this RFC; just illustrates a motivation for adopting v3) and added the motivation for dropping v2 support in the text.

perlman commented 4 months ago

+1 to the chorus of endorsements.

I am looking forward to using Zarr v3 with several of my hats.

ziw-liu commented 3 months ago

Thanks @normanrz! I endorse this RFC.

LDeakin commented 3 months ago

I endorse this RFC.

Some comments:

normanrz commented 3 months ago
  • I like that the OME-Zarr metadata can now also be stored in array attributes. That partially satisfies this request https://github.com/ome/ngff/issues/207. multiscales in an array will need additional constraints on the datasets part though (e.g., single element with empty path).

While I would like that, I didn't intend to change this behavior as part of this RFC.

imagesc-bot commented 2 months ago

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome-ngff-update-postponing-transforms-previously-v0-5/95617/1

joshmoore commented 2 months ago

As described in https://forum.image.sc/t/ome-ngff-update-postponing-transforms-previously-v0-5/95617/2, merging this to move forward with the first round of reviews. Thanks all for your feedback & endorsements.

mkitti commented 2 months ago

Commenting solely in an individual capacity, I endorse the premise that "OME-Zarr should adopt Zarr v3 as the storage format". I have recently deployed Zarr v3 shards within Janelia on behalf of Philip Keller's Lab for use with neuroglancer, tensorstore, and other compatible tools.

The content of the RFC itself is confusing. It seems to mostly replicate parts of the Zarr v3 specification and highlight key changes. Rather than duplicate such specification content, the RFC should mainly reference the canonical Zarr v3 specification.

Lacking from the RFC is content pertinent to OME-Zarr as a standard. I would particularly like to see the following.

  1. Details about how OME metadata would be integrated into the new storage format.
  2. Clarity about the status of image data in Zarr v2. Is Zarr v2 deprecated? Do we plan to maintain Zarr v2 as part of the OME-Zarr specification?
  3. Should there be OME-Zarr v2 and OME-Zarr v3 specifications?
  4. Further guidance on how to transition Zarr v2 content to Zarr v3. For each of the metadata examples in the current specification, how do those appear in the Zarr v3 specification?
  5. Is it compliant for an array to contain both Zarr v2 and Zarr v3 metadata files side by side?

imagesc-bot commented 2 months ago

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome-ngff-update-postponing-transforms-previously-v0-5/95617/4

normanrz commented 2 months ago

Thanks for your feedback @mkitti! Apart from just referencing the Zarr v3 spec, I thought it would be useful to highlight some of the v3 features as a motivation for adopting the new version. I didn't want to assume that everybody in this community is aware of all the changes in Zarr v3. I'll make sure to better separate that from the actual changes to OME-Zarr in the next iteration of the RFC after the first round of reviews. I also hope that things will become clearer once I have added the changes to the spec document.

Lacking from the RFC is content pertinent to OME-Zarr as a standard. I would particularly like to see the following.

  1. Details about how OME metadata would be integrated into the new storage format.

The OME metadata will be stored in the attributes of the groups' zarr.json files under a new versioned namespace https://ngff.openmicroscopy.org/0.5. This is explained in this section in the RFC.
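A minimal sketch of how a reader might pull the OME block out of a group's `zarr.json`, assuming the namespace URL given above is used verbatim as the attribute key. The helper name `ome_metadata` and the contents of the example (`multiscales` with a `name`) are illustrative only, not taken from the RFC text.

```python
import json

# Namespace key as stated in this thread; treat it as an assumption for other versions.
OME_NAMESPACE = "https://ngff.openmicroscopy.org/0.5"


def ome_metadata(zarr_json: dict, namespace: str = OME_NAMESPACE) -> dict:
    """Return the OME-Zarr metadata stored under `namespace` in a group's attributes."""
    if zarr_json.get("zarr_format") != 3 or zarr_json.get("node_type") != "group":
        raise ValueError("expected the zarr.json of a Zarr v3 group")
    # An absent namespace yields an empty dict rather than an error.
    return zarr_json.get("attributes", {}).get(namespace, {})


# Hypothetical group metadata document, as it might appear on disk.
example = json.loads("""
{
  "zarr_format": 3,
  "node_type": "group",
  "attributes": {
    "https://ngff.openmicroscopy.org/0.5": {
      "multiscales": [{"name": "example"}]
    }
  }
}
""")

ome = ome_metadata(example)
```

Because each version gets its own key, metadata for several OME-Zarr versions can sit in the same `attributes` object without overwriting each other.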

  2. Clarity about the status of image data in Zarr v2. Is Zarr v2 deprecated? Do we plan to maintain Zarr v2 as part of the OME-Zarr specification?
  3. Should there be OME-Zarr v2 and OME-Zarr v3 specifications?

OME-Zarr <0.5 only supports Zarr v2 and OME-Zarr ≥0.5 will only support Zarr v3. It is recommended that implementations support a number of OME-Zarr versions so that existing data can still be read. I think that recommendation is useful not only for this RFC. This is discussed in this section in the RFC.

  4. Further guidance on how to transition Zarr v2 content to Zarr v3. For each of the metadata examples in the current specification, how do those appear in the Zarr v3 specification?

There is some mention of migration scripts in this section of the RFC. The next iteration of the RFC after the first round of reviews will also contain the changes to the spec document, with updated examples.

  5. Is it compliant for an array to contain both Zarr v2 and Zarr v3 metadata files side by side?

Yes, Zarr v2 and v3 metadata files can exist side-by-side. As a consequence, OME-Zarr 0.4 and 0.5 metadata should also be able to exist side-by-side. With the new versioned namespace, this will be even easier for future versions. We should probably add an explicit recommendation that implementations prefer the newest version of the metadata that they support.
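The "prefer the newest version you support" rule could be sketched as follows. This assumes OME-Zarr 0.4 metadata lives in the Zarr v2 `.zattrs` file and 0.5 metadata in the Zarr v3 `zarr.json` file, as implied by the discussion above. `pick_metadata_file` and the `CANDIDATES` table are hypothetical names, not part of any existing library.

```python
from pathlib import Path

# Newest OME-Zarr version first; each entry names the file that would carry it.
CANDIDATES = [
    ("0.5", "zarr.json"),  # Zarr v3 metadata document
    ("0.4", ".zattrs"),    # Zarr v2 user attributes
]


def pick_metadata_file(node_dir, supported=("0.4", "0.5")):
    """Return (version, path) for the newest supported metadata present, else None."""
    node_dir = Path(node_dir)
    for version, filename in CANDIDATES:
        candidate = node_dir / filename
        if version in supported and candidate.is_file():
            return version, candidate
    return None
```

A reader that only understands 0.4 would call `pick_metadata_file(path, supported=("0.4",))` and still find its metadata in a hierarchy that carries both.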