ome / ngff

Next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
https://ngff.openmicroscopy.org

OME Metadata Support #104

Open joshmoore opened 2 years ago

joshmoore commented 2 years ago

This issue captures the requirements as well as the possible implementation choices for a first integration of OME metadata into the OME-NGFF container. The UoD team is targeting the specification as well as implementations, including within OMERO, by mid-2022.

Goal

NGFF specifications up to version 0.4 contain only a minimal set of the metadata fields covered by the existing OME model (e.g. physical pixel size). As a result, converting data into OME-NGFF, e.g. with bioformats2raw, loses more metadata than the equivalent conversion to OME-TIFF. The goal of this issue is to achieve parity between the two formats in terms of capturing metadata contained in the OME model (2016-06).

Here we would like to discuss, plan, and specify an initial integration of the OME model into OME-NGFF. As with other specifications, this initial work will likely be followed by multiple, possibly breaking, changes to expand the scope. Where possible, we will also try to capture that roadmap here.

In-scope requirements

Out-of-scope requirements


Design decision #1: Location of metadata

The current NGFF specification solely uses the format's custom attributes (.zattrs) for storing metadata. Several other locations are conceivable, though some are more or less within the bounds of the Zarr specification. (See https://github.com/zarr-developers/zarr-specs/issues/112 for more discussion.)

Option a) .zattrs

The status quo at the moment is that all metadata should be represented as JSON in .zattrs. The benefit is that no new mechanism needs to be introduced. A downside is that metadata is spread across multiple zgroups and zarrays (See related comments in https://github.com/ome/ngff/issues/102). Projects such as xarray store metadata in “well-known” keys within the .zattrs like _ARRAY_DIMENSIONS (docs).
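
As a rough illustration of option a), the sketch below writes and reads `.zattrs` files with only the Python standard library. The `_ARRAY_DIMENSIONS` key is the xarray convention mentioned above; the array path `0`, the `example_metadata` key, and its contents are hypothetical placeholders and not part of any specification. It also shows the downside noted above: a reader has to visit more than one `.zattrs` file to collect everything.

```python
import json
import tempfile
from pathlib import Path


def write_zattrs(node_dir: Path, attrs: dict) -> Path:
    """Write the custom attributes of a Zarr v2 group or array as JSON."""
    node_dir.mkdir(parents=True, exist_ok=True)
    path = node_dir / ".zattrs"
    path.write_text(json.dumps(attrs, indent=2))
    return path


def read_zattrs(node_dir: Path) -> dict:
    """Read custom attributes back; a missing file means no attributes."""
    path = node_dir / ".zattrs"
    return json.loads(path.read_text()) if path.exists() else {}


root = Path(tempfile.mkdtemp()) / "image.zarr"

# xarray-style "well-known" key stored on an array (hypothetical array "0")
write_zattrs(root / "0", {"_ARRAY_DIMENSIONS": ["t", "c", "z", "y", "x"]})

# Hypothetical metadata key stored on the parent group
write_zattrs(root, {"example_metadata": {"PhysicalSizeX": 0.5}})

# Metadata now lives in two separate files that must both be read
array_attrs = read_zattrs(root / "0")
group_attrs = read_zattrs(root)
```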

Option b) Custom files

An alternative is to introduce new files outside the scope of the Zarr spec, which only defines .zattrs, .zarray, .zgroup, and chunk files. bioformats2raw currently stores metadata in a file named METADATA.ome.xml. Other projects like netcdf-c store custom files (e.g. .nczarr; docs) with their own proprietary customizations. The benefit of this strategy is maximum flexibility since no key conflicts can occur. Implementations may need to be aware that such files are essentially 1-dimensional byte arrays.

Option c) Arrays

Metadata files can be encoded as Zarr arrays, which is similar to option b) but does not require introducing any new Zarr behavior. Additionally, the files themselves can carry metadata in their own .zattrs and be chunked. However, all tools that wish to consume them must be Zarr-aware.
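
To make option c) more concrete, here is a minimal sketch of a metadata document stored as a 1-dimensional, uncompressed `uint8` array. It assumes the Zarr v2 on-disk layout and writes the files by hand with the standard library rather than through a Zarr library; the directory name `metadata` and the chunk size are arbitrary choices for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_bytes_as_zarr_array(payload: bytes, array_dir: Path, chunk_size: int = 64) -> None:
    """Store bytes as an uncompressed 1-D uint8 array (Zarr v2 layout assumed)."""
    array_dir.mkdir(parents=True, exist_ok=True)
    zarray = {
        "zarr_format": 2,
        "shape": [len(payload)],
        "chunks": [chunk_size],
        "dtype": "|u1",
        "compressor": None,
        "fill_value": 0,
        "filters": None,
        "order": "C",
    }
    (array_dir / ".zarray").write_text(json.dumps(zarray))
    for index, start in enumerate(range(0, len(payload), chunk_size)):
        chunk = payload[start:start + chunk_size]
        # Edge chunks are padded with the fill value to the full chunk size
        (array_dir / str(index)).write_bytes(chunk.ljust(chunk_size, b"\x00"))


def read_bytes_from_zarr_array(array_dir: Path) -> bytes:
    """Reassemble the chunks and trim the padding using the stored shape."""
    zarray = json.loads((array_dir / ".zarray").read_text())
    length, chunk_size = zarray["shape"][0], zarray["chunks"][0]
    n_chunks = -(-length // chunk_size)  # ceiling division
    data = b"".join((array_dir / str(i)).read_bytes() for i in range(n_chunks))
    return data[:length]


payload = ("<OME>" + "x" * 140 + "</OME>").encode()  # stand-in for an OME-XML document
array_dir = Path(tempfile.mkdtemp()) / "image.zarr" / "metadata"
write_bytes_as_zarr_array(payload, array_dir)
restored = read_bytes_from_zarr_array(array_dir)
```

A consumer without Zarr support would have to repeat the chunk reassembly above, which is the cost this option trades for staying inside the Zarr specification.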

Option d) String

Metadata can be encoded as a single (albeit large) string within .zattrs. Depending on Design Decision #2, storing a single string with the metadata has the advantage of working with existing formats as well as consolidating the metadata, but it does require escaping, etc.
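
The escaping cost of option d) can be seen in a small sketch. The key name `ome_xml` is a hypothetical choice, and the XML is a trimmed placeholder rather than a schema-valid document.

```python
import json

# Trimmed placeholder document; a real OME-XML file would be far larger
ome_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">\n'
    '  <Image ID="Image:0" Name="example"/>\n'
    '</OME>'
)

# "ome_xml" is a hypothetical key; JSON encoding escapes quotes and newlines
zattrs_text = json.dumps({"ome_xml": ome_xml})

# Decoding the JSON restores the original document byte for byte
restored = json.loads(zattrs_text)["ome_xml"]
```

The round trip is lossless, but the stored form is no longer readable as XML without first decoding the JSON string.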


Design decision #2: Format of metadata

Similarly to #1, currently all metadata is stored as JSON within .zattrs.

Option a) Design a JSON format

The option closest to the current NGFF process would consist of specifying a new JSON format to capture all of the information in OME-XML. This process would likely be extended and would need to be maintained for some time. One route to achieving it would be to generate json-schema from the XSD using ome-types.

Option b) Use the JSON-LD syntax of OME-OWL

Using JSON-LD would keep the metadata in JSON but would make use of the existing work on OME-OWL, and therefore not create another format that needs supporting. Additionally, the JSON-LD model provides an extensibility that is needed within the community. The downside is increased complexity in the programming model.

Option c) Store the OME-XML directly.

Finally, if the first goal is to support the existing model, using the OME-XML model is likely the fastest route. Downsides include the general aversion felt towards XML as well as the need to map between XML elements/identities and objects specified within the JSON. There will also be no extensibility (beyond the standard annotations) in the first instance.


Implementation reports

Below we enumerate possible implementations and (eventually) the status of investigations into each of them. If anyone else is interested in proposing (or especially prototyping) an implementation, please mention so below.

1b2c: standardize the current bioformats2raw format

Standardizing the bioformats2raw output would require:

An additional benefit of this implementation is that the current bioformats2raw code can be adopted as the official .

Related issues:

joshmoore commented 2 years ago

Quick update that with the script below as well as https://github.com/ome/ome-zarr-py/pull/174, it's possible to build a Python reader of the bioformats2raw output. The next step would likely be an implicit convention or explicit metadata for mapping between the IDs in the OME-XML and the groups in the Implicit spec. (Currently raw2ometiff assumes the values are the offsets into the array.)

ome_types parser

```python
#!/usr/bin/env python

import ome_types
import os
import re
import sys
import tempfile
import xml
import zarr

from ome_zarr.io import parse_url
from ome_zarr.reader import Reader
from xml.etree import ElementTree as ET


def fix_xml(ns, elem):
    """
    Note: elem.insert() was not updating the object correctly.
    """
    if elem.tag == f"{ns}Pixels":
        elem.append(ET.Element(f"{ns}MetadataOnly"))


def parse_xml(filename):
    # Parse the file and find the current schema
    root = ET.parse(filename)
    m = re.match(r'\{.*\}', root.getroot().tag)
    ns = m.group(0) if m else ''

    # Update the XML to include MetadataOnly
    for child in list(root.iter()):
        fix_xml(ns, child)
    fixed = ET.tostring(root.getroot()).decode()

    # Write file out for ome_types
    with tempfile.NamedTemporaryFile() as t:
        t.write(fixed.encode())
        t.flush()
        return ome_types.from_xml(t.name)


def handle_ome_zarr(filename):
    reader = Reader(parse_url(filename))
    import pdb; pdb.set_trace()
    for node in reader():
        print("Found node", node)
        metadata = node.zarr.subpath("OME/METADATA.ome.xml")
        print("Looking for metadata in ", metadata)
        if os.path.exists(metadata):
            data = parse_xml(metadata)
            print(data)


if __name__ == "__main__":
    import argparse
    import logging
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("filename", nargs="+")
    ns = parser.parse_args()
    if ns.verbose:
        logging.basicConfig(level=logging.DEBUG)
    for x in ns.filename:
        print(f"Handling {x}")
        handle_ome_zarr(x)
```

sbesson commented 2 years ago

Possibly in scope for this work - see the discussion happening in https://github.com/ome/ngff/issues/107 around the confusion created by the usage of integers for group names.

Except in the HCS case, the current bioformats2raw layout heavily relies on this integer-based naming scheme, as it allows the individual Zarr multiscales groups to be mapped by name to the corresponding Image element in the OME-XML metadata. Relaxing this constraint would likely mean storing this relationship in the metadata, e.g. using indices similarly to what has been done in https://github.com/ome/ngff/pull/24. I suspect that also overlaps with the considerations in https://github.com/glencoesoftware/bioformats2raw/issues/126.

chris-allan commented 2 years ago

Obviously, we're closest to 1b2c being usable but it's the least comfortable; like a jacket that's functional but just doesn't fit right. Given this uncomfortable position and certainly the community desire for more thorough use of Zarr group/array metadata and JSON markup I'd suggest that if we decide to formalize it maybe we call it OME-NGFF Transitional or something to that effect?

What follows is some background for how bioformats2raw got to where it currently is.

History

For the benefit of everyone let me first outline the history of how bioformats2raw got to where it is and hopefully that will help people understand the motivations behind its current structure. I'll just state for the record that the initial design decisions behind both bioformats2raw and raw2ometiff were made solely by Glencoe Software and that the primary domains we operate in are whole slide imaging and high content screening. The use cases we strive to support closely reflect our customer base and don't necessarily completely overlap with those of the OME community as a whole.

When work on bioformats2raw started back in October 2019 the repository was called mrxs2ometiff. Like the wider OME community ^1 we were frustrated with the usability and performance problems associated with a growing number of proprietary file formats. So were our customers. The Glencoe Software and University of Dundee teams wrote ^2 about this as it applies to digital pathology in May 2019. Given the OME tooling available and the difficulties involved in performing real-time translation of the MRXS file format, converting to OME-TIFF made sense. However, adding an MRXS reader to core Bio-Formats was out of the question and bfconvert had serious performance and scalability problems. Consequently, the project was born.

To date, bioformats2raw remains the only place in the OME ecosystem where support for formats like MRXS and BioTek Cytation is present. mrxs2ometiff, and now the combination of bioformats2raw and raw2ometiff, are responsible for converting 10s of TBs of our customers' whole slide imaging data into OME-TIFF every week for use with OMERO, QuPath, Vitessce and numerous other open source tools as well as commercial visualization and analysis software like HALO and Visiopharm.

Around the same time, we were commissioned by several of our customers to develop tooling to convert Philips' iSyntax file format into OME-TIFF. Due to the nature of Philips' file format at the time the only option was to depend on Philips' SDK to support this conversion and the only programming language the SDK was available in was Python. Rewriting the entirety of high performance TIFF writing and OME-XML support present in Bio-Formats and mrxs2ometiff in Python was again out of the question. "Two stage" conversion was born, mrxs2ometiff was split into bioformats2raw and raw2ometiff and the isyntax2raw project started. We wrote about this ^3 in December 2019.

All of this work predates the formalization of OME-NGFF and, as stated in the aforementioned blog post, "...converts to a temporary N5 or Zarr structure" and "The current N5 or Zarr intermediate format should be considered temporary at this stage of development as it is likely to undergo several changes over the coming months." The default intermediate format was initially N5. Zarr support was poor at best and completely broken at worst.

isyntax2raw and raw2ometiff remain the only place in the OME ecosystem where support for formats like iSyntax is present, and these tools are also responsible for converting 10s of TBs of our customers' data into OME-TIFF every week.

bioformats2raw was not compatible with OME-NGFF until version 0.3.0, nearly 4 months after the OME-NGFF 0.2.0 specification was published.

I hope this helps everyone understand that as things stand today, the primary use case in the wild for bioformats2raw is as a vehicle for conversion into OME-TIFF via raw2ometiff. This does not mean we think this use case deserves some kind of preferential treatment but rather that it be considered equally. Furthermore, the individuals who perform this use case are not going to be watching GitHub or posting on image.sc.

Rationale for some design decisions

  1. What is OME/METADATA.ome.xml and why does it exist?

The content of OME/METADATA.ome.xml is the full OME-XML document produced by Bio-Formats reflecting the OME data model metadata for all images Bio-Formats recognizes in the specified input file. That is, at a fundamental level bioformats2raw does not convert images; it converts filesets. We need this metadata to deliver on the conversion to OME-TIFF use case. It is also expected that this metadata be easily resolvable against the OME-NGFF structure to simplify the job of raw2ometiff.

  2. Why does bioformats2raw extend the hierarchy?

As mentioned above, bioformats2raw converts filesets. In Bio-Formats and OMERO parlance, each image of a fileset is identified by its "series". "series" is an ascending number starting from 0 and the order is tightly controlled between Bio-Formats versions ^4. Within the Bio-Formats API, this is the only way of identifying a particular image. Consequently, it makes most sense for each image in the bioformats2raw output to reflect its series. Otherwise, at least with the currently available tooling and without a great deal of additional effort, it is both impossible to reproduce the correct ordering as far as Bio-Formats is concerned and impossible to relate each image to its corresponding <Image> block within OME/METADATA.ome.xml.

These criteria are essential for being able to drive raw2ometiff to produce correct OME-TIFF output and why we state emphatically in the documentation that fiddling with --scale-format-string may break compatibility. We probably should go further than that in saying that it may also break compatibility and resolution against OME/METADATA.ome.xml as well. Furthermore, these criteria reflect the way in which OMERO initializes Bio-Formats within the PixelBuffer infrastructure to get access to the correct pixel data corresponding to a particular Image. We use the output of bioformats2raw in combination with OMERO and OMERO microservices extensively.
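
The series-to-`<Image>` resolution described here can be sketched roughly as follows. This is an illustration only, not the raw2ometiff implementation: it assumes the group for series N is named `"N"` and that document order of the `<Image>` elements equals series order. The XML is a trimmed placeholder that omits required children such as `Pixels`.

```python
import xml.etree.ElementTree as ET

# Trimmed placeholder; real documents carry Pixels and much more per Image
OME_XML = """<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">
  <Image ID="Image:0" Name="overview"/>
  <Image ID="Image:1" Name="label"/>
  <Image ID="Image:2" Name="macro"/>
</OME>"""


def series_to_image(ome_xml: str) -> dict:
    """Map each series group name ("0", "1", ...) to its Image element.

    Assumes the order of <Image> elements equals the series order.
    """
    root = ET.fromstring(ome_xml)
    # Derive the "{namespace}" prefix from the root tag, if any
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    images = root.findall(f"{ns}Image")
    return {str(series): image for series, image in enumerate(images)}


mapping = series_to_image(OME_XML)
names = {group: image.get("Name") for group, image in mapping.items()}
```

Renaming or reordering the groups breaks this implicit lookup, which is why changing the naming scheme affects resolution against OME/METADATA.ome.xml.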

  3. What is this top level group and what does the metadata (ex. bioformats2raw.layout) mean?

Given the bioformats2raw legacy as well as the current lack of formalization surrounding the concept of a fileset we needed a way to differentiate the historical bioformats2raw output layouts (there have been several) to downstream tooling like raw2ometiff.
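
A downstream tool could check for this marker along the following lines. This is a sketch using the standard library only: the key name `bioformats2raw.layout` comes from the discussion above, while the helper name and the example value `3` are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def bioformats2raw_layout(root_dir: Path) -> Optional[int]:
    """Return the layout version from the top-level .zattrs, or None if absent."""
    attrs_path = root_dir / ".zattrs"
    if not attrs_path.exists():
        return None
    attrs = json.loads(attrs_path.read_text())
    return attrs.get("bioformats2raw.layout")


# Build a throwaway top-level group; the value 3 is only an example
root = Path(tempfile.mkdtemp()) / "image.zarr"
root.mkdir()
(root / ".zattrs").write_text(json.dumps({"bioformats2raw.layout": 3}))

layout = bioformats2raw_layout(root)
```

A reader can branch on the returned value to choose how to interpret the hierarchy, and treat `None` as "not bioformats2raw output".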

will-moore commented 2 years ago

I seem to remember that Vitessce made use of the METADATA.ome.xml in the Zarr from bioformats2raw, but I don't see any mention of that now http://vitessce.io/docs/data-file-types/#rasterome-zarr. @keller-mark @ilan-gold Am I wrong or has something changed?

keller-mark commented 2 years ago

Vitessce currently uses the metadata from the .zattrs to support loading OME-NGFF via Zarr. Vitessce also supports a (pre-OME-NGFF) Bioformats-Zarr format, which I believe makes use of METADATA.ome.xml (@ilan-gold or @manzt would know better) https://github.com/hms-dbmi/viv/blob/master/src/loaders/zarr/index.ts#L24 (in the Vitessce docs, this corresponds to raster.json with "type": "zarr")

imagesc-bot commented 2 years ago

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/collections-in-ome-ngff/63656/6

joshmoore commented 2 years ago

@keller-mark, thanks! @ilan-gold / @manzt: any thoughts on the value (or perhaps the cost) of trying to focus on Vitessce as the client for this work as opposed to updating vizarr to make use of OME-XML?

ilan-gold commented 2 years ago

If you were to officially focus on OMEXML as the fastest route to full metadata support, we would probably just add this to the core Viv library and then add a loader to Vitessce for it, neither of which would take too long. I can't comment on Vizarr though.

joshmoore commented 2 years ago

@ilan-gold : that's certainly the current goal of https://github.com/ome/ngff/pull/112. As the only other reader implementation, if you have any lessons learned to add on top of http://api.csswg.org/bikeshed/?url=https://raw.githubusercontent.com/joshmoore/ngff/bf2raw/latest/index.bs#bf2raw, do let me know.

ilan-gold commented 2 years ago

@manzt and @joshmoore I am not sure if even Vizarr obeys the directive "SHOULD parse all images" - this seems taxing to do over HTTP and I was under the impression that it is something that was to be avoided because the metadata can often be inferred. Other than this, I don't have any comments. Very exciting!

joshmoore commented 2 years ago

> this seems taxing to do over HTTP and I was under the impression that it is something that was to be avoided because the metadata can often be inferred

Fair point. I think that's more a wording issue than an intent. I mean that clients should not ignore those images, by for example not disclosing their existence to a user.

manzt commented 2 years ago

It should be straightforward to find and return this metadata for OME-NGFF once a format/location are decided on. As it stands, Viv is fairly unopinionated about what the metadata is; its loaders more or less find OME-XML/.zattrs and the client (Vitessce/Vizarr) is responsible for choosing what to do with it.

Given that Vitessce supports both OME-TIFF/OME-NGFF, I'm guessing it will be easier to display OME Metadata out of the box (compared to Vizarr) once support in Viv is added since it is already configured to display/use OME-XML for OME-TIFF.

joshmoore commented 2 years ago

https://github.com/ome/ngff/issues/104#issuecomment-1092706651

> this seems taxing to do over HTTP and I was under the impression that it is something that was to be avoided

Pushed to #112, @ilan-gold, but reading it, I almost wonder if readers MUST detect the presence and SHOULD make it clear to users but only MAY show multiple images. Tricky.

https://github.com/ome/ngff/issues/104#issuecomment-1094077490

> It should be straightforward to find and return this metadata for OME-NGFF once a format/location are decided on.

Glad to hear it. Is there anything that should happen from our side to move this forward?

One thing to note: there will be a next spec that will replace this one and be more explicit, e.g., the location of the XML might be configurable. But considering the amount of bf2raw 0.4.0 data that's out there, supporting this one may make sense. (If you think it's useful, we can also write down the earlier specs)

ilan-gold commented 2 years ago

> Pushed to #112, @ilan-gold, but reading it, I almost wonder if readers MUST detect the presence and SHOULD make it clear to users but only MAY show multiple images. Tricky.

This certainly sounds the most like the behavior you're going for.

> Glad to hear it. Is there anything that should happen from our side to move this forward?

Not to my knowledge, but perhaps I missed something here - do we not support something at the moment? Happy to remedy that, but I thought this was a proposal for future implementations

joshmoore commented 2 years ago

I think an older version is supported, and by specifying the current one, I was hoping to give everyone in the community enough confidence to implement that while we worked on the replacement.

ilan-gold commented 2 years ago

@manzt is that what https://github.com/hms-dbmi/viv/issues/403 is referring to? Do you want to lay out any pitfalls before I do this or respond to what Josh has said?

imagesc-bot commented 1 year ago

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/using-bioformats2raw-for-creating-ome-zarr-scale-format-string/72716/6

aliaksei-chareshneu commented 1 year ago

Dear all,

Could you tell me please if it is currently possible to convert OME TIFF without pyramids to OME NGFF?

Thank you for any input, Best regards, Aliaksei

will-moore commented 1 year ago

Hi @aliaksei-chareshneu - yes, certainly. Either with https://www.glencoesoftware.com/products/ngff-converter/ or the underlying https://github.com/glencoesoftware/bioformats2raw command-line tool. Both use Bio-Formats to read files, so they support all the formats that Bio-Formats supports.

aliaksei-chareshneu commented 1 year ago

> Hi @aliaksei-chareshneu - yes, certainly. Either with https://www.glencoesoftware.com/products/ngff-converter/ or the underlying https://github.com/glencoesoftware/bioformats2raw command-line tool. Both use Bio-Formats to read files, so they support all the formats that Bio-Formats supports.

@will-moore, thank you very much. Could you tell me please if it would result in some loss of metadata?

will-moore commented 1 year ago

@aliaksei-chareshneu There shouldn't be a loss of metadata. Both options will generate OME NGFF data in the bioformats2raw-layout, which includes an ome.xml file at image.zarr/OME/METADATA.ome.xml. This will contain all the OME metadata that bioformats can read from the input file.
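
One way to confirm that the metadata file is present after a conversion is sketched below. It uses only the standard library and builds a stand-in `image.zarr` directory so it can run without a real conversion; the helper name `read_ome_xml`, the image name `converted.tiff`, and the trimmed XML are illustrative assumptions, and the check only works for a local directory.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def read_ome_xml(zarr_root: Path):
    """Return the parsed OME root element, or None if the file is absent."""
    path = zarr_root / "OME" / "METADATA.ome.xml"
    if not path.is_file():
        return None
    return ET.parse(str(path)).getroot()


# Stand-in for converter output, with a trimmed placeholder document
zarr_root = Path(tempfile.mkdtemp()) / "image.zarr"
(zarr_root / "OME").mkdir(parents=True)
(zarr_root / "OME" / "METADATA.ome.xml").write_text(
    '<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">'
    '<Image ID="Image:0" Name="converted.tiff"/></OME>'
)

ome = read_ome_xml(zarr_root)
# Collect the names of all Image elements regardless of schema namespace
image_names = [el.get("Name") for el in ome.iter() if el.tag.endswith("}Image")]
```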

imagesc-bot commented 9 months ago

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/microscopy-metadata-in-zarr-files/87399/2