ome / ngff

Next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
https://ngff.openmicroscopy.org

Referencing zarr stores, groups, arrays, and attributes #144

Open bogovicj opened 2 years ago

bogovicj commented 2 years ago

There are many use cases in v0.4 spec and in PRs that need to reference zarr arrays or groups:

References to other data in all of these cases should be consistent.

@axtimwalde and I like what @jbms does with neuroglancer and discusses here: https://github.com/zarr-developers/zarr-specs/issues/132. A "URL" syntax would be great for storing references as JSON strings as needed above, and for enabling users to point software to zarr stores / groups / arrays.

It would be great to use whatever standard is reached at the zarr level, but we'll need something soon / now.

Though, a full URL might be overkill for multiscales, or when referencing a set of groups / arrays that are the immediate children of "this" group in the same zarr store. (?)

ivirshup commented 2 years ago

For expediency and compatibility, I think it makes sense to have a two argument version. I would propose we allow path arguments to be either

Draft json-schema

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://ngff.openmicroscopy.org/latest/schemas/array_path.schema",
  "title": "OME Array Path",
  "description": "Path to an array in an OME-NGFF store",
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "path": {
          "type": "string",
          "description": "Local path to array"
        },
        "store_uri": {
          "type": "string",
          "format": "uri",
          "description": "URI to root of store"
        }
      },
      "required": ["path"]
    },
    {
      "type": "string",
      "description": "Local path to array"
    }
  ]
}
```
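As a rough illustration of what the `oneOf` means for a reader, the sketch below normalizes both accepted forms (a bare string, or an object with `path` and an optional `store_uri`) into one internal representation. The names `ArrayRef` and `normalize_array_path` are hypothetical and not part of any NGFF library, and treating a missing `store_uri` as "the current store" is an assumption.

```python
from typing import NamedTuple, Optional, Union


class ArrayRef(NamedTuple):
    """Hypothetical internal form of an array reference."""
    path: str
    store_uri: Optional[str]  # None is assumed to mean "the current store"


def normalize_array_path(value: Union[str, dict]) -> ArrayRef:
    """Accept either branch of the draft schema's oneOf."""
    if isinstance(value, str):
        # String branch: a local path only.
        return ArrayRef(path=value, store_uri=None)
    if isinstance(value, dict):
        # Object branch: "path" is required, "store_uri" is optional.
        path = value.get("path")
        if not isinstance(path, str):
            raise ValueError('object form requires a string "path"')
        store_uri = value.get("store_uri")
        if store_uri is not None and not isinstance(store_uri, str):
            raise ValueError('"store_uri" must be a string when present')
        return ArrayRef(path=path, store_uri=store_uri)
    raise TypeError("array path must be a string or an object")
```

Downstream code then only has to handle `ArrayRef`, which is one way to contain the cost of supporting two spellings.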

Even though I'd really like to have a single relative URI for paths, I think this won't work out right now, since we'll need to know where the root group is.

When a standard is figured out at the zarr level, I think we'd be able to just deprecate the two-argument variant or add another case for zarr URLs.

Would love a suggestion for us just being able to use a URI though.

joshmoore commented 2 years ago

Discussing Zarr V3 with @jbms and @jstriebel yesterday, there were a number of options that were discussed around the external URI syntax for Zarr:

Most likely a change proposal will be raised to have URLs point to the metadata file in question (dropping the /meta subpath) and to require that the store directory end in .zarr.

To @ivirshup's points, I have a knee jerk reaction against the oneOf solution since it feels like it will introduce baggage that we will need to carry around for a while. I would personally almost prefer to move to object syntax across the board if it's needed at all. Another alternative might be allowing different handlers depending on a prefix.

Edit: and from the meeting Wed., I'll add that another option would be to specify the "root" in the OME-Zarr metadata in order to work around the Zarr V2 limitation.

joshmoore commented 2 years ago

Quick update from a brainstorming session with @kevinyamauchi and @bogovicj:

An export of the scratch worksheet we used is available for download: NGFF URI Use cases.pdf

ivirshup commented 2 years ago

> I have a knee jerk reaction against the oneOf solution since it feels like it will introduce baggage that we will need to carry around for a while.

I also don't love this. My original draft of this proposal was to have both "only object" and "object or string" schemas.

The main reasons I went with oneOf above:

"New field proposal"

This means anywhere that currently looks like { "path": "s3:...", ... } could instead look like { "path": "s3:...", "access_key": "PASSWORD123", ... }, right? I would be worried about namespace collisions. Also, how would this work for lists of paths?

joshmoore commented 2 years ago

> I would be worried about namespace collisions.

Definitely. We'd need to work that out carefully. Two possibilities come to mind: either putting everything in a well-defined sub-object ({"content": {"type": "s3", "access_key": ...}}) or introducing a proper prefixing mechanism ({"path": ..., "s3:access_key": ...}).

> Also, how would this work for lists of paths?

So far we don't have those. We only have lists of objects that have a path.

ivirshup commented 2 years ago

> how would this work for lists of paths?
>
> So far we don't have those.

The table spec has this:

https://github.com/ome/ngff/blob/cc83a82c716670fb60d2d7f8a89f4f700a17b788/latest/index.bs#L193

joshmoore commented 2 years ago

Yup, but there's definitely still time to influence that spec which is why it's a good time to nail down the idiom(s) we want to use for path references now.

kevinyamauchi commented 1 year ago

Hello. Just checking in here to see what the status of this is. I think this is the final blocking thing for https://github.com/ome/ngff/pull/64 and I think this is also needed for #138.

If we can't settle on the long term solution now, perhaps we should aim for an intermediate solution to unblock #64 and #138 and then clean up paths in 0.6.0?

joshmoore commented 1 year ago

Definitely still up for the possibility of kicking this down the road, but by way of polling those who are following along, I'll create three comments following this one, one for each of the proposals I listed in https://github.com/ome/ngff/issues/144#issuecomment-1272351065, to see if we can get a sense of what people are thinking. Please vote with an emoji reaction on the comment itself.

joshmoore commented 1 year ago

Choice #1: "Protocol proposal": relpath:foo, abspath:/foo, etc.
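A minimal sketch of how a reader might dispatch on such a prefix, assuming only the two prefixes named above exist. The function name `parse_prefixed_path` and the rule that `abspath` values must start with `/` while `relpath` values must not are assumptions for illustration, not part of the proposal.

```python
from typing import Tuple

# Only the two prefixes mentioned in the proposal; a real spec could define more.
KNOWN_PREFIXES = ("relpath", "abspath")


def parse_prefixed_path(ref: str) -> Tuple[str, str]:
    """Split 'prefix:rest' and check the (assumed) shape of each kind."""
    prefix, sep, rest = ref.partition(":")
    if not sep or prefix not in KNOWN_PREFIXES:
        raise ValueError(f"no known prefix in {ref!r}")
    # Assumed rules: absolute paths are rooted, relative paths are not.
    if prefix == "abspath" and not rest.startswith("/"):
        raise ValueError("abspath values are assumed to start with '/'")
    if prefix == "relpath" and rest.startswith("/"):
        raise ValueError("relpath values are assumed not to start with '/'")
    return prefix, rest
```

A reader that meets an unknown prefix fails loudly here, which is one possible policy; silently treating the whole string as a plain path would be another.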

joshmoore commented 1 year ago

Choice #2: "Object proposal": {"path": {"type": "s3", "endpoint": "https:...", "path": "a/b/c", "access_key": "..."}}

joshmoore commented 1 year ago

Choice #3: "New field proposal": {"path": ..., "s3:access_key": ...}
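To make the prefixing idea concrete, here is a small sketch that separates plain keys from `prefix:name` keys so each namespace can be handed to its own handler. The helper name `split_prefixed_fields` and the sample keys (`s3:endpoint`, `s3:region`) are hypothetical; the proposal itself only shows `s3:access_key`.

```python
from typing import Any, Dict, Tuple


def split_prefixed_fields(
    obj: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Dict[str, Any]]]:
    """Return (plain fields, {prefix: {name: value}}) for one metadata object."""
    plain: Dict[str, Any] = {}
    namespaced: Dict[str, Dict[str, Any]] = {}
    for key, value in obj.items():
        prefix, sep, name = key.partition(":")
        if sep and prefix and name:
            # 's3:endpoint' -> namespaced['s3']['endpoint']
            namespaced.setdefault(prefix, {})[name] = value
        else:
            plain[key] = value
    return plain, namespaced
```

Because unprefixed keys stay in their own dictionary, existing fields such as `path` cannot collide with protocol-specific ones under this scheme.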

sbesson commented 1 year ago

A few comments while reading this issue and based on Glencoe’s experience of dealing with NGFF on object storage, authentication and authorization.

Most of the concepts described above are geared towards an Amazon S3 implementation. As reinforced during the ongoing OME2022 community meeting, this is undoubtedly one of the primary use cases and it makes sense for the community to start there. As this issue aims to formalize such data access patterns, it might be useful to also think how one would support other object storage solutions. In proposals 2 and 3, does the s3 prefix/object type represent any S3 compatible storage where different implementations might require specific keys? Or would a new object storage technology require the definition of a new type/prefix?

Most importantly, I would strongly discourage this specification from suggesting to store credentials such as access/secret keys in the .zattrs metadata. This inherently carries a series of security risks especially as publication is one of the desired outcomes for these scientific datasets. It’s also technically dubious as many organizations forbid issuing long lived credentials or rely on other authorization techniques such as credential profile files, container or instance profiles.

/cc @chris-allan

joshmoore commented 1 year ago

Briefly, in any examples from my side "s3" is just a stand-in for something that can be readily understood in examples, since this is more a question of syntax and a discussion on the actual content would still need to be had. (Also agreed on passwords, but there the differences in implementations aren't going to make our lives any easier...)

axtimwalde commented 1 year ago

Just wanted to let you know that @bogovicj and @cmhulbert discussed a real world use case and we decided that we want to try this with the following single string URL specification:

$URL = $CONTAINER_URL[?$GROUP_PATH][#$ATTRIBUTE_PATH]

This is fully compatible with standard URL syntax. We want to address an entry in a DOM tree that consists of a container, a group, and an attribute. This allows addressing containers, groups, and attributes alone or in arbitrary combinations, including relative references to the current context, just as is usual in URLs. To separate the three components we use the three components of the URL syntax: the scheme-specific path for the container, the query for the group, and the fragment for the attribute. Format discovery is left to the client and the protocol, e.g. an HTTP server can provide a meaningful Content-Type header, and on the filesystem we can use the usual tricks: looking at file endings, magic bytes, and/or indicative files present in the directory tree. Note that this can address not just Zarr but also other formats like HDF5, N5, OME_TIFF, TENSOR_STORE, ...
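Since the three parts map onto standard URL components, a stock URL parser can already take such a reference apart. The sketch below uses Python's `urllib.parse`; the `Reference` type, the function name, and the sample URL are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import urlsplit, urlunsplit


class Reference(NamedTuple):
    container: str  # scheme, authority, and path
    group: str      # the URL query
    attribute: str  # the URL fragment


def split_reference(url: str) -> Reference:
    """Split $CONTAINER_URL[?$GROUP_PATH][#$ATTRIBUTE_PATH] into its parts."""
    parts = urlsplit(url)
    # Rebuild the container URL without query and fragment.
    container = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return Reference(container, parts.query, parts.fragment)


ref = split_reference("s3://bucket/data.zarr?/volumes/raw#/multiscales[0]/axes")
print(ref.container, ref.group, ref.attribute)
```

Empty components come back as empty strings, which matches the idea that an omitted part refers to the current context.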

Examples for $CONTAINER_URL:

As in standard URL syntax, an empty scheme means file:, a rootless path is relative, and an empty URL refers to the 'current' container in the open context.

$GROUP_PATH is the path to the group or dataset. A rootless path is relative, and an empty path refers to the 'current' container and group path in the open context. Examples:

$ATTRIBUTE_PATH is the path to the attribute in the attributes DOM of the group. A rootless path is relative, and an empty path refers to the 'current' container, group path, and attribute in the open context. Forward slashes separate tree levels. Array elements are indexed in square brackets. Multiple consecutive forward slashes are equivalent to one forward slash, and level separation is optional before and after an index. Examples:
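The tokenization rules for the attribute path can be sketched with a single regular expression, under the assumption that keys contain no slashes or square brackets. The function name and the sample paths are hypothetical, and the sketch does not validate malformed input or handle `..` segments.

```python
import re
from typing import List, Union

# Either a bracketed array index or a run of characters that is not a separator.
_TOKEN = re.compile(r"\[(\d+)\]|([^/\[\]]+)")


def tokenize_attribute_path(path: str) -> List[Union[str, int]]:
    """Turn '/multiscales[0]/axes' into ['multiscales', 0, 'axes']."""
    tokens: List[Union[str, int]] = []
    for index, key in _TOKEN.findall(path):
        # Exactly one of the two groups is non-empty for each match.
        tokens.append(int(index) if index else key)
    return tokens


# Repeated slashes and optional separators around an index give the same tokens.
print(tokenize_attribute_path("/multiscales[0]/axes"))
print(tokenize_attribute_path("multiscales//[0]axes"))
```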

Relative path traversal with ../ is allowed but stops at the fragment and query boundaries respectively. I.e. a relative attributes path in the fragment cannot walk into the groups path in the query and a relative group path in the query cannot walk into the container path in the scheme specific path.
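One way to read "stops at the boundary" is that `..` at the root of a component is simply a no-op, so each component is resolved on its own and can never reach into its neighbor. The sketch below takes that reading, which is an assumption (raising an error would be another valid choice), and also assumes the current path behaves like a directory. The function name is hypothetical.

```python
from typing import List


def resolve_within(current: str, relative: str) -> str:
    """Resolve `relative` against `current` inside one URL component."""
    if relative.startswith("/"):
        segments: List[str] = []  # rooted path: ignore the current context
    else:
        segments = [s for s in current.split("/") if s]
    for seg in relative.split("/"):
        if seg in ("", "."):
            continue  # repeated slashes and '.' change nothing
        if seg == "..":
            if segments:
                segments.pop()  # at the root, '..' is a no-op (assumed)
            continue
        segments.append(seg)
    return "/" + "/".join(segments)
```

Applying this function separately to the query and to the fragment keeps a relative group path from walking into the container path, and a relative attribute path from walking into the group path.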

Examples for complete URLs:

imagesc-bot commented 1 year ago

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/cli-programmatic-browsing-of-ome-zarr-hierarchies-on-idr/75907/7

joshmoore commented 1 year ago

During today's ZEP meeting, Ryan Abernathey pointed to https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md#link-object, a spec from STAC which allows linking to "self", "root", "parent", "child", ...

Possibly to be proposed at the Zarr level.

(Doesn't play well with concurrent writing. Only appropriate for write-once)

jhnnsrs commented 10 months ago

Hi everyone,

I just wanted to revive this thread while trying to wrap my head around the proposed path ideas. To give a bit of background and put my perspective into context, I am developing a "streaming analysis framework" for microscopy data (ongoing documentation efforts here) with an "omero like" data service that comes with a GraphQL API (for metadata and object retrieval), as well as (minio based) S3 storage for zarr-array storage.

When accessing an object, first a query is performed against the GraphQL API which then points directly to the zarr array stored on s3. The clients (after a “preflight” request to receive temporary access_key and secret_key), then directly access S3 to receive binary data. This direct access is (currently) highly necessary for achieving adequate performance.

I would really like to support the OME-NGFF spec in addition, and provide on-the-fly OME-NGFF conversion by providing HTTP endpoints (outside of GraphQL) that would convert the image and metadata models to OME-NGFF compatible specs. Preferably I would like to integrate what I would call “materialized views” on the data:

When requesting an image as OME-NGFF, a client would request https://the-data-micro-service/ngff/image/:id?as_of=”today” as a root element, which would then return the “Image spec” JSON, populating the image metadata according to the information stored in the database (which does not map 1-1 to the OME-NGFF spec), but pointing to the same zarr arrays that I use internally. The same should hold true when requesting the multiwell or multiscale endpoint. I would really like to keep the database as the primary source of truth, and only generate these views on demand (especially because of that as_of part).

Now I am not entirely sure if I understood how well the specification plays with this scenario already. But what I thought my problem would eventually boil down to is the following:

A client (with an OME-NGFF client library) accesses the https://the-data-micro-service/ngff/image/:id endpoint as if it were the .zattrs, parses the JSON, and then tries to resolve paths. It would try to access https://the-data-micro-service/ngff/image/:id/path_to_array rather than https://minio/:id/path_to_array.

While redirects on the server side could be an option, they don’t really play well with authentication. I hope that non-relative paths would fix that issue.
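The difference can be shown with standard URL resolution, here via Python's `urllib.parse.urljoin`. The helper name is hypothetical, `42` stands in for `:id`, and treating the metadata URL as a directory-like base (by appending a trailing slash) is an assumption about how a client would behave.

```python
from urllib.parse import urljoin


def resolve_array_url(metadata_url: str, path: str) -> str:
    """Resolve a path from the metadata against the URL it was fetched from."""
    # Treat the metadata location as a directory so relative paths nest under it.
    base = metadata_url if metadata_url.endswith("/") else metadata_url + "/"
    return urljoin(base, path)


service = "https://the-data-micro-service/ngff/image/42"

# A relative path stays on the metadata service.
print(resolve_array_url(service, "path_to_array"))

# An absolute URL in the metadata points the client straight at the object store.
print(resolve_array_url(service, "https://minio/42/path_to_array"))
```

With absolute references in the metadata, no server-side redirect is needed, although the client still has to obtain credentials for the second host by some other route.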

So, long story short, the TL;DR: What is the current state of this suggestion, and is there currently any consensus about consolidated metadata that would live outside the file hierarchy? Are there any OME-NGFF compliant client libraries that can handle non-relative paths that I could test against? If not, where would be the best place to start and get things rolling? :)

Best and thanks a million for the efforts of this specification :)