ome / ngff

Next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
https://ngff.openmicroscopy.org
Other
119 stars 41 forks source link

unit -> units #168

Closed d-v-b closed 1 year ago

d-v-b commented 1 year ago

Change unit to units. units is the name used by the CF Conventions, and thereby the broader climate science community -- the xarray ecosystem uses the units keyword (see https://docs.xarray.dev/en/stable/examples/multidimensional-coords.html, and https://pint-xarray.readthedocs.io/en/stable/creation.html).

Unless there's a very compelling reason for the singular unit, we should probably aim for compatibility with the climate science community (and their nice tools) and use units.

I also added clarification that the name and type attributes should be strings.

I thought I would need to change the schema, but it looks like units is already in there?

(Update, June 20, 2023): this issue has generated a lively discussion and @joshmoore requested that I summarize the pros and cons of the proposed change:

Arguments in favor of keeping unit:

Arguments in favor of switching to units:

github-actions[bot] commented 1 year ago

Automated Review URLs

will-moore commented 1 year ago

See https://github.com/ome/ngff/pull/157 ;)

will-moore commented 1 year ago

I think I have a slight preference for "units", which is probably why I used units incorrectly in the schemas - Fixed by #157. That PR should still be valid, since for v0.4 we did have "unit". So if we switch back to "units" for v0.5 then we'd need another PR to switch the schemas back again!?

d-v-b commented 1 year ago

Is there a blocker here or can we go ahead and merge this?

will-moore commented 1 year ago

Since #157 is merged now, could you rebase/merge so that those changes don't show up in this PR as its harder to understand, thx.

will-moore commented 1 year ago

cc @joshmoore @sbesson - you guys OK with this proposal?

sbesson commented 1 year ago

At least from my side, I don't have a compelling reason to use the singular unit form. From a specification perspective, I have no objection to this proposal if the community feels it brings value. Two comments:

d-v-b commented 1 year ago

@constantinpape The main goal here is harmonizing our metadata with the CF conventions, and they use units for this field. I don't think the term "units" is at variance with usage of the term in english scientific writing -- I do commonly see "units" used to describe a single base unit.

But it's reasonable to ask "why is 'units' conventional, rather than just 'unit'". I suspect this because base units are often combined to form compound units, for which the plural form becomes natural. Thus, although the value of units is a string, we can think that string denoting "the combination of units for this axis". That combination may be simple, e.g. meters (or should it be meter 🤔 ), in which case there is just 1 unit, but an axis may also be associated with compound units, e.g. meter * seconds, which is the case for most confocal images if we want to be pedantic about it (because samples along each row and column are separated by space and time).

Alternatively, maybe "units" is used because an axis denotes an interval, and the length of that interval is measured as a number of ($unit)s? E.g., we say "10 meters" and not "10 meter"?

But these theories are just speculation on my part. I searched for some justification for the plural form in the climate science specifications and couldn't find anything. Ultimately, I'm willing to tolerate the pluralized word in exchange for consistency with climate metadata conventions.

d-v-b commented 1 year ago

@sbesson to your point about handling breaking changes, I modified this PR to a) support the use of the field unit, guaranteeing backwards compatibility, and b) include a suggestion to the effect that implementers of the spec should give users a deprecation warning when an element of axes has the field unit.

will-moore commented 1 year ago

It looks like the change to the schema has been lost?

"Units" makes sense to me because you tend to use the plural for describing units. E.g. "10 millimetres" the units are millimetres.

d-v-b commented 1 year ago

@will-moore thanks for spotting that issue with the schema, it should be fixed now

jbms commented 1 year ago

I don't really have an opinion about what the ideal name is, though I did just check and see that in the udunits library, the specific unit of "millimeters", etc. would be called a "unit", and "units" refers to the plural of that, i.e. multiple different units:

https://docs.unidata.ucar.edu/udunits/current/udunits2-base.xml https://docs.unidata.ucar.edu/udunits/current/udunits2-derived.xml

It isn't particularly difficult to support this minor change in any given implementation, but it adds implementation complexity and likely will lead to incompatibility in the future, as some implementations may only support one of the names.

As far as I understand, this is attempting to use the same name as in the CF conventions spec, but does not actually aid compatibility with xarray or other software that follows CF conventions because xarray is actually using the name in a different place in the JSON metadata.

d-v-b commented 1 year ago

but does not actually aid compatibility with xarray or other software that follows CF conventions because xarray is actually using the name in a different place in the JSON metadata.

Empirically, using the exact same name as xarray / the cf conventions is quite convenient compared to using a name that is the same sans one letter. Xarray coordinate variables, as an in-memory representation, do not have JSON metadata so I'm not sure how to interpret this part. But regardless of where xarray, (or any other library that approximates the CF conventions) puts the word "units", as long as we agree that ome-ngff "unit" and xarray "units" describe the same concept, we should consider harmonizing the word we use for this concept, unless there is a good reason to diverge from CF conventions / xarray.

I will note this is not just a CF conventions / xarray thing. NIFTI and NRRD also use units.

It isn't particularly difficult to support this minor change in any given implementation, but it adds implementation complexity and likely will lead to incompatibility in the future, as some implementations may only support one of the names.

Regarding implementation complexity, I am aware of that, which is why this PR is backwards compatible (i.e., it allows both unit and units). I think this is the simplest way to amend the spec, but if there's an even simpler proposal that achieves the same aim, I am open to suggestions. But we shouldn't trade implementation complexity for user complexity -- diverging from other metadata standards adds needless complexity to users and programs that work in this ecosystem.

jbms commented 1 year ago

As far as xarray, I'm not too familiar with it, but I assumed that it has a way to load/save this attribute from zarr metadata. Perhaps that is not the case, and it is merely an identifier used in the xarray API.

From a quick look, I think "unit" is pretty much universally used as the singular term, as in SI base unit like "meter" or derived unit like "5 meters", and "units" is the plural form:

https://simple.wiktionary.org/wiki/unit

Since this attribute is specifying a single unit, not multiple units, the singular form would seem to make more sense. Note that in both NIFTI and NRRD, the "units" attribute is specifying more than one unit (in case of NRRD, a unit for each dimension, and for NIFTI, a unit for space and a unit for time), so the plural makes sense there.

d-v-b commented 1 year ago

Since this attribute is specifying a single unit, not multiple units, the singular form would seem to make more sense.

We discussed conventional english usage of the word "unit" and "units" earlier in the thread. English itself is inconsistent in when units of measure are pluralized. Since English already pluralizes units contextually, Is there a legitimate concern that people will be confused here?

d-v-b commented 1 year ago

An additional data point: DICOM uses MeasurementUnits to name the metadata describing the units associated with a measured quantity.

mkitti commented 1 year ago

If I was a saving a velocity field from my particle image velocimetry experiment, in meters per second, then the. I think units may apply in that case.

Since we may be mostly talking about pixel pitch for images, perhaps what we commonly mean is a distance measure per pixel, e.g. nanometers per pixel. Units would also be applicable here.

Regarding implementations and backwards compatibility, this is a pre-1.0 specification. Under most understandings of versioning schemes, including semantic versioning, there are few if any stability guarantees. In other words, now is the time to break things.

https://semver.org/#spec-item-4

jbms commented 1 year ago

Since this attribute is specifying a single unit, not multiple units, the singular form would seem to make more sense.

We discussed conventional english usage of the word "unit" and "units" earlier in the thread. English itself is inconsistent in when units of measure are pluralized. Since English already pluralizes units contextually, Is there a legitimate concern that people will be confused here?

I happened to see that stackexchange page as well, but it seems to be related to the question of whether a specific unit name should be singular or plural, e.g. "1 meter" vs "2 meters" vs "2-meter pole". I don't see anything on that page about whether the word "unit" itself should be singular or plural.

It is true that there are a lot of inconsistencies in English. But overall it seems to me that "unit" is probably at least somewhat preferred over "units" in terms of common usage for this particular case.

The spec does also reference UDUNITS so there may be some value in being consistent with the terminology there. According to udunits, a unit like "meters per second" or "0.7 meters per second" would be considered a "derived unit" (singular).

In any case, I'm not worried about confusion when a person is reading the metadata ---- the meaning will be obvious even without reading the specification.

If this change introduced actual technical compatibility with CF conventions software, I think there would be a reasonable argument to do it. As it is, though, it is merely making the names more similar but does not improve compatibility.

If some existing implementations of ome-zarr are already using units, that would also be a reasonable argument for supporting both.

Although supporting both would be simpler implementation-wise than having to condition on the version, if this change is to be made, I would actually strongly urge that both "unit" and "units" NOT be supported. As I see it, this type of metadata is the sort of thing that is likely to be generated and parsed directly by all sorts of adhoc scripts; each of these scripts can be thought of as OME-zarr implementations. Therefore as this format becomes more popular we can expect there may be a very large number of independent (partial) implementations, and I expect many of them will support only one of "unit" or "units", probably whichever was used in the dataset used as a reference when writing the code.

joshmoore commented 1 year ago

jbms commented [7 hours ago] (https://github.com/ome/ngff/pull/168#issuecomment-1592116992) if this change is to be made, I would actually strongly urge that both "unit" and "units" NOT be supported.

I'd second that, or if the support for the two was meant to be transitional, that we discuss that timeline.

mkitti commented [7 hours ago] (https://github.com/ome/ngff/pull/168#issuecomment-1592098366) Regarding implementations and backwards compatibility, this is a pre-1.0 specification. Under most understandings of versioning schemes, including semantic versioning, there are few if any stability guarantees. In other words, now is the time to break things.

In general, I'd agree that pre-1.0 is the time to break things if necessary, but as I mentioned to @d-v-b during the Zarr call, I would really urge us to also consider the costs (i.e. just because it's sub-1.0 doesn't mean we should break things).

For example, there was a mention that this was just a one letter change. I'll reference my well-worn story of making it possible to support "/" in addition to "." at the Zarr-level: that took over 6 months of work. I don't think this is nearly as complicated, but I would strongly urge anyone who is opening a PR to change the spec to engage with that overhead before asking the community to invest their time.

@d-v-b: you've brought up a few different reasons in the discussion above, can I suggest you update the description of this PR and summarize the various arguments pro and con? Also, if there's not a clear consensus on the value of changing right now, identifying a point in time when there's going to be a breaking change anyway, might help to draw support.

d-v-b commented 1 year ago

I'd second that, or if the support for the two was meant to be transitional, that we discuss that timeline.

The goal was to make the support for both unit and units transitional, but I'm happy making a hard break from unit to units, if people prefer that. Are we explicit anywhere about what major vs minor versions mean with respect to backwards compatibility?

In general, I'd agree that pre-1.0 is the time to break things if necessary, but as I mentioned to @d-v-b during the Zarr call, I would really urge us to also consider the costs (i.e. just because it's sub-1.0 doesn't mean we should break things).

I'm not proposing this change for fun; I think it materially improves the spec. The cost of changing the spec and the cost of enduring warts in the spec both increase with time, so it's imperative that we change things earlier rather than later... assuming we agree on the change 😅

@d-v-b: you've brought up a few different reasons in the discussion above, can I suggest you update the description of this PR and summarize the various arguments pro and con? Also, if there's not a clear consensus on the value of changing right now, identifying a point in time when there's going to be a breaking change anyway, might help to draw support.

That's a good idea. I will post a summary of pros and cons here, and also add it to the description:

Arguments in favor of keeping unit:

Arguments in favor of switching to units:

d-v-b commented 1 year ago

What do we need to get this closed? Should I rebase on top of the coordinate transformations PR?

d-v-b commented 1 year ago

After discussion over in zulip, it's clear that this proposal does not add enough value to warrant inclusion in the spec.