Closed d-v-b closed 1 year ago
I think I have a slight preference for "units", which is probably why I used units
incorrectly in the schemas - Fixed by #157.
That PR should still be valid, since for v0.4 we did have "unit".
So if we switch back to "units" for v0.5 then we'd need another PR to switch the schemas back again!?
Is there a blocker here or can we go ahead and merge this?
Since #157 is merged now, could you rebase/merge so that those changes don't show up in this PR as its harder to understand, thx.
cc @joshmoore @sbesson - you guys OK with this proposal?
At least from my side, I don't have a compelling reason to use the singular unit
form. From a specification perspective, I have no objection to this proposal if the community feels it brings value.
Two comments:
axes
specification and will need to be updated with this change included /cc @bogovicj @constantinpape The main goal here is harmonizing our metadata with the CF conventions, and they use units
for this field. I don't think the term "units" is at variance with usage of the term in english scientific writing -- I do commonly see "units" used to describe a single base unit.
But it's reasonable to ask "why is 'units' conventional, rather than just 'unit'". I suspect this because base units are often combined to form compound units, for which the plural form becomes natural. Thus, although the value of units
is a string, we can think that string denoting "the combination of units for this axis". That combination may be simple, e.g. meters
(or should it be meter
🤔 ), in which case there is just 1 unit, but an axis may also be associated with compound units, e.g. meter * seconds
, which is the case for most confocal images if we want to be pedantic about it (because samples along each row and column are separated by space and time).
Alternatively, maybe "units" is used because an axis denotes an interval, and the length of that interval is measured as a number of ($unit)s? E.g., we say "10 meters" and not "10 meter"?
But these theories are just speculation on my part. I searched for some justification for the plural form in the climate science specifications and couldn't find anything. Ultimately, I'm willing to tolerate the pluralized word in exchange for consistency with climate metadata conventions.
@sbesson to your point about handling breaking changes, I modified this PR to a) support the use of the field unit
, guaranteeing backwards compatibility, and b) include a suggestion to the effect that implementers of the spec should give users a deprecation warning when an element of axes
has the field unit
.
It looks like the change to the schema has been lost?
"Units" makes sense to me because you tend to use the plural for describing units. E.g. "10 millimetres" the units are millimetres.
@will-moore thanks for spotting that issue with the schema, it should be fixed now
I don't really have an opinion about what the ideal name is, though I did just check and see that in the udunits library, the specific unit of "millimeters", etc. would be called a "unit", and "units" refers to the plural of that, i.e. multiple different units:
https://docs.unidata.ucar.edu/udunits/current/udunits2-base.xml https://docs.unidata.ucar.edu/udunits/current/udunits2-derived.xml
It isn't particularly difficult to support this minor change in any given implementation, but it adds implementation complexity and likely will lead to incompatibility in the future, as some implementations may only support one of the names.
As far as I understand, this is attempting to use the same name as in the CF conventions spec, but does not actually aid compatibility with xarray or other software that follows CF conventions because xarray is actually using the name in a different place in the JSON metadata.
but does not actually aid compatibility with xarray or other software that follows CF conventions because xarray is actually using the name in a different place in the JSON metadata.
Empirically, using the exact same name as xarray / the cf conventions is quite convenient compared to using a name that is the same sans one letter. Xarray coordinate variables, as an in-memory representation, do not have JSON metadata so I'm not sure how to interpret this part. But regardless of where xarray, (or any other library that approximates the CF conventions) puts the word "units", as long as we agree that ome-ngff "unit" and xarray "units" describe the same concept, we should consider harmonizing the word we use for this concept, unless there is a good reason to diverge from CF conventions / xarray.
I will note this is not just a CF conventions / xarray thing. NIFTI and NRRD also use units
.
It isn't particularly difficult to support this minor change in any given implementation, but it adds implementation complexity and likely will lead to incompatibility in the future, as some implementations may only support one of the names.
Regarding implementation complexity, I am aware of that, which is why this PR is backwards compatible (i.e., it allows both unit
and units
). I think this is the simplest way to amend the spec, but if there's an even simpler proposal that achieves the same aim, I am open to suggestions. But we shouldn't trade implementation complexity for user complexity -- diverging from other metadata standards adds needless complexity to users and programs that work in this ecosystem.
As far as xarray, I'm not too familiar with it, but I assumed that it has a way to load/save this attribute from zarr metadata. Perhaps that is not the case, and it is merely an identifier used in the xarray API.
From a quick look, I think "unit" is pretty much universally used as the singular term, as in SI base unit like "meter" or derived unit like "5 meters", and "units" is the plural form:
https://simple.wiktionary.org/wiki/unit
Since this attribute is specifying a single unit, not multiple units, the singular form would seem to make more sense. Note that in both NIFTI and NRRD, the "units" attribute is specifying more than one unit (in case of NRRD, a unit for each dimension, and for NIFTI, a unit for space and a unit for time), so the plural makes sense there.
Since this attribute is specifying a single unit, not multiple units, the singular form would seem to make more sense.
We discussed conventional english usage of the word "unit" and "units" earlier in the thread. English itself is inconsistent in when units of measure are pluralized. Since English already pluralizes units contextually, Is there a legitimate concern that people will be confused here?
An additional data point: DICOM uses MeasurementUnits
to name the metadata describing the units associated with a measured quantity.
If I was a saving a velocity field from my particle image velocimetry experiment, in meters per second, then the. I think units may apply in that case.
Since we may be mostly talking about pixel pitch for images, perhaps what we commonly mean is a distance measure per pixel, e.g. nanometers per pixel. Units would also be applicable here.
Regarding implementations and backwards compatibility, this is a pre-1.0 specification. Under most understandings of versioning schemes, including semantic versioning, there are few if any stability guarantees. In other words, now is the time to break things.
Since this attribute is specifying a single unit, not multiple units, the singular form would seem to make more sense.
We discussed conventional english usage of the word "unit" and "units" earlier in the thread. English itself is inconsistent in when units of measure are pluralized. Since English already pluralizes units contextually, Is there a legitimate concern that people will be confused here?
I happened to see that stackexchange page as well, but it seems to be related to the question of whether a specific unit name should be singular or plural, e.g. "1 meter" vs "2 meters" vs "2-meter pole". I don't see anything on that page about whether the word "unit" itself should be singular or plural.
It is true that there are a lot of inconsistencies in English. But overall it seems to me that "unit" is probably at least somewhat preferred over "units" in terms of common usage for this particular case.
The spec does also reference UDUNITS so there may be some value in being consistent with the terminology there. According to udunits, a unit like "meters per second" or "0.7 meters per second" would be considered a "derived unit" (singular).
In any case, I'm not worried about confusion when a person is reading the metadata ---- the meaning will be obvious even without reading the specification.
If this change introduced actual technical compatibility with CF conventions software, I think there would be a reasonable argument to do it. As it is, though, it is merely making the names more similar but does not improve compatibility.
If some existing implementations of ome-zarr are already using units
, that would also be a reasonable argument for supporting both.
Although supporting both would be simpler implementation-wise than having to condition on the version, if this change is to be made, I would actually strongly urge that both "unit" and "units" NOT be supported. As I see it, this type of metadata is the sort of thing that is likely to be generated and parsed directly by all sorts of adhoc scripts; each of these scripts can be thought of as OME-zarr implementations. Therefore as this format becomes more popular we can expect there may be a very large number of independent (partial) implementations, and I expect many of them will support only one of "unit" or "units", probably whichever was used in the dataset used as a reference when writing the code.
jbms commented [7 hours ago] (https://github.com/ome/ngff/pull/168#issuecomment-1592116992) if this change is to be made, I would actually strongly urge that both "unit" and "units" NOT be supported.
I'd second that, or if the support for the two was meant to be transitional, that we discuss that timeline.
mkitti commented [7 hours ago] (https://github.com/ome/ngff/pull/168#issuecomment-1592098366) Regarding implementations and backwards compatibility, this is a pre-1.0 specification. Under most understandings of versioning schemes, including semantic versioning, there are few if any stability guarantees. In other words, now is the time to break things.
In general, I'd agree that pre-1.0 is the time to break things if necessary, but as I mentioned to @d-v-b during the Zarr call, I would really urge us to also consider the costs (i.e. just because it's sub-1.0 doesn't mean we should break things).
For example, there was a mention that this was just a one letter change. I'll reference my well-worn story of making it possible to support "/" in addition to "." at the Zarr-level: that took over 6 months of work. I don't think this is nearly as complicated, but I would strongly urge anyone who is opening a PR to change the spec to engage with that overhead before asking the community to invest their time.
@d-v-b: you've brought up a few different reasons in the discussion above, can I suggest you update the description of this PR and summarize the various arguments pro and con? Also, if there's not a clear consensus on the value of changing right now, identifying a point in time when there's going to be a breaking change anyway, might help to draw support.
I'd second that, or if the support for the two was meant to be transitional, that we discuss that timeline.
The goal was to make the support for both unit
and units
transitional, but I'm happy making a hard break from unit
to units
, if people prefer that. Are we explicit anywhere about what major vs minor versions mean with respect to backwards compatibility?
In general, I'd agree that pre-1.0 is the time to break things if necessary, but as I mentioned to @d-v-b during the Zarr call, I would really urge us to also consider the costs (i.e. just because it's sub-1.0 doesn't mean we should break things).
I'm not proposing this change for fun; I think it materially improves the spec. The cost of changing the spec and the cost of enduring warts in the spec both increase with time, so it's imperative that we change things earlier rather than later... assuming we agree on the change 😅
@d-v-b: you've brought up a few different reasons in the discussion above, can I suggest you update the description of this PR and summarize the various arguments pro and con? Also, if there's not a clear consensus on the value of changing right now, identifying a point in time when there's going to be a breaking change anyway, might help to draw support.
That's a good idea. I will post a summary of pros and cons here, and also add it to the description:
Arguments in favor of keeping unit
:
unit
over units
(e.g.). But others prefer units
over unit
, e.g..unit
in its reference xml filesunit
becomes pluralized to units
, what about the other fields of axis
? e.g., should name
become names
, and type
become types
? Arguments in favor of switching to units
:
units
is more future-proof. If this spec ever allows axis objects to be multi-dimensional, or more generally allows describing the units of multidimensional quantities, then the units
field could evolve to a list of udunits-compatible strings without needing to change the name of the field itself. The singular unit
fails this test (this might be why CF uses units
, but i'm not sure).What do we need to get this closed? Should I rebase on top of the coordinate transformations PR?
Change
unit
tounits
.units
is the name used by the CF Conventions, and thereby the broader climate science community -- the xarray ecosystem uses theunits
keyword (see https://docs.xarray.dev/en/stable/examples/multidimensional-coords.html, and https://pint-xarray.readthedocs.io/en/stable/creation.html).Unless there's a very compelling reason for the singular
unit
, we should probably aim for compatibility with the climate science community (and their nice tools) and useunits
.I also added clarification that the
name
andtype
attributes should be strings.I thought I would need to change the schema, but it looks like
units
is already in there?(Update, June 20, 2023): this issue has generated a lively discussion and @joshmoore requested that I summarize the pros and cons of the proposed change:
Arguments in favor of keeping
unit
:unit
overunits
(e.g.). But others preferunits
overunit
, e.g..unit
in its reference xml filesunit
becomes pluralized tounits
, what about the other fields ofaxis
? e.g., shouldname
becomenames
, andtype
becometypes
?Arguments in favor of switching to
units
:units
is more future-proof. If this spec ever allows axis objects to be multi-dimensional, or more generally allows describing the units of multidimensional quantities, then theunits
field could evolve to a list of udunits-compatible strings without needing to change the name of the field itself. The singularunit
fails this test (this might be why CF usesunits
, but i'm not sure).