Open schwehr opened 3 years ago
Anything is possible. Are you picturing a keyword that does validation in some way, or is purely annotative?
At the schema level, I am only thinking annotative. I think it's up to an application to decide how to use the unit annotation should it exist. The most common case is likely annotating visualizations of data with the units. Perhaps the user could turn to something like udunits to do a conversion. Unit conversion and range checking are two heavyweight tasks that don't seem reasonable within the spec of the schema.
At the edge of reasonable would be to give a list of accepted strings. But that could be large and likely users in diverse fields will have units that are not in the list. e.g. scipy units doesn't have magnetic units found at IEEE Magnetics
This is from our custom schema type that will hopefully become JSON Schema based. The user's code could check the schema on a collection type and halt with an error if a speed isn't in the unit it expects, e.g. if a program only handles `m/s` but it finds `knots`.
"max_wind_kts": {
"description": "Maximum wind speed",
"type": "DOUBLE",
"unit": "knots"
},
"min_pressure": {
"description": "Minimum pressure",
"type": "DOUBLE",
"unit": "millibars"
},
"numEntries": {
"description": "Number of points for a particular hurricane",
"type": "DOUBLE"
},
"radii_ne_34kt": {
"description": "34 kt wind radii maximum extent in northeastern quadrant",
"type": "DOUBLE",
"unit": "nautical miles"
},
@schwehr This would make for an excellent extension vocabulary, now that those are supported :-) It's a common problem, as you note, but probably not quite fundamental enough to go into the main spec.
Also, if you want types like `DOUBLE`, you're going to need a new keyword for that as well. The `type` keyword is not extensible, and given how heavily used it is and how stable it has been for many years, the bar for changing it is very high.
I think `number` is what you're looking for instead of `DOUBLE`. Though if you're looking to declare precision, perhaps a `precision` keyword would be better suited.
{
"type": "number",
"precision": "double"
}
This would also live in your extension vocab.
Don't type and format provide enough flexibility for specifying the units?
A bit of a tangent, but JSON Schema's naming of `type` and `format` seems counter-intuitive to me. They'd be less confusing if they were reversed.
There are millions of types in the world. Yet the number of [serialization] formats is smaller. And if we talk about JSON, there are only a handful of ways something can be represented in JSON: string, number, boolean, array, object.
I think it would have been more intuitive to reverse the naming:
{
"type": "Nautical miles",
"format": "number"
}
{
"type": "Positive integer",
"format": "number"
}
{
"type": "E-mail address",
"format": "string"
}
{
"type": "JPEG image",
"format": "blob"
}
@Ark-kun `type` correlates to the JavaScript (the JS in JSON) notion of types. `format` has never been implemented consistently across implementations and should be replaced by coherent vocabularies of keywords.
I'm going to move this over to the vocabularies repo.
Being able to define a `unit` for time periods would be very useful:
uptime:
  type: integer
  unit: microseconds
  example: 50
Some possibly useful spec text here.
Like I said in https://github.com/OAI/OpenAPI-Specification/issues/2061#issuecomment-993795559, Unicode CLDR and ECMAScript `Intl` both define the same composable string format as well as a set of base units and prefixes to allow some form of "universal" unit specification like this. Both only do so with the main use case of text internationalisation and display unit conversion, but co-opting this into a JSON Schema vocabulary would be enough to cover the vast majority of use cases around data structure documentation as well.
As for unusual/domain-specific units like decibel-milliwatts or tons of TNT or parts-per-octet "percentages" and whatnot, the CLDR notation allows for private-use units starting with `xxx-` (formerly `x-`), allowing users to also define custom units where the stock ones aren't sufficient.
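To make the idea of a composable unit string concrete, here is a rough sketch of a parser for a deliberately tiny subset of that style: an optional `square-`/`cubic-` power, an optional prefix, a base unit, and at most one `-per-` denominator. The tables and the function names `parse_single` and `parse_unit` are assumptions made for illustration; the real CLDR grammar and unit lists are considerably larger, and this sketch does not claim to conform to them.

```python
# Tiny illustrative tables; the real lists are much longer.
PREFIXES = ("kilo", "centi", "milli", "micro")
BASE_UNITS = {"meter", "gram", "second", "hour", "kelvin", "pascal"}
POWERS = {"square": 2, "cubic": 3}


def parse_single(text):
    """Parse one unit such as 'square-kilometer' into (prefix, base, power)."""
    power = 1
    head, _, rest = text.partition("-")
    if head in POWERS:
        power, text = POWERS[head], rest
    if text.startswith("xxx-"):
        # Private-use unit: accepted as-is, no prefix handling.
        return ("", text, power)
    for prefix in PREFIXES:
        if text.startswith(prefix) and text[len(prefix):] in BASE_UNITS:
            return (prefix, text[len(prefix):], power)
    if text in BASE_UNITS:
        return ("", text, power)
    raise ValueError(f"unknown unit: {text!r}")


def parse_unit(identifier):
    """Parse 'numerator[-per-denominator]' into a list of (prefix, base, exponent)."""
    numerator, sep, denominator = identifier.partition("-per-")
    parts = [parse_single(numerator)]
    if sep:
        prefix, base, power = parse_single(denominator)
        parts.append((prefix, base, -power))  # denominators get negative exponents
    return parts
```

With these tables, `parse_unit("kilometer-per-hour")` yields a `kilo` + `meter` numerator and an `hour` denominator, while an unlisted name such as `"furlong"` raises `ValueError` unless it is written as a private-use unit like `"xxx-furlong"`.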
This is still an annotative-only concept. We have some ideas cooking that may help in defining annotative keywords, but it's on hold until we sort out a few other things.
As it stands, there's nothing preventing someone from just creating (and publishing) a new vocabulary that defines the `units` keyword, its valid values, and its intended use. The JSON Schema framework is already in place to support such a vocabulary.
It's good to see there's continued interest in this, though.
As it stands, there's nothing preventing someone from just creating (and publishing) a new vocabulary that defines the units keyword, its valid values, and its intended use. The JSON Schema framework is already in place to support such a vocabulary.
It's not even necessary to do that much for a purely annotative keyword, because the latest version of JSON Schema specifies that unknown keywords should be collected as annotations by default. If you're using JSON Schemas in the context of OpenAPI 3.1.x, however, you may need to define a custom meta-schema that allows for unknown keywords. That is a simple matter of creating the data file; no extra code is needed, nor additional support in implementations, as long as they are specification-conformant.
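As a rough illustration of how an application might consume such an annotation, the sketch below walks a schema and gathers every `unit` value it finds. This is a simplified static walk over `properties` and `items` written in plain Python, not the instance-dependent annotation collection a conformant validator performs; `collect_units` and the sample schema are hypothetical.

```python
def collect_units(schema, path=""):
    """Map each property path to its declared `unit`, walking nested subschemas."""
    if not isinstance(schema, dict):
        return {}  # boolean schemas carry no annotations
    units = {}
    if "unit" in schema:
        units[path or "/"] = schema["unit"]
    for name, subschema in schema.get("properties", {}).items():
        units.update(collect_units(subschema, f"{path}/{name}"))
    if "items" in schema:
        units.update(collect_units(schema["items"], f"{path}/*"))
    return units


# Hypothetical schema using the unknown keyword `unit` purely as an annotation.
SAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "uptime": {"type": "integer", "unit": "microseconds"},
        "wind": {
            "type": "object",
            "properties": {"speed": {"type": "number", "unit": "knots"}},
        },
        "name": {"type": "string"},
    },
}

print(collect_units(SAMPLE_SCHEMA))
```

The result maps `/uptime` to `microseconds` and `/wind/speed` to `knots`, which is enough for something like labelling plot axes.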
More standardization would be nice though, so that for example a UI can render `9124 seconds` as `2 hours, 32 minutes`. Deep integration like that can only happen with some standardization across implementations.
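A small sketch of the kind of rendering a UI could do once it can rely on a unit annotation. Note that 9124 seconds is 2 hours, 32 minutes and 4 seconds, so this version, which drops leftover seconds, produces `2 hours, 32 minutes`. The function names and the choice to recognise only the string `"seconds"` are assumptions for illustration.

```python
def humanize_seconds(total_seconds):
    """Render a number of seconds as hours and minutes, dropping leftover seconds."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes = remainder // 60
    parts = []
    if hours:
        parts.append(f"{hours} hour{'' if hours == 1 else 's'}")
    if minutes or not parts:
        parts.append(f"{minutes} minute{'' if minutes == 1 else 's'}")
    return ", ".join(parts)


def render(value, unit):
    """Format a value for display, using the unit annotation when it is understood."""
    if unit == "seconds":
        return humanize_seconds(value)
    return f"{value} {unit}"  # fall back to showing the raw unit string


print(render(9124, "seconds"))
```

Any unit the renderer does not recognise falls through unchanged, which is exactly where a shared list of unit names would help.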
Thanks all for the continued discussion. Units are still a serious pain point for me. This is my current list of units in the system I work on.
'%', '% (kg / kg)', '(kg/m^3)/(m/s)',
'-', # Meaning dimensionless?
'Alfalfa, mm', 'Class',
'Coefficient of Variation', 'DN',
'DU', 'Day',
'degree', 'Degree', 'Degrees', 'Degrees clockwise from North',
'Dimensionless', # Meaning no units?
# https://en.wikipedia.org/wiki/Dobson_unit
'Dobson units', 'Dobsons', 'dobsons',
'Equivalent gauges per 2.5 degree box',
'Hours',
'J/kg', 'J/m2', 'J/m^2/day', 'Julian Day',
'K', 'Kelvin',
'MW', 'Megawatts',
'Mg C/ha', 'Mg ha^-1', 'Mg/ha',
'Minutes', 'N/m^2', 'NFDRS fire danger index', 'Number of upstream pixels',
'Number per pixel', 'Pa', 'Pa/s', 'Percent', 'percent', 'Pixels', 'pixels', 'Reflectance factor',
'Seconds', 'W m**-2', 'W m-2',
'W m^-2 sr^-1 μm^-1',
'W/(m^2*sr*um)/ DN',
'W/m^2',
'W/m^2 SRµm',
'W/m^2 SRμm',
'cm', 'cmol(+)/kg', 'cms', 'count', 'counts/day',
'dB', 'days', 'deg true', 'degree C', 'degrees', 'fraction', 'g / kg',
'g/cc', 'g/cm^3', 'g/kg', 'g/m^2', 'g/m²', 'gC m-2 d-1', 'gigagrams',
'gpm', 'grass, mm', 'hPa', 'ha', 'hours', 'hours/sq. km', 'index',
'kPa', 'kg / m3', 'kg kg-1', 'kg m**-2', 'kg m**-3', 'kg m-2',
'kg m-2 s-1', 'kg m-3', 'kg*C/m^2', 'kg*C/m^2/16-day', 'kg*C/m^2/8-day',
'kg/(m^2)', 'kg/(m^2*s)', 'kg/(m^2/s)', 'kg/(m^3)', 'kg/kg', 'kg/m/s',
'kg/m^2', 'kg/m^2/8day', 'kg/m^2/s', 'kg/m^2/s^1', 'kg/m^2s', 'kg/m^3',
'km', 'km^2', 'm', 'm of water equivalent', 'm s-1', 'm/s', 'm3/m3',
'mW cm-2 µm-1 sr-1', 'm^2', 'm^2/m^2', 'm^3 m-3', 'meq/100g',
'meter/year', 'mg m-3', 'mg/m^3', 'millibars',
'min. into half hour', 'minutes', 'minutes/meter', 'mm',
'mm d-1', 'mm, daily total', 'mm/day', 'mm/hr', 'mm/pentad',
'mol mol-1', 'mol/mol',
'mol/m^2', 'ms', 'nanoWatts/cm2/sr',
'occurrence',
'ppbV', 'ppm', 'psu', 'radians', 'seconds', 'sq. meter/sq. meter',
'sr-1', 'ug m-3',
'°C',
'μm',
'mW cm-2 μm-1 sr-1',
'Quality Flag',
'm^2/m^3',
'1.0e15 molec cm-2',
'molec cm-2 s-1',
'm^2 s-2',
'Number of people/ha',
'm/s^2',
'm^3/m^3',
'J/m^2',
'MJ m^-2 day^-1',
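Several entries in a list like this are spelling variants of one another. Below is a hedged sketch of how an application might fold them together, assuming the variants really do denote the same unit (only the data owner can confirm that). The alias table is hypothetical and covers just a few of the spellings above.

```python
import unicodedata

# Exact-match alias table (hypothetical). Lookups stay case-sensitive on purpose:
# 'MW' (megawatts) and 'mW' (milliwatts) must not be merged.
ALIASES = {
    "Kelvin": "K",
    "Megawatts": "MW",
    "Degree": "degree", "Degrees": "degree", "degrees": "degree",
    "Dobson units": "DU", "Dobsons": "DU", "dobsons": "DU",
    "Percent": "%", "percent": "%",
    "W m**-2": "W/m^2", "W m-2": "W/m^2",
    "kg m**-3": "kg/m^3", "kg m-3": "kg/m^3", "kg / m3": "kg/m^3", "kg/(m^3)": "kg/m^3",
    "g/m2": "g/m^2",  # what 'g/m²' becomes after NFKC folding
}


def canonical_unit(raw):
    """Fold Unicode look-alikes with NFKC, then apply the alias table."""
    cleaned = unicodedata.normalize("NFKC", raw).strip()
    return ALIASES.get(cleaned, cleaned)
```

NFKC normalization maps the micro sign (U+00B5) to the Greek letter mu (U+03BC) and the superscript two to a plain `2`, so visually identical strings that differ only in those characters end up with the same canonical form.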
Given the solutions presented in various comments (create a vocab, use the fact that unknown keywords are collected as annotations), I'm not sure what you're asking from JSON Schema.
What are you expecting JSON Schema to do with a `unit` keyword?
What are you expecting JSON Schema to do with a unit keyword?
To define it in the spec with possible pointers to other documents/specs that standardize unit names. In many regards, `unit` would be similar to `format`, where the keyword is defined but the values are not strictly defined and are extensible. The SI base units could be mentioned specifically.
To define it in the spec with possible pointers to other documents/specs that standardize unit names.
This would be perfect for a vocabulary.
A vocabulary is basically just a spec. It can have any level of formality, from a spec-like format (e.g. my `data` vocab) to a blog post or wiki page.
Have a look at this page on vocabularies in Understanding JSON Schema. It actually uses `units` as an example.
The steps to create a custom vocabulary are:
1. Write a document that declares `units` as a keyword and defines what it means. This document should also define a vocabulary URI that can be used later.
2. Then you would create a custom meta-schema for the vocabulary that defines syntactic requirements of `units`. For example, if you only want the strings you list above, just put those values in an `enum` in the vocabulary meta-schema. This vocabulary meta-schema would also need a URI (different from the vocabulary URI) that will be used later.
3. Create a custom meta-schema that adds the vocabulary URI with a value of `true` to the `$vocabulary` keyword and contains an `allOf` with `$ref`s to the 2020-12 meta-schema URI and your meta-schema URI (from step 2).

To see all this laid out with examples, please see my library's vocabulary documentation.
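The steps above could be sketched roughly as follows, with the meta-schemas written as Python dictionaries and serialized to JSON. Every `example.com` URI is a hypothetical placeholder, the `enum` values are just a sample, and only a few of the standard 2020-12 vocabularies are listed under `$vocabulary`; a real dialect would list whichever ones it needs.

```python
import json

DRAFT = "https://json-schema.org/draft/2020-12"
VOCAB_URI = "https://example.com/vocab/units"                # step 1 (hypothetical)
VOCAB_META_URI = "https://example.com/meta/units-vocab"      # step 2 (hypothetical)
DIALECT_META_URI = "https://example.com/meta/units-dialect"  # step 3 (hypothetical)

# Step 2: meta-schema describing the syntax of the `units` keyword.
vocab_meta_schema = {
    "$schema": f"{DRAFT}/schema",
    "$id": VOCAB_META_URI,
    "$dynamicAnchor": "meta",
    "type": ["object", "boolean"],
    "properties": {"units": {"enum": ["m/s", "knots", "millibars"]}},
}

# Step 3: dialect meta-schema that switches the vocabulary on and
# combines the standard meta-schema with the one from step 2.
dialect_meta_schema = {
    "$schema": f"{DRAFT}/schema",
    "$id": DIALECT_META_URI,
    "$dynamicAnchor": "meta",
    "$vocabulary": {
        f"{DRAFT}/vocab/core": True,
        f"{DRAFT}/vocab/applicator": True,
        f"{DRAFT}/vocab/validation": True,
        f"{DRAFT}/vocab/meta-data": True,
        VOCAB_URI: True,
    },
    "allOf": [{"$ref": f"{DRAFT}/schema"}, {"$ref": VOCAB_META_URI}],
}

# A schema opts in by naming the dialect meta-schema in `$schema`.
schema = {"$schema": DIALECT_META_URI, "type": "number", "units": "knots"}

print(json.dumps(dialect_meta_schema, indent=2))
```

Whether a given implementation honours the custom vocabulary still depends on that implementation; the dictionaries here only show the shape of the documents.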
- Then you would create a custom meta-schema for the vocabulary that defines syntactic requirements of `units`. For example, if you only want the strings you list above, just put those values in an `enum` in the vocabulary meta-schema. This vocabulary meta-schema would also need a URI (different from the vocabulary URI) that will be used later.
What if one wanted to create some kind of composable string format for units, matching the style of (or some subset of) Unicode CLDR?
To put the problem differently, if I wanted to validate compiler target triplets, could I define my vocabulary so this arbitrary string format (where there is a limited amount of slots, each with a limited number of possible values) would be somehow encoded in the machine-readable document, or would the format of that string be completely up to human interpretation? I'm sure a very large enum might be an option, but that sounds inefficient, and I'm wondering if JSON Schema doesn't have anything like Typescript's composable literals for this sort of thing.
Sorry if this is an obvious question, I'm still trying to grasp how vocabularies fit into the bigger picture and what role they fill, and the linked json-everything documentation page is only partly clearing things up.
What if one wanted to create some kind of composable string format for units, matching the style of (or some subset of) Unicode CLDR?
You can do this if you create a custom `units` keyword, but validation of the content of that keyword could only be done by the meta-schema if you use a regex (`pattern` keyword). I imagine that regex would be pretty nasty.
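To give a feel for how quickly such a regex grows, here is a sketch that assembles a `pattern` from small, hypothetical lists of prefixes and base units. It covers only an optional `square-`/`cubic-` power, an optional prefix, a base unit or `xxx-` private-use name, and one optional `-per-` denominator. Python's `re` module is used here just to try the pattern out; JSON Schema implementations evaluate `pattern` with their own regex engines.

```python
import re

# Hypothetical, heavily shortened lists; real ones would have dozens of entries.
PREFIXES = ["kilo", "centi", "milli", "micro"]
BASES = ["meter", "gram", "second", "hour"]

single = rf"(?:(?:square|cubic)-)?(?:{'|'.join(PREFIXES)})?(?:{'|'.join(BASES)})"
private = r"xxx-[a-z0-9]+"
unit = rf"(?:{single}|{private})"
PATTERN = rf"^{unit}(?:-per-{unit})?$"

# Fragment that a vocabulary meta-schema could use to constrain `units`.
meta_schema_fragment = {
    "properties": {"units": {"type": "string", "pattern": PATTERN}}
}

for candidate in ["kilometer-per-hour", "xxx-dbm", "knots"]:
    print(candidate, bool(re.search(PATTERN, candidate)))
```

Even with four prefixes and four base units the expanded pattern is already long, and each new rule (multiple numerator terms, more powers) multiplies it further.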
If you really want to get into the weeds, you can create your own variant of the `pattern` keyword (e.g. `cldr`) that validates your composable strings with a second new vocab and use that in your meta-schema's meta-schema. Then you would use that as a validation of `units`, but then you're creating a validation keyword (`cldr`) and an annotation keyword (`units`), each in their own vocabs, each with their own meta-schemas, not to mention the extensions you'd need to code for the implementations you're using.
Can it be done? Probably, yeah. But it's not a simple exercise. JSON Schema isn't designed to validate string composition.
I think the better bet is to accept the value of `units` as an annotation and just validate its value in code. Either way, you're still writing the code to perform the validation (whether in application code or implementation extension code). I think the only reason to go through the exercise of building the vocabularies and meta-schemas is so that it can be reused.
Is it possible to add an optional `units` to the schemas? For scientific data, that can be critical information to declare for a field, and the schema seems a reasonable place to state the expectation for the value. It doesn't have to be more than the string; some languages bring conversion infrastructure along, but that seems out of scope.
Related, but a lot of these are beyond scope of a schema: