json-schema-org / json-schema-vocabularies

Experimental vocabularies under consideration for standardization

Units for json schemas #46

Open schwehr opened 3 years ago

schwehr commented 3 years ago

Is it possible to add an optional units keyword to schemas? For scientific data, units can be critical information to declare for a field, and the schema seems a reasonable place to state the expectation for the value.

It doesn't have to be more than a string. Some languages bring conversion infrastructure along, but that seems out of scope.

Related, but a lot of these are beyond scope of a schema:

karenetheridge commented 3 years ago

Anything is possible. Are you picturing a keyword that does validation in some way, or is purely annotative?

schwehr commented 3 years ago

At the schema level, I am only thinking annotative. I think it's up to an application to decide how to use the unit annotation, should it exist. The most common case is likely annotating visualizations of data with the units. Perhaps the user could turn to something like udunits to do a conversion. Unit conversion and range checking are two heavyweight tasks that don't seem reasonable to include in the schema spec.

At the edge of reasonable would be to give a list of accepted strings. But that could be large and likely users in diverse fields will have units that are not in the list. e.g. scipy units doesn't have magnetic units found at IEEE Magnetics

This is from our custom schema type that will hopefully become json-schema based. The user's code could check the schema on a collection type and halt with an error if the unit isn't what it expects, e.g. if a program only handles m/s for speed but finds knots.

      "max_wind_kts": {
        "description": "Maximum wind speed",
        "type": "DOUBLE",
        "unit": "knots"
      },
      "min_pressure": {
        "description": "Minimum pressure",
        "type": "DOUBLE",
        "unit": "millibars"
      },
      "numEntries": {
        "description": "Number of points for a particular hurricane",
        "type": "DOUBLE"
      },
      "radii_ne_34kt": {
        "description": "34 kt wind radii maximum extent in northeastern quadrant",
        "type": "DOUBLE",
        "unit": "nautical miles"
      },
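
For illustration, here is a minimal Python sketch of the check described above. The `schema` dict mirrors part of the fragment; `require_unit` and `UnitMismatchError` are hypothetical names invented for this example, not part of any library.

```python
# Hypothetical schema fragment, mirroring the custom format shown above.
schema = {
    "max_wind_kts": {"description": "Maximum wind speed", "type": "DOUBLE", "unit": "knots"},
    "min_pressure": {"description": "Minimum pressure", "type": "DOUBLE", "unit": "millibars"},
    "numEntries": {"description": "Number of points for a particular hurricane", "type": "DOUBLE"},
}


class UnitMismatchError(ValueError):
    """Raised when a field's declared unit is not the one the program handles."""


def require_unit(schema, field, expected):
    """Halt with an error unless `field` declares exactly the `expected` unit."""
    declared = schema[field].get("unit")  # None when the annotation is absent
    if declared != expected:
        raise UnitMismatchError(
            f"{field}: expected unit {expected!r}, schema declares {declared!r}"
        )


require_unit(schema, "min_pressure", "millibars")  # passes silently

try:
    require_unit(schema, "max_wind_kts", "m/s")  # this program only handles m/s
except UnitMismatchError as err:
    print(err)
```

The comparison is a plain string match, which keeps conversion and range checking out of the schema layer, as suggested above.
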
handrews commented 3 years ago

@schwehr This would make for an excellent extension vocabulary, now that those are supported :-) It's a common problem, as you note, but probably not quite fundamental enough to go into the main spec.

Also, if you want types like DOUBLE, you're going to need a new keyword for that as well. The type keyword is not extensible, and given how heavily used it is and how stable it has been for many years, the bar for changing it is very high.

gregsdennis commented 3 years ago

I think number is what you're looking for instead of DOUBLE. Though if you're looking to declare precision, perhaps a precision keyword would be better suited.

{
  "type": "number",
  "precision": "double"
}

This would also live in your extension vocab.

Ark-kun commented 3 years ago

Don't type and format provide enough flexibility for specifying the units?

A bit of a tangent, but JSON Schema's naming of type and format seems counter-intuitive to me. They'd be less confusing if they were reversed.

There are millions of types in the world. Yet the number of [serialization] formats is smaller. And if we talk about JSON, there are only a handful of ways something can be represented in JSON: string, number, boolean, array, object.

I think it would have been more intuitive to reverse the naming:

{
    "type": "Nautical miles",
    "format": "number"
}
{
    "type": "Positive integer",
    "format": "number"
}
{
    "type": "E-mail address",
    "format": "string"
}
{
    "type": "JPEG image",
    "format": "blob"
}
handrews commented 3 years ago

@Ark-kun type correlates to the JavaScript (the JS in JSON) notion of types. format has never been implemented consistently across implementations and should be replaced by coherent vocabularies of keywords.

handrews commented 2 years ago

I'm going to move this over to the vocabularies repo.

silverwind commented 1 year ago

Being able to define a unit for time periods would be very useful:

uptime:
  type: integer
  unit: microseconds
  example: 50

Some possibly useful spec text here.

sm-Fifteen commented 1 year ago

Like I said in https://github.com/OAI/OpenAPI-Specification/issues/2061#issuecomment-993795559, Unicode CLDR and ECMAScript Intl both define the same composable string format as well as a set of base units and prefixes to allow some form of "universal" unit specification like this. Both only do so with the main use case of text internationalisation and display unit conversion, but co-opting this into a JSON Schema vocabulary would be enough to cover the vast majority of use cases around data structure documentation as well.

As for unusual/domain-specific units like decibel-milliwatts or tons of TNT or parts-per-octet "percentages" and whatnot, the CLDR notation allows for private-use units starting with xxx- (formerly x-), allowing users to also define custom units where the stock ones aren't sufficient.

gregsdennis commented 1 year ago

This is still an annotative-only concept. We have some ideas cooking that may help in defining annotative keywords, but it's on hold until we sort out a few other things.

As it stands, there's nothing preventing someone from just creating (and publishing) a new vocabulary that defines the units keyword, its valid values, and its intended use. The JSON Schema framework is already in place to support such a vocabulary.

It's good to see there's continued interest in this, though.

karenetheridge commented 1 year ago

> As it stands, there's nothing preventing someone from just creating (and publishing) a new vocabulary that defines the units keyword, its valid values, and its intended use. The JSON Schema framework is already in place to support such a vocabulary.

It's not even necessary to do that much for a purely annotative keyword, because the latest version of JSON Schema specifies that unknown keywords should be collected as annotations by default. If you're using JSON Schemas in the context of OpenAPI 3.1.x, you may need to define a custom metaschema that allows for unknown keywords. That is a simple matter of creating the data file; no extra code is needed, nor additional support in implementations, as long as they are specification-conformant.
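
As a rough illustration of consuming such an annotation, the Python sketch below statically walks a schema and gathers every `unit` keyword it finds. This is only an approximation: a conformant implementation collects annotations per instance location while evaluating an instance, and the exact API for that varies by library. `collect_units` and the example schema are hypothetical.

```python
def collect_units(schema, path=""):
    """Return {location: unit} for subschemas under `properties`/`items` that carry `unit`."""
    found = {}
    if not isinstance(schema, dict):  # boolean schemas carry no annotations
        return found
    if "unit" in schema:
        found[path or "/"] = schema["unit"]
    for name, sub in schema.get("properties", {}).items():
        found.update(collect_units(sub, f"{path}/{name}"))
    if isinstance(schema.get("items"), dict):
        found.update(collect_units(schema["items"], f"{path}/*"))
    return found


# Hypothetical schema; `unit` is an unknown keyword as far as the core spec is concerned.
example_schema = {
    "type": "object",
    "properties": {
        "uptime": {"type": "integer", "unit": "microseconds"},
        "gusts": {"type": "array", "items": {"type": "number", "unit": "knots"}},
        "name": {"type": "string"},
    },
}

units = collect_units(example_schema)
print(units)
```
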

silverwind commented 1 year ago

More standardization would be nice though, so that, for example, a UI can render 9124 seconds as 2 hours, 32 minutes. Deep integration like that can only happen with some standardization across implementations.
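
A small Python sketch of the kind of rendering a UI could do once it knows a value is in seconds. `humanize_seconds` is a hypothetical helper written for this example, not an existing library function.

```python
def humanize_seconds(total_seconds):
    """Render a non-negative whole number of seconds as hours, minutes, seconds."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    parts = []
    for value, name in ((hours, "hour"), (minutes, "minute"), (seconds, "second")):
        if value:  # skip zero-valued components
            parts.append(f"{value} {name}{'' if value == 1 else 's'}")
    return ", ".join(parts) or "0 seconds"


print(humanize_seconds(9124))  # 2 hours, 32 minutes, 4 seconds
```
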

schwehr commented 1 year ago

Thanks all for the continued discussion. Units are still a serious pain point for me. This is my current list of units in the system I work on.

    '%', '% (kg / kg)', '(kg/m^3)/(m/s)',
    '-',  # Meaning dimensionless?
    'Alfalfa, mm', 'Class',
    'Coefficient of Variation', 'DN',
    'DU', 'Day',
    'degree', 'Degree', 'Degrees', 'Degrees clockwise from North',
    'Dimensionless',  # Meaning no units?
    # https://en.wikipedia.org/wiki/Dobson_unit
    'Dobson units', 'Dobsons', 'dobsons',
    'Equivalent gauges per 2.5 degree box',
    'Hours',
    'J/kg', 'J/m2', 'J/m^2/day', 'Julian Day',
    'K', 'Kelvin',
    'MW', 'Megawatts',
    'Mg C/ha', 'Mg ha^-1', 'Mg/ha',
    'Minutes', 'N/m^2', 'NFDRS fire danger index', 'Number of upstream pixels',
    'Number per pixel', 'Pa', 'Pa/s', 'Percent', 'percent', 'Pixels', 'pixels', 'Reflectance factor',
    'Seconds', 'W m**-2', 'W m-2',
    'W m^-2 sr^-1 μm^-1',
    'W/(m^2*sr*um)/ DN',
    'W/m^2',
    'W/m^2 SRµm',
    'W/m^2 SRμm',
    'cm', 'cmol(+)/kg', 'cms', 'count', 'counts/day',
    'dB', 'days', 'deg true', 'degree C', 'degrees', 'fraction', 'g / kg',
    'g/cc', 'g/cm^3', 'g/kg', 'g/m^2', 'g/m²', 'gC m-2 d-1', 'gigagrams',
    'gpm', 'grass, mm', 'hPa', 'ha', 'hours', 'hours/sq. km', 'index',
    'kPa', 'kg / m3', 'kg kg-1', 'kg m**-2', 'kg m**-3', 'kg m-2',
    'kg m-2 s-1', 'kg m-3', 'kg*C/m^2', 'kg*C/m^2/16-day', 'kg*C/m^2/8-day',
    'kg/(m^2)', 'kg/(m^2*s)', 'kg/(m^2/s)', 'kg/(m^3)', 'kg/kg', 'kg/m/s',
    'kg/m^2', 'kg/m^2/8day', 'kg/m^2/s', 'kg/m^2/s^1', 'kg/m^2s', 'kg/m^3',
    'km', 'km^2', 'm', 'm of water equivalent', 'm s-1', 'm/s', 'm3/m3',
    'mW cm-2 µm-1 sr-1', 'm^2', 'm^2/m^2', 'm^3 m-3', 'meq/100g',
    'meter/year', 'mg m-3', 'mg/m^3', 'millibars',
    'min. into half hour', 'minutes', 'minutes/meter', 'mm',
    'mm d-1', 'mm, daily total', 'mm/day', 'mm/hr', 'mm/pentad',
    'mol mol-1', 'mol/mol',
    'mol/m^2', 'ms', 'nanoWatts/cm2/sr',
    'occurrence',
    'ppbV', 'ppm', 'psu', 'radians', 'seconds', 'sq. meter/sq. meter',
    'sr-1', 'ug m-3',
    '°C',
    'μm',
    'mW cm-2 μm-1 sr-1',
    'Quality Flag',
    'm^2/m^3',
    '1.0e15 molec cm-2',
    'molec cm-2 s-1',
    'm^2 s-2',
    'Number of people/ha', 
    'm/s^2', 
    'm^3/m^3', 
    'J/m^2', 
    'MJ m^-2 day^-1', 
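
The list above contains several spellings of what look like the same unit (for example 'Dobson units', 'Dobsons', 'dobsons'). As a rough sketch, an application could fold such variants together with an explicit alias table. The canonical spellings chosen below are illustrative assumptions, not a standard, and `normalize_unit` is a hypothetical helper. Case is deliberately not folded, since 'MW' and 'mW' are different units.

```python
# Hypothetical alias table: the right-hand spellings are arbitrary illustrative choices.
ALIASES = {
    "Kelvin": "K",
    "Degree": "degree", "Degrees": "degree", "degrees": "degree",
    "Dobson units": "DU", "Dobsons": "DU", "dobsons": "DU",
    "Percent": "%", "percent": "%",
    "Megawatts": "MW",
    "W m**-2": "W/m^2", "W m-2": "W/m^2",
}

MICRO_SIGN = "\u00b5"  # the compatibility character
GREEK_MU = "\u03bc"    # the Greek small letter mu


def normalize_unit(raw):
    """Map a raw unit string to an illustrative canonical spelling."""
    text = raw.strip().replace(MICRO_SIGN, GREEK_MU)  # unify the two look-alike characters
    return ALIASES.get(text, text)  # unknown units pass through unchanged


print(normalize_unit("Dobsons"))
print(normalize_unit("W m-2"))
```
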
gregsdennis commented 1 year ago

Given the solutions presented in various comments (create a vocab, use the fact that unknown keywords are collected as annotations), I'm not sure what you're asking from JSON Schema.

What are you expecting JSON Schema to do with a unit keyword?

silverwind commented 1 year ago

> What are you expecting JSON Schema to do with a unit keyword?

To define it in the spec with possible pointers to other documents/specs that standardize unit names. In many regards, unit would be similar to format, where the keyword is defined but its values are not strictly defined and remain extensible. The SI base units could be mentioned specifically.

gregsdennis commented 1 year ago

> To define it in the spec with possible pointers to other documents/specs that standardize unit names.

This would be perfect for a vocabulary.

A vocabulary is basically just a spec. It can have any level of formality, from a spec-like format (e.g. my data vocab) to a blog post or wiki page.

gregsdennis commented 1 year ago

Have a look at this page on vocabularies in Understanding JSON Schema. It actually uses units as an example.

The steps to create a custom vocabulary are:

  1. Create a document for your vocabulary that declares units as a keyword and defines what it means. This document should also define a vocabulary URI that can be used later.
  2. Then you would create a custom meta-schema for the vocabulary that defines syntactic requirements of units. For example, if you only want the strings you list above, just put those values in an enum in the vocabulary meta-schema. This vocabulary meta-schema would also need a URI (different from the vocabulary URI) that will be used later.
  3. Lastly create a new meta-schema that incorporates the 2020-12 meta-schema and your new meta-schema. I suggest copying the 2020-12 meta-schema and:
    • adding your vocabulary URI (from step 1) with a value of true to the $vocabulary keyword.
    • replacing the contents of the allOf with $refs to the 2020-12 meta-schema URI and your meta-schema URI (from step 2)

To see all this laid out with examples, please see my library's vocabulary documentation.
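
For a rough idea of what steps 2 and 3 produce, here is a sketch that builds both meta-schemas as Python dicts and prints them as JSON. All `https://example.com/...` URIs are hypothetical placeholders, and the short enum stands in for whatever unit strings the vocabulary actually allows. The 2020-12 meta-schema and vocabulary URIs are the published ones.

```python
import json

UNITS_VOCAB_URI = "https://example.com/vocab/units"         # hypothetical (step 1)
UNITS_META_URI = "https://example.com/meta/units"           # hypothetical (step 2)
DIALECT_URI = "https://example.com/meta/schema-with-units"  # hypothetical (step 3)
DRAFT_2020_12 = "https://json-schema.org/draft/2020-12/schema"

# Step 2: vocabulary meta-schema that constrains the syntax of `units`.
units_meta_schema = {
    "$schema": DRAFT_2020_12,
    "$id": UNITS_META_URI,
    "$dynamicAnchor": "meta",
    "title": "Units vocabulary meta-schema",
    "type": ["object", "boolean"],
    "properties": {
        "units": {"type": "string", "enum": ["m/s", "knots", "millibars"]},
    },
}

# Step 3: meta-schema combining 2020-12 with the units vocabulary.
dialect_meta_schema = {
    "$schema": DRAFT_2020_12,
    "$id": DIALECT_URI,
    "$dynamicAnchor": "meta",
    "$vocabulary": {
        "https://json-schema.org/draft/2020-12/vocab/core": True,
        "https://json-schema.org/draft/2020-12/vocab/applicator": True,
        "https://json-schema.org/draft/2020-12/vocab/unevaluated": True,
        "https://json-schema.org/draft/2020-12/vocab/validation": True,
        "https://json-schema.org/draft/2020-12/vocab/format-annotation": True,
        "https://json-schema.org/draft/2020-12/vocab/content": True,
        "https://json-schema.org/draft/2020-12/vocab/meta-data": True,
        UNITS_VOCAB_URI: True,
    },
    "allOf": [
        {"$ref": DRAFT_2020_12},
        {"$ref": UNITS_META_URI},
    ],
}

# A schema opts in by naming the new meta-schema in `$schema`.
wind_speed_schema = {"$schema": DIALECT_URI, "type": "number", "units": "m/s"}

for document in (units_meta_schema, dialect_meta_schema, wind_speed_schema):
    print(json.dumps(document, indent=2))
```

Whether a given implementation accepts a `$vocabulary` entry it does not recognize with a value of true depends on that implementation, so an extension or registration step may still be needed.
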

sm-Fifteen commented 1 year ago

> Then you would create a custom meta-schema for the vocabulary that defines syntactic requirements of units. For example, if you only want the strings you list above, just put those values in an enum in the vocabulary meta-schema. This vocabulary meta-schema would also need a URI (different from the vocabulary URI) that will be used later.

What if one wanted to create some kind of composable string format for units, matching the style of (or some subset of) Unicode CLDR?

To put the problem differently, if I wanted to validate compiler target triplets, could I define my vocabulary so this arbitrary string format (where there is a limited number of slots, each with a limited number of possible values) would be somehow encoded in the machine-readable document, or would the format of that string be completely up to human interpretation? I'm sure a very large enum might be an option, but that sounds inefficient, and I'm wondering whether JSON Schema has anything like TypeScript's composable literals for this sort of thing.

Sorry if this is an obvious question, I'm still trying to grasp how vocabularies fit into the bigger picture and what role they fill, and the linked json-everything documentation page is only partly clearing things up.

gregsdennis commented 1 year ago

> What if one wanted to create some kind of composable string format for units, matching the style of (or some subset of) Unicode CLDR?

You can do this if you create a custom units keyword, but validation of the content of that keyword could only be done by the meta-schema if you use a regex (pattern keyword). I imagine that regex would be pretty nasty.

If you really want to get into the weeds, you can create your own variant of the pattern keyword (e.g. cldr) that validates your composable strings with a second new vocab and use that in your meta-schema's meta-schema. Then you would use that as a validation of units, but then you're creating a validation keyword (cldr) and an annotation keyword (units), each in their own vocabs, each with their own meta-schemas, not to mention the extensions you'd need to code for the implementations you're using.

Can it be done? Probably, yeah. But it's not a simple exercise. JSON Schema isn't designed to validate string composition.

I think the better bet is to accept the value of units as an annotation and just validate its value in code. Either way, you're still writing the code to perform the validation (whether in application code or implementation extension code). I think the only reason to go through the exercise of building the vocabularies and meta-schemas is so that it can be reused.
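
A minimal Python sketch of what "validate its value in code" might look like for a CLDR-style composable string. The grammar here is a deliberately simplified assumption, not the full CLDR unit identifier syntax, and the prefix and base-unit lists are tiny illustrative subsets. `is_valid_unit` is a hypothetical helper.

```python
import re

# Illustrative subsets only; CLDR defines many more prefixes and base units.
POWERS = ("square-", "cubic-")
PREFIXES = ("kilo", "centi", "milli", "micro")
BASE_UNITS = ("meter", "gram", "second", "hour", "kelvin", "pascal")


def _alternation(options):
    """Build a non-capturing regex group matching any one of `options`."""
    return "(?:" + "|".join(re.escape(option) for option in options) + ")"


# One unit: optional power, optional SI prefix, then a base unit.
SINGLE = _alternation(POWERS) + "?" + _alternation(PREFIXES) + "?" + _alternation(BASE_UNITS)
# A product of units joined by "-", optionally followed by "-per-" and another product.
PRODUCT = SINGLE + "(?:-" + SINGLE + ")*"
UNIT_RE = re.compile(PRODUCT + "(?:-per-" + PRODUCT + ")?")
# Private-use units, using the xxx- prefix mentioned earlier in the thread.
PRIVATE_RE = re.compile(r"xxx-[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_unit(value):
    """Return True if `value` fits the simplified composable grammar above."""
    return bool(UNIT_RE.fullmatch(value) or PRIVATE_RE.fullmatch(value))


for candidate in ("meter-per-second", "kilogram-per-cubic-meter", "xxx-dobson", "knots"):
    print(candidate, is_valid_unit(candidate))
```

Keeping this in application code avoids the second vocabulary and custom keyword described above, at the cost of the check not travelling with the schema.
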