VladimirAlexiev commented 4 years ago

Why?

It's hard to work with quantities (value + UoM) in RDF and SPARQL.

There are about 10 UoM ontologies:

the worse of them are just a list of units
the better of them also add dimensionality analysis (eg length is L^1, area is L^2) and conversion factors (eg from cm to m, from degF to degC)

Working with units in SPARQL is quite hard. Comparing compatible units or doing arithmetics on units is possible if you are working with one of the better ontologies, but difficult. You have to fetch the dimension vectors and conversion factors and work with them, and the queries become very complex.

SHACL's modest arithmetic capabilities (eg minInclusive to compare to constant, lessThan to compare two props) borrow from SPARQL, so it's impossible to state "temperature should be between 0 and 10 degC", see https://lists.w3.org/Archives/Public/public-shacl/2020Nov/0001.html

But there is one approach that solves these problems.

Previous work

The Unified Code for Units of Measure (UCUM) http://unitsofmeasure.org/ucum.html codifies all kinds of units and guarantees unambiguous interpretation (no unit label conflicts).
The Java UCUM library implements this system
Linked Data Types (LINDT) uses this and SPARQL datatype handlers to implement it in SPARQL

LINDT is unique in that it encodes both value and unit in one literal, eg "1 m"^^cdt:ucum, "100 cm"^^cdt:ucum. This is economical, but more importantly you can compare such quantities, and you can also do arithmetic operations on quantities.
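
As a rough illustration of why a single "value unit" literal can still be compared, here is a minimal Python sketch. It is not how LINDT or the Java UCUM library work internally: the `UNITS` table, `parse_quantity` and `compare` are hypothetical names, and the table covers only a handful of units where a real implementation would delegate to a full UCUM parser.

```python
from decimal import Decimal

# Hypothetical mini-table: unit symbol -> (dimension, factor to the base unit).
# A real implementation would delegate to a full UCUM library instead.
UNITS = {
    "m": ("length", Decimal("1")),
    "cm": ("length", Decimal("0.01")),
    "km": ("length", Decimal("1000")),
    "g": ("mass", Decimal("1")),
    "kg": ("mass", Decimal("1000")),
}


def parse_quantity(lexical: str) -> tuple[str, Decimal]:
    """Split a lexical form such as "100 cm" into (dimension, value in base units)."""
    number, unit = lexical.split(maxsplit=1)
    dimension, factor = UNITS[unit]
    return dimension, Decimal(number) * factor


def compare(a: str, b: str) -> int:
    """Return -1, 0 or 1; raise ValueError if the quantities are not commensurable."""
    dim_a, val_a = parse_quantity(a)
    dim_b, val_b = parse_quantity(b)
    if dim_a != dim_b:
        raise ValueError(f"cannot compare {dim_a} with {dim_b}")
    return (val_a > val_b) - (val_a < val_b)
```

With this table, `compare("1 m", "100 cm")` treats the two literals as equal, while comparing a length with a mass is rejected.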

home: https://ci.mines-stetienne.fr/lindt/index.html
playground: https://ci.mines-stetienne.fr/lindt/playground.html
spec: https://ci.mines-stetienne.fr/lindt/v2/custom_datatypes.html, https://ci.mines-stetienne.fr/lindt/spec.html
implemented in a Jena fork: https://github.com/thesmartenergy/jena, https://github.com/OpenSensingCity/jena-ucum

This would be very useful for any sort of application in engineering, smart cities, semantic sensor networks, WoT, etc.

@maximelefrancois86 tells me it even supports complex numbers, which are important in some electricity applications, is that correct? Can you give an example?

Features https://ci.mines-stetienne.fr/lindt/v2/custom_datatypes.html#on-apache-jena

Overload of SPARQL operators (=, <, etc.) to compare measurement literals;
Overload of algebraic functions (+, -, *, /) to manipulate measurement literals:
- Add two commensurable measurement literals
- Subtract a measurement literal from a commensurable one
- Multiply two measurement literals, or a measurement literal and a scalar (xsd:int, xsd:decimal, xsd:float, xsd:double)
- Divide a measurement literal by a measurement literal, a measurement literal by a scalar, or a scalar by a measurement literal
Custom SPARQL function lindt:sameDimension(arg1,arg2) to check if two measurement literals are commensurable (returns a xsd:boolean).
Cast to XSD numeric datatypes
dynamic loading of new datatypes/units
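
The operator overloads listed above can be sketched in Python as follows. This only illustrates the idea and is not the Jena implementation: the `Quantity` class, the `UNITS` table and the three-component dimension vector (length, mass, time) are assumptions made for the example, and division of a scalar by a quantity is left out.

```python
from dataclasses import dataclass
from fractions import Fraction

# unit symbol -> (factor to coherent base units, (length, mass, time) exponents).
# Three dimensions are a deliberate simplification of the usual seven.
UNITS = {
    "m": (Fraction(1), (1, 0, 0)),
    "cm": (Fraction(1, 100), (1, 0, 0)),
    "s": (Fraction(1), (0, 0, 1)),
    "min": (Fraction(60), (0, 0, 1)),
    "kg": (Fraction(1), (0, 1, 0)),
}


@dataclass(frozen=True)
class Quantity:
    value: Fraction  # magnitude expressed in coherent base units
    dim: tuple       # dimension vector

    @classmethod
    def of(cls, number, unit):
        factor, dim = UNITS[unit]
        return cls(Fraction(number) * factor, dim)

    def _check(self, other):
        # Comparison, + and - are only defined for commensurable quantities.
        if self.dim != other.dim:
            raise TypeError("quantities are not commensurable")

    def __add__(self, other):
        self._check(other)
        return Quantity(self.value + other.value, self.dim)

    def __sub__(self, other):
        self._check(other)
        return Quantity(self.value - other.value, self.dim)

    def __lt__(self, other):
        self._check(other)
        return self.value < other.value

    def __mul__(self, other):
        if isinstance(other, Quantity):
            dim = tuple(a + b for a, b in zip(self.dim, other.dim))
            return Quantity(self.value * other.value, dim)
        return Quantity(self.value * Fraction(other), self.dim)

    def __truediv__(self, other):
        if isinstance(other, Quantity):
            dim = tuple(a - b for a, b in zip(self.dim, other.dim))
            return Quantity(self.value / other.value, dim)
        return Quantity(self.value / Fraction(other), self.dim)


def same_dimension(a: Quantity, b: Quantity) -> bool:
    """Rough analogue of lindt:sameDimension for this sketch."""
    return a.dim == b.dim
```

For example, `Quantity.of(1, "m") + Quantity.of(50, "cm")` equals `Quantity.of(150, "cm")`, and dividing 120 m by 1 min yields a quantity with value 2 and the dimension vector of a speed.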

LINDT is very ingenious and it's a pity that it hasn't found a wider following.

It's implemented as a Jena branch but hasn't been merged into trunk: "This branch is 14 commits ahead, 5325 commits behind apache:master."
We have been looking for a pretext (i.e. client) to implement it in rdf4j.
It's adopted in some ontologies, but these are very few:
- CoCoOn: Cloud Computing Ontology for IaaS Price and Performance Comparison
- VSSo: Vehicle Signal and Attribute Ontology
- others?

Proposed solution

Adopt LINDT as a best practice for representing units. Work with other communities (WoT, semantic sensors) to also adopt it.

Considerations for backward compatibility

No direct consequences because it uses custom datatype handlers to do its work. I.e. if you don't use the CDT datatypes (cdt:ucum, cdt:length, etc) you'll see no difference.

However, guidance and solution templates for migrating from other systems for representing units should be provided

JervenBolleman commented 4 years ago

I have been thinking about this exact thing. However, I was thinking that all of UCUM might be too much work to demand, and it is currently awkward licensing-wise (section 7 of the license). So I was wondering whether the 7 base units of the International System of Units plus the 21 coherent derived named units might be a sufficient lower bound for implementation.

unit	proposed datatype	example of idea
second	unit:s	60^^unit:s
meter	unit:m	1.99^^unit:m
kilogram	unit:kg	88^^unit:kg
Ampere	unit:A
Kelvin	unit:K	273.1^^unit:K
mol	unit:mol
candela	unit:cd
hertz	unit:Hz
radian	unit:rad
steradian	unit:sr
newton	unit:N
pascal	unit:Pa
joule	unit:J
watt	unit:W
coulomb	unit:C
volt	unit:V
farad	unit:F
ohm	unit:Ω
siemens	unit:S
weber	unit:Wb
tesla	unit:T
henry	unit:H
lumen	unit:lm
lux	unit:lx
becquerel	unit:Bq
gray	unit:Gy
sievert	unit:Sv
katal	unit:kat

The implementation advantage is that these are simple numeric values: as long as the datatype is the same, cast to decimal and compare.

There are no prefixes; every value must be converted to the base unit, e.g. 600 km should be stored as "600000"^^unit:m. Derived units must always be in coherent form for this to work (i.e. also without prefixes). The advantage of scaling down to base units is simpler comparison functions and indexes that are easier to generate.

The list misses Celsius, which is trivial to add but would need to be comparable to Kelvin. Also, the ohm symbol Ω is outside ASCII, so unit:Ohm could be an option.

Basically, the cost is in converting to base units at storage. However, consistent storage makes it easier to build indexes and ensure query results are correct.
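
The conversion at storage time could look roughly like the following Python sketch. The function name `to_base`, the prefix subset and the unit subset are assumptions for illustration. Mass is left out on purpose, because the coherent unit (kg) itself carries a prefix and would need special handling.

```python
from decimal import Decimal

# A subset of SI prefixes as powers of ten; only one-letter prefixes are handled.
PREFIX_EXPONENTS = {"G": 9, "M": 6, "k": 3, "h": 2, "d": -1, "c": -2, "m": -3, "n": -9}
# A few of the prefix-free symbols from the table above.
COHERENT_UNITS = {"m", "s", "A", "K", "mol", "Hz", "N", "Pa", "J", "W", "V"}


def to_base(value: str, unit: str) -> tuple[Decimal, str]:
    """Return (value, unit) rescaled to the prefix-free unit."""
    if unit in COHERENT_UNITS:
        return Decimal(value), unit
    prefix, symbol = unit[:1], unit[1:]
    if prefix in PREFIX_EXPONENTS and symbol in COHERENT_UNITS:
        return Decimal(value) * Decimal(10) ** PREFIX_EXPONENTS[prefix], symbol
    raise ValueError(f"unsupported unit: {unit}")
```

Under these assumptions `to_base("600", "km")` gives 600000 with unit `m`, which a store could then write out as a literal with the prefix-free datatype.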

This of course leaves out all the derived units, e.g. kg/m, m/s^2 etc., which I think would be very worthwhile to have but might be too much work to implement for all commonly used ones, unless they follow a straightforward pattern in IRI encoding that stores can decode and generate on the fly, e.g. division "1"^^unit:kg / "1"^^unit:m => "1"^^unit:kg-per-m and multiplication "2"^^unit:m * "2"^^unit:m => "4"^^unit:m2
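
One way such on-the-fly naming might work is sketched below in Python. The "-per-" separator and the exponent suffix follow the two examples above. Everything else is an assumption of this sketch: the function names, joining several symbols with "-", ordering symbols alphabetically, and using "1" for a dimensionless result.

```python
def multiply(a: dict, b: dict) -> dict:
    """Multiply two units given as {symbol: exponent} maps."""
    result = dict(a)
    for symbol, exponent in b.items():
        result[symbol] = result.get(symbol, 0) + exponent
    # Drop symbols that cancelled out.
    return {s: e for s, e in result.items() if e != 0}


def divide(a: dict, b: dict) -> dict:
    """Divide unit a by unit b by negating b's exponents."""
    return multiply(a, {s: -e for s, e in b.items()})


def local_name(unit: dict) -> str:
    """Render an exponent map as a datatype local name such as "kg-per-m"."""
    def part(symbol, exponent):
        return symbol if exponent == 1 else f"{symbol}{exponent}"

    numerator = [part(s, e) for s, e in sorted(unit.items()) if e > 0]
    denominator = [part(s, -e) for s, e in sorted(unit.items()) if e < 0]
    name = "-".join(numerator) or "1"
    if denominator:
        name += "-per-" + "-".join(denominator)
    return name
```

With `kg = {"kg": 1}` and `m = {"m": 1}`, `local_name(divide(kg, m))` yields `kg-per-m` and `local_name(multiply(m, m))` yields `m2`. Decoding a name back into an exponent map would need the inverse function, which is not shown.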

Also, the use of datatypes avoids one of the issues with UCUM: conflicts in coding exist, and the coding of customary units is not straightforward. See fluid ounce, where a datatype pointing to the fluid ounce definition would be a clearer option, especially as the fluid ounce has been redefined several times for legal reasons. It's 30 ml or 23 1/3 grams of pure alcohol in some US food standards (TODO: find again an article showing the many redefinitions of US fluid ounce). UCUM is widely specified in clinical settings but in reality not always used (even where it was specified).

Side notes

all unit values should inherit from xsd:decimal.
we should have power and square root operators

VladimirAlexiev commented 4 years ago

@JervenBolleman what licensing problems do you see?

UCUM has customary units that are important in many disciplines. If you ask users to always store angstroms and light-years as meters, you are shifting burden to them.

And even "decorative" units like 123 {rbc} which is a count (dimensionless) but of "red blood cells".

LINDT is implemented in Jena, and @jeenbroekstra says it won't be too hard to port to rdf4j. So "downgrading" to your proposal will be more work for Java devs.

So that leaves other languages. The question is whether there are UCUM libraries in other languages.

HolgerKnublauch commented 4 years ago

I much prefer using explicit datatypes to encoding the unit into the string. E.g. "1"^^unit:m is better than "1 m"^^ucum:unit.

But a more general solution might be to offer a declarative extension point so that anyone can define custom datatypes and those datatypes actually can be used consistently. This could work similar to user-defined SHACL constraint components or SHACL-AF functions. Just some very quick thoughts, a datatype might need to be able to respond to questions "can I compare my value to another datatype" (e.g. yes for mm to m comparison), and then a normalize function that would bring all datatypes from a group to a common base unit, e.g. meter. Then things like < comparison in SPARQL can be automated. The actual business logic can probably be covered declaratively through a couple of properties that are attached to the units as done (comprehensively) in the QUDT vocabulary.
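
To make the two building blocks concrete, here is a minimal Python sketch of "can I compare these datatypes" plus a normalize step. The `DATATYPES` table stands in for declarations that would live in RDF (for instance as properties attached to each datatype IRI). The datatype names, group names and function names are all hypothetical.

```python
from decimal import Decimal

# Hypothetical declarations: datatype -> (group, multiplier to the group's base unit).
DATATYPES = {
    "unit:m": ("length", Decimal("1")),
    "unit:mm": ("length", Decimal("0.001")),
    "unit:km": ("length", Decimal("1000")),
    "unit:s": ("time", Decimal("1")),
    "unit:h": ("time", Decimal("3600")),
}


def comparable(datatype_a: str, datatype_b: str) -> bool:
    """Two datatypes are comparable when they belong to the same group."""
    return DATATYPES[datatype_a][0] == DATATYPES[datatype_b][0]


def normalize(lexical: str, datatype: str) -> Decimal:
    """Bring a value to the common base unit of its group."""
    return Decimal(lexical) * DATATYPES[datatype][1]


def less_than(a: tuple[str, str], b: tuple[str, str]) -> bool:
    """Generic '<' over (lexical form, datatype) pairs."""
    (lexical_a, datatype_a), (lexical_b, datatype_b) = a, b
    if not comparable(datatype_a, datatype_b):
        raise TypeError(f"{datatype_a} and {datatype_b} are not comparable")
    return normalize(lexical_a, datatype_a) < normalize(lexical_b, datatype_b)
```

The engine-side code (`comparable`, `normalize`, `less_than`) never mentions a specific unit. Only the declarations do, which is the point of the proposal.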

The advantage here is that a SPARQL 1.2 would only need to implement a few generic building blocks while the details of the specific datatypes are irrelevant, and we don't even need to discuss the specific catalog of datatypes that need to be implemented.

VladimirAlexiev commented 4 years ago

Connecting UCUM unit symbols to ontologies like QUDT or OM is important and useful because they expose as triples info that is within the UCUM library (eg the dimension vector of newton and the conversion factors of Fahrenheit). Plus extra info, eg that inch is an imperial unit, grouping of units by discipline, etc. I believe QUDT or OM already has UCUM codes, so that should not be hard.

So we could spec custom functions to parse out a unit from a quantity, and connect to a structured unit node in such an ontology.

VladimirAlexiev commented 4 years ago

@HolgerKnublauch

"1"^^unit:m is better than "1 m"^^ucum:unit

In addition to the generic cdt:ucum, LINDT also has datatypes ucum:length, ucum:mass etc that represent quantities with fixed/known dimension.

But I see some problems with having distinct datatypes for each unit:

there are just too many. It's nearly a combinatorial explosion. Eg "barrels per day", "US barrels per hour", etc etc.
- Consider that just for dimensionless units, there are many variations such as percent, per mille, ppm (parts per million...).
- There are also annotations (advisory customary pieces), eg {rbc} (red blood cells), {pair} or {pairs} for socks, {packs} vs {masterboxes} for cigarettes, s {0..100 km/h} for car acceleration expressed as time to reach that speed, etc: see https://ucum.org/ucum.html#para-6
units use special symbols that will be unwieldy in URL local names or will become unreadable if you URL-encode them. Eg what datatype URLs would you translate the following units to? (they happen to express the same unit):
```
"km.h-1"^^cdt:ucumunit
"km/h"^^cdt:ucumunit
"(1000m)/(60min)"^^cdt:ucumunit
```

offer a declarative extension point

LINDT does that:

see https://ci.mines-stetienne.fr/lindt/spec.html and in particular https://ci.mines-stetienne.fr/lindt/spec.html#the-application-programming-interface
see https://ci.mines-stetienne.fr/lindt/v1/custom_datatypes.ttl for a declaration of a datatype cdt:length, and https://ci.mines-stetienne.fr/lindt/v1/custom_datatypes.js for an implementation in JS

BTW @maximelefrancois86 there are Broken links at https://ci.mines-stetienne.fr/lindt/v1/custom_datatypes:

http://w3id.org/lindt/custom_datatypes.ttl redirects to https://ci.mines-stetienne.fr/lindt/v3/custom_datatypes.ttl, which does not exist
http://w3id.org/lindt/custom_datatypes.js redirects to https://ci.mines-stetienne.fr/lindt/v3/custom_datatypes.js, which does not exist

In contrast, both of https://ci.mines-stetienne.fr/lindt/v1/custom_datatypes and https://ci.mines-stetienne.fr/lindt/v3/custom_datatypes exist.

namedgraph commented 4 years ago

"(1000m)/(60min)"^^cdt:ucumunit -- this is simply not a structured way of describing units, which goes against the RDF practice.

VladimirAlexiev commented 4 years ago

@namedgraph This unit is well structured and well defined according to https://ucum.org/ucum.html (and implemented in Java UCUM and consequently in LINDT). It's just not structured in RDF.

NOT everything needs or should be structured in RDF. Eg are you against GeoSPARQL literals (WKT and GML)?

maximelefrancois86 commented 4 years ago

Dear all,

@namedgraph , there are sometimes good rationale to encode complex values using literals instead of relying on RDF structures and basic datatypes. The OGC GeoSPARQL datatype geo:WKTLiteral is a great example.

"<http://www.opengis.net/def/crs/OGC/1.3/CRS84> Polygon((-83.6 34.1, -83.6 34.5, -83.2 34.5, -83.2 34.1, -83.6 34.1))"^^geo:WKTLiteral

I back up the thoughts of @VladimirAlexiev :

I think having a unique datatype cdt:ucum would be the simplest choice. SPARQL engines would only need to recognise one additional datatype IRI, and could defer to the UCUM specification for the list of base units and for how compound units can be formed. There exist implementations of UCUM in common programming languages.
In our implementation on apache Jena, we also included:
- overload of SPARQL operators (=, <, etc.) to compare measurement literals;
- overload of algebraic function (+, -, *, /) to manipulate measurement literals:
- a custom SPARQL function with IRI: http://w3id.org/lindt/custom_datatypes#sameDimension(arg1, arg2) to check if two measurement literals are commensurable (returns a xsd:boolean).
- cast to XSD numeric datatypes

namedgraph commented 4 years ago

OK fine, WKT literals is a counter-example. But GeoSPARQL is an additional standard, not part of the SPARQL spec. And you seem to have done the same with units. So what's the problem? Why does it need to be in SPARQL 1.2 proper?

maximelefrancois86 commented 4 years ago

I am neutral on this. I also think it would be fine to have such a datatype specified in a separate document.

VladimirAlexiev commented 4 years ago

I don't see this becoming part of SPARQL 1.2. As I said "Adopt LINDT as a best practice" and work with other communities to adopt it.

@maximelefrancois86

can you give an example of complex numbers used for electrical quantities?
see "Broken links" above

maximelefrancois86 commented 4 years ago

Thank you @VladimirAlexiev , we are on the same page.

For complex numbers: we just had a very good first-year Master's student who worked on this during her 3-month internship this year: Yana Soares de Paula. https://www.linkedin.com/in/yanaspaula/ She did an excellent job in just three months, but more work would be needed to augment cdt:ucum with complex numbers. She would probably be happy to share her report with you if you wish.

About the broken links in lindt v1, I'll create an issue and check asap. Thanks for the notice.

VladimirAlexiev commented 4 years ago

There exist implementations of UCUM in common programming languages

Eg there seem to be 2 for JS:

It appears UCUM is the dominant UoM system in life sciences.

https://ucum.nlm.nih.gov/ says "UCUM has been adopted internationally by many organizations such as IEEE, DICOM, LOINC, and HL7, and is also in the ISO 11240:2012 standard" (and in addition to the JS implementation it has more resources like validation and autocomplete).
https://www.hl7.org/fhir/ucum.html
https://danielvreeman.com/units-of-measure-conversion-validation/

maximelefrancois86 commented 4 years ago

See also

Python https://pypi.org/project/pyucum/
Java https://github.com/unitsofmeasurement/uom-systems/
Java https://github.com/FHIR/Ucum-java
C# https://github.com/mnisl/OD
Rust https://github.com/agrian-inc/wise_units

Maybe there are more

dr-shorthair commented 4 years ago

Connecting UCUM unit symbols to ontologies like QUDT

@VladimirAlexiev Most QUDT units now have UCUM codes in their description, so correlation of these two systems is already available. (I did this work in the last few months.) The ones that are missing do not have equivalent UCUM codes, so there is nothing on that side to correlate with.

QUDT gives you explicit dimension-vectors, and conversion factors (and offsets, where appropriate).

I'm also on the UCUM Advisory Board, and the licensing issue is high on the agenda. Though the current UCUM Terms of Use look a bit fierce at first glance, I have been assured that the kind of usage that is envisaged here is totally fine, and the intention is to make this more clear in the license.

dr-shorthair commented 4 years ago

On the matter of style: I vote with @HolgerKnublauch in favour of

273^^ucum:K

compared with

"273 K"^^cdt:ucum

It does not require a string to be parsed, so basic SPARQL queries can be used, detecting the datatype, but without regexing strings.

dr-shorthair commented 4 years ago

@JervenBolleman I think your table matches the UCUM codes, except for Ohm for Ω (note case). That is no surprise as UCUM was designed to use the common codes as far as possible.

This XML representation is the reference for the UCUM terminals.

sa-bpelakh commented 4 years ago

It does seem like the best way to adhere to standard and industry-wide use while avoiding mixing domain-specific aspects into the generic SPARQL standard: this should be an auxiliary standard, like GeoSPARQL. The complex cdt:ucum literals are quite similar to WKT in this aspect.

dr-shorthair commented 4 years ago

Scaling factors for scalar quantities are not domain specific at all.

In fact I'd argue it is a notable failure of almost all computer languages that this is not built-in. There are very few pure 'floating point' numbers, or 'decimals' that can be understood without knowing the unit-of-measure.

I'm totally fine with embedding coordinate sequences in a microformat, since they have no meaning considered independently. I was in the team that standardized GeoSPARQL and am very comfortable with the design choice. But scalar quantities are a very different matter, and much more simple.

HolgerKnublauch commented 4 years ago

I also think that units are a different topic than GeoSPARQL. Users should expect to perform comparisons using the built-in < and > operators, and possibly to do arithmetics such as + and * on unit'ed (is this a word?) values. This might of course just become a matter of enough implementations agreeing on a de-facto standard, but it shouldn't be too hard to agree on a mechanism at least for the most common units in a SPARQL 1.2. Once it's in SPARQL then related standards such as SHACL would automatically "inherit" these features, e.g. for sh:minInclusive.

JervenBolleman commented 4 years ago

@dr-shorthair I expanded my comment, changed to unit:Ohm, whose casing is inconsistent over standards.

@HolgerKnublauch I also think easier support by stores for custom datatypes would be very nice. And would make implementing this feature cheaper for everyone. Let's open a separate issue for easier custom datatypes. (Also easier sharing of custom function definitions).

@VladimirAlexiev I would love full UCUM support for some projects I am involved in. I am just worried that it would be too large a code base for independent smaller SPARQL communities to implement. Also, I think we end up with a downstream licensing issue with UCUM until their license is changed, which might take a long time.

kasei commented 4 years ago

@JervenBolleman I could see standardization of service description vocabulary terms for describing which custom datatypes are supported. Beyond that, though, wouldn't "easier custom datatypes" be an issue for individual implementations (and not something the spec can/should concern itself with)? What would spec involvement in this area look like?

VladimirAlexiev commented 4 years ago

@dr-shorthair where do you have mapping tables QUDT-UCUM showing in particular the gaps on either side?

As I wrote above, it's useful to have in RDF (QUDT) what UCUM libraries provide in code.

Please comment on how you would represent the variety of UCUM strings (including annotations in curlies) as datatype URLs. You picked the easiest case K.

@sa-bpelakh Agreed! As I said above, this can only be a recommended best practice, can't be part of the SPARQL spec.

@HolgerKnublauch

perform comparisons using the built-in < and > operators, and possibly to do arithmetics such as + and * on unit'ed (is this a word?)

For comparison, + and - you need commensurate quantities (having the same dimensionality). You can apply * and / to any quantities, and also between quantities and simple numbers.

LINDT does all that.

sh:minInclusive

Yes! And sh:lessThan

@JervenBolleman

too large a code base for independent smaller SPARQL communities to implement

UCUM has implementations in many languages, they should leverage such implementations. LINDT uses UCUM Java and hooks up into Jena datatype handlers to override SPARQL operators.

kasei commented 4 years ago

too large a code base for independent smaller SPARQL communities to implement

UCUM has implementations in many languages, they should leverage such implementations. LINDT uses UCUM Java and hooks up into Jena datatype handlers to override SPARQL operators.

As an implementor of several SPARQL systems in less popular languages, I join @JervenBolleman in concern at the implementation burden. Just because something like this has implementations in several languages does not mean there wouldn't be a real cost added to many existing (and possibly future!) systems.

ashleysommer commented 4 years ago

I'm a maintainer of the Python RDFLib (including its SPARQL executor) and developer of PySHACL.

I agree with @VladimirAlexiev on this one. After reading the UCUM Spec I don't see how individual 273^^ucum:K could work for all of the possible combinations of units of measurement allowed by UCUM.

You'd need a string representation like "273 K"^^cdt:ucum. A simple example is 10 millimeters of mercury (for pressure measurement), "10 mm[Hg]"^^cdt:ucum; a more extreme example is "ventricular stroke work" in "gramforce-meter per heartbeat per square meter", "4 gf.m/({hb}.m2)"^^cdt:ucum.

While the set of units defined in UCUM is closed, the microformat is created in such a way that adding new units (in a subsequent version) is easy and predictable. If every unit in the current spec was pulled out into a discrete datatype in the ucum ontology, that would need to be updated whenever a new unit is added to UCUM.

VladimirAlexiev commented 4 years ago

Here is the offending license clause:

Subject to Section 1 and the other restrictions hereof, users may incorporate portions of the UCUM table and definitions into another master term dictionary (e.g. laboratory test definition database), or software program for distribution outside of the user's corporation or organization, provided that any such master term dictionary or software program includes the following fields reproduced in their entirety from the UCUM table: UCUM code, definition value and unit. Every copy of the UCUM table incorporated into or distributed in conjunction with another database or software program must include the following notice:

“This product includes all or a portion of the UCUM table, UCUM codes, and UCUM definitions or is derived from it, subject to a license from Regenstrief Institute, Inc. and The UCUM Organization. Your use of the UCUM table, UCUM codes, UCUM definitions also is subject to this license, a copy of which is available at http://unisofmeasure.org. The current complete UCUM table, UCUM Specification are available for download at http://unitsofmeasure.org. The UCUM table and UCUM codes are copyright © 1995-2013, Regenstrief Institute, Inc. and the Unified Codes for Units of Measures (UCUM) Organization. All rights reserved.

THE UCUM TABLE (IN ALL FORMATS), UCUM DEFINITIONS, AND SPECIFICATION ARE PROVIDED "AS IS." ANY EXPRESS OR IMPLIED WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.”

If the master term dictionary or software program containing the UCUM table, UCUM definitions and/or UCUM specification is distributed with a printed license, this statement must appear in the printed license. Where the master term dictionary or software program containing the UCUM table, UCUM definitions, and/or UCUM specification is distributed on a fixed storage medium, a text file containing this information also must be stored on the storage medium in a file called "UCUM_short_license.txt". Where the master term dictionary or software program containing the UCUM table, UCUM definitions, and/or UCUM specification is distributed via the Internet, this information must be accessible on the same Internet page from which the product is available for download.

HOWEVER, see the comment above that the UCUM committee says it's ok to use UCUM as described in this issue. I.e. not to get too hung up on this legalese (which is indeed pretty bad, compared to modern open licences)

dr-shorthair commented 4 years ago

@VladimirAlexiev

where do you have mapping tables QUDT-UCUM showing in particular the gaps on either side?

get https://github.com/qudt/qudt-public-repo/blob/master/vocab/unit/VOCAB_QUDT-UNITS-ALL-v2.1.ttl

Run

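PREFIX qudt: <http://qudt.org/schema/qudt/>
# List every non-currency QUDT unit together with its UCUM code, if it has one.
SELECT *
WHERE {
  ?p a qudt:Unit .
  MINUS { ?p a qudt:CurrencyUnit . }
  OPTIONAL { ?p qudt:ucumCode ?u . }
}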

I don't think there are any gaps on the QUDT side relative to the UCUM terminals, but since UCUM does not define a closed set (novel combinations are always possible) there will not be a member in the QUDT catalogue for every arbitrary UCUM code.

dr-shorthair commented 4 years ago

Regarding the license, remember that (notwithstanding the use of XML for the reference data) UCUM was developed in a pre-linked-data, and pre-CC world. And since UCUM is widely built in to medical and clinical software, there was a concern to ensure that there not be any muddling of libraries and conversions. That would be a real problem. As I am in contact with both the QUDT and UCUM maintainers, clarifying the license is a priority to clear up any issues about the appearance of UCUM codes alongside a separately derived set of conversion factors. But this should definitely NOT impede mentioning UCUM codes in RDF and SPARQL.

Note that the US National Library of Medicine is now providing the main support for UCUM, including an API and a Javascript library.

dr-shorthair commented 4 years ago

@ashleysommer

I agree with @VladimirAlexiev on this one. After reading the UCUM Spec I don't see how individual 273^^ucum:K could work for all of the possible combinations of units of measurement allowed by UCUM.

Indeed, we do need to think this through.

For the pattern that @HolgerKnublauch and I advocate, the ucum:XXXX datatype reference must implicitly point to a member of an (unbounded) set of codes, composed of every possible 'legal' combination of the UCUM terminals. I don't think I've seen a datatype reference used in that way before.

Also, I haven't checked what the possible escaping implications of the UCUM grammar are for a token appearing in that position in an RDF graph. / . ( ) [ ] { } ' " - + are all used in UCUM codes. / ( ) can mostly be avoided by using the option of negative exponents and dots, but there might be difficulty with some of the others.

JervenBolleman commented 4 years ago

I thought it useful to describe my gut feeling regarding UCUM in literals. In RDF we have always gone for things, not strings. Here I feel we are regressing somewhat; knowing already that there are known confusions in UCUM strings, I am personally hesitant to recommend it in a linked data setting. Specifically, the way that UCUM specifies behaviour towards annotations {} is ripe for issues in implementations/comparisons (if I read it wrong, let me know).

dr-shorthair commented 4 years ago

The messiness around {annotations} is certainly a thing.

However, in my opinion the primary challenge is that the actual gamut of units-of-measure used in practice is effectively infinite. And that is not a bug, it is a feature. So any approach to units of measure has to address that.

The good news is that it is not intractable. You can define an algorithm to build the symbol, using a finite set of atoms (hundreds, but not thousands), so that it looks exactly like the common symbol for common units. This is all laid out in ISO 80000 and adapted for the ASCII keyboard by UCUM. There is an unambiguous mapping from any UCUM code to its semantics, though the codes for many derived units are not unique (to take a simple case, metres-per-second can be written either m/s or m.s-1).
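
The point that different codes map to the same semantics can be illustrated with a small Python sketch. It handles only a tiny subset of the UCUM syntax (atoms from a fixed table, the `.` and `/` operators applied left to right, and integer exponents), with no parentheses, annotations, numeric factors or general prefix handling. The `ATOMS` table and the function name `canonical` are assumptions of this sketch, not part of any UCUM library.

```python
import re
from fractions import Fraction

# atom -> (factor to SI, (length, time) exponents); prefixed units are simply
# listed as separate entries here instead of being decomposed.
ATOMS = {
    "m": (Fraction(1), (1, 0)),
    "km": (Fraction(1000), (1, 0)),
    "s": (Fraction(1), (0, 1)),
    "h": (Fraction(3600), (0, 1)),
}
COMPONENT = re.compile(r"([A-Za-z]+)([+-]?\d+)?")


def canonical(code: str) -> tuple[Fraction, tuple]:
    """Reduce a code such as "km.h-1" to (SI factor, dimension vector)."""
    factor, dim = Fraction(1), (0, 0)
    sign = 1
    # Split on the operators but keep them: "km.h-1" -> ["km", ".", "h-1"].
    for token in re.split(r"([./])", code):
        if token == ".":
            sign = 1
            continue
        if token == "/":
            sign = -1
            continue
        match = COMPONENT.fullmatch(token)
        if match is None or match.group(1) not in ATOMS:
            raise ValueError(f"unsupported component: {token!r}")
        atom_factor, atom_dim = ATOMS[match.group(1)]
        exponent = sign * int(match.group(2) or 1)
        factor *= atom_factor ** exponent
        dim = tuple(d + a * exponent for d, a in zip(dim, atom_dim))
    return factor, dim
```

Within this subset, `canonical("m/s")` and `canonical("m.s-1")` return the same pair, so two literals written with different codes can be recognised as using the same unit.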

I understand your concern about 'literals vs things', but I fear that it follows from the actual requirement - which I believe is to have a general way of representing the scale (i.e. unit of measure) for any quantity in RDF. If you can see another solution that satisfies the requirement and also meets the reasonable expectations of people who would be typing or selecting these things (e.g. scientists and engineers can already read or write m/s or m.s-1), then that would be great.

HolgerKnublauch commented 4 years ago

Just as input here: assuming we use units as datatypes rather than encoding them in strings, it becomes possible to use rdfs:range, owl:allValuesFrom or sh:datatype statements, e.g.

ex:Thing-width a sh:PropertyShape ; sh:path ex:width ; sh:datatype ucum:m .

ex:SomeThing ex:width "42"^^ucum:m .

versus

ex:SomeThing ex:width "42 m"^^ucum:unit .

The former appears less redundant, e.g. on user input widgets - why repeat yourself. But more importantly, with the latter someone would need to invent new ways of stating what kinds of values are permitted as values, e.g. a new SHACL constraint component for the kind of unit (length vs temperature, for example). This might still be needed in cases where users can choose between multiple units, but not every use case is like that.

If there is a well-defined mapping from unit variations to strings then I guess these strings can also be used for the local name of the datatype URIs. Once the datatypes have URIs, it arguably becomes also easier to attach additional machine-processable information to them such as the conversion factor to the base unit. With strings all this becomes rather hidden in non-declarative algorithms.

dr-shorthair commented 4 years ago

@ashleysommer

If every unit in the current spec was pulled out into a discrete datatype in the ucum ontology, that would need to be updated whenever a new unit is added to UCUM.

I don't think it is feasible to generate and maintain static semantic representations of the unit denoted by every UCUM code ever cited.

However, it is possible to dynamically (a) verify that a code is valid UCUM, and (b) give the conversion multiplier to the SI unit with the same dimension vector. The services at NLM do the checking and conversion, and QUDT can provide the dimension vector and the SI units (with a bit of help from SPARQL).

VladimirAlexiev commented 4 years ago

@JervenBolleman

known confusions in UCUM strings

Please give some pointers or examples

behaviour towards annotations {} is ripe for issues in implementations/comparisons

Annotations in particular are easy to handle: they don't change the meaning so you just discard them. And they cannot be nested.
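
A minimal Python sketch of that handling, relying on the fact that annotations cannot be nested. The function name `split_annotations` is made up for this example, and it returns the annotation texts alongside the bare code so that a caller can still inspect them. Using "1" for a code that consisted only of an annotation follows the reading that such a unit is dimensionless.

```python
import re

# Annotations are {...} groups; since they cannot be nested, a simple
# character-class pattern is enough.
ANNOTATION = re.compile(r"\{([^{}]*)\}")


def split_annotations(code: str) -> tuple[str, list[str]]:
    """Return (code without annotations, list of annotation texts)."""
    annotations = ANNOTATION.findall(code)
    bare = ANNOTATION.sub("", code)
    # A code that was nothing but an annotation denotes the unity unit.
    return (bare or "1"), annotations
```

For example, `split_annotations("1/s{frequency}")` gives `("1/s", ["frequency"])` and `split_annotations("{rbc}")` gives `("1", ["rbc"])`.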

More importantly, handling UCUM (parsing, conversions, etc) is a solved problem in major languages.

@HolgerKnublauch

ex:SomeThing ex:width "42 m"^^ucum:unit

LINDT includes the basic dimensional units eg cdt:length, and there's a function to check quantity conformance.

JervenBolleman commented 4 years ago

@ashleysommer @dr-shorthair I don't even think we would ever need to generate a file with all possible combinations. As long as we can make them linked data (there is of course the prior art of an infinite set of numbers being available as linked open data ;) ).

There is potential for a grammar to generate the datatype IRIs. The '{}' brackets are currently not escapable (not part of the set PN_LOCAL_ESC).

So today we can do, in valid SPARQL 1.1:

SELECT *
WHERE
   { ?measurement1 ex:value "1"^^unit:kg\/mass .
     ?measurement2 ex:value "2"^^unit:\% .
 }

We can't do

SELECT *
WHERE  { 
   ?measurement1 ex:value "1"^^unit:kg\/mass\{person\} .
   ?measurement2 ex:value "2"^^unit:\%\{vol\} .
 }

The {} brackets are probably not in the escape set because they are not allowed in IRIs per RFC 3987; however, the WHATWG URL specification used by HTML5 allows them. For SPARQL I feel we can follow HTML5.
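
Which unit strings survive as prefixed-name local parts can be checked mechanically. The Python sketch below escapes a string using the PN_LOCAL_ESC character set from the SPARQL 1.1 grammar and rejects anything outside it. The function name is hypothetical, and limiting unescaped characters to ASCII letters and digits is a simplification (the grammar allows more).

```python
# Characters that SPARQL 1.1 allows after a backslash in the local part of a
# prefixed name (the PN_LOCAL_ESC production).
PN_LOCAL_ESC = set("_~.-!$&'()*+,;=/?#@%")


def escape_local_name(code: str) -> str:
    """Escape a unit string for use after "unit:" in a SPARQL 1.1 query."""
    out = []
    for ch in code:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ch in PN_LOCAL_ESC:
            out.append("\\" + ch)
        else:
            raise ValueError(f"{ch!r} cannot be escaped in a SPARQL 1.1 local name")
    return "".join(out)
```

`escape_local_name("kg/m")` returns `kg\/m`, whereas `%{vol}` is rejected because of the curly brackets. Square brackets, as in `mm[Hg]`, are rejected for the same reason.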

JervenBolleman commented 4 years ago

@VladimirAlexiev See UCUM itself

UCUM annotations are easily discarded, but annotations do have semantics. {H.B.}/min is not the same as {drops}/min and just discarding is risky in a clinical setting.

Lots of things are solved in existing languages with known libraries. But in my opinion, that does not mean they should be copy-pasted into the SPARQL ecosystem. But don't let my opinion stop you from writing a SEP.

nicholascar commented 4 years ago

the microformat is created in such a way that adding new units (in a subsequent version) is easy and predictable

microformats in RDF === bad

I'm struggling with microformats for WKT etc. in GeoSPARQL 1.1. Please avoid microformats and use the main RDF methods of dealing with the meaning of data.

A user of RDF/SPARQL shouldn't have to resort to a micro format parser to understand any of the content in databases / queries, only graph walking and Linked Data dereferencing.

VladimirAlexiev commented 4 years ago

@nicholascar There are many successful microformats in RDF:

XSD datatypes, in particular dates, times...
language tags
strings (it's just a list of chars, isn't it?)
wktLiteral and gmlLiteral. It's no coincidence that all repos supporting region algebras use those formats. It's infeasible to implement region algebras except by special indexing of such literals. I'm surprised I need to give such an example to you of all people!
- I'm struggling with microformats for WKT etc. in GeoSPARQL 1.1: What are you struggling with, more specifically?
Turtle's rdf:List
Turtle predicate and object lists
Turtle's blank nodes (brackets)
numbers (you can implement them with the S,K combinators or lists, can't you?)

Just because you can represent many things in RDF doesn't mean you should. IMHO you should represent in RDF only "high-value data":

things that link to other things
things that you want to search by (and I don't
NOT things that you need to do special computing with

Normal people may say:

:said "WTF?"@zh-Hant-Latn-pynin;
:on "2020-11-10T12:05:01"^^xsd:dateTime;
:repeat "100 1/s{frequency}"^^cdt:ucum

Moderate RDF dogmatists that "only" want to replace the first 2 bullets above may have to say:

:said [
  a :Sentence;
  :kind :interrogative;
  :value [:chars ("W" "T" "F")];
  :language [
    a :LanguageSpecification;
    :lang iana_lang:zh;
    :dialect iana_lang:Hant;
    :script iana_lang:Latn;
    :latinization iana_lang:pynin
  ]];
:on [
  a :DateTime;
  :year 2020;
  :month 11;
  :day 10;
  :hour 12;
  :minutes 05;
  :seconds 01];
:repeat [
  a :Quantity;
  :value [
    a :Integer;
    :literal "100"];
  :unit ucum:1\/s\{frequency\}];

Just two more examples:

Smart people (the creators of XSD) added zeros in front of years so you can compare dates from 0001 to 9999 as strings, even if you don't have a working date/time comparison
Dogmatic people evict all XSD date & time datatypes from OWL (except xsd:dateTimeStamp which is totally comparable), replacing them with home-baked versions. And add owl:rational, which has no lexical representation.

Ontotext is a repository vendor so I have a vested interest to deal with more and more triples. Nevertheless, I believe that dogmatic approaches harm RDF: adding more triples and complexity turns people away from RDF.

There is https://github.com/w3c/EasierRDF ... but what you are arguing here is for harder RDF.

nicholascar commented 4 years ago

@VladimirAlexiev your comments both overstate my interests and understate my understanding. I don't mean we should break everything down into triples and I have argued for years to preserve some microformats, or non-RDF formats, for certain purposes. I would not suggest turning all literals into chained nodes.

The issues in GeoSPARQL 1.1 are about whether Coordinate Reference Systems need to be encoded in microformats, like GML/WKT, or should be broken out into other properties within Geometry elements.

I do think that units of measure are one of the scenarios that should be represented with RDF and not microformats. Conversion vectors, notation, etc. all fine to be microformats for UoM but the core ideas of numerical quantities, scales and so on should be in RDF. The reasoning is that we are talking about multiple orthogonal concerns within units, scales and so on of measure, and to be able to work out whether measurements across datasets are commensurate, i.e. can be used together properly, we need to be able to join on one or more of these orthogonal dimensions of measurements. If we don't call out those dimensions as first class things in RDF, we can't easily match them with graph pattern matching like SPARQL.

SELECT ?measurement
WHERE {
  ?measurement 
    ex:hasUnit <some-unit> ;
    ex:ofSomeProperty <some-property> ;
    ex:hasNumericalValue ?value .

  FILTER (?value > "x" && ?value < "y")
}

Multiple dimensions being mapped above and there could be others. UoM ontologies think all this through...

Regarding GeoSPARQL, I do think that Coordinate Reference Systems should be recorded outside coordinate literals (i.e. a property to indicate WGS84 rather than "crs:wgs84 POLYGON (...)"), but for historical reasons and for the whole non-RDF world, we already have this CRS recording taking place within the literal, so we have to continue to handle it in future GeoSPARQLs.

But for literals in domains for which there aren't strong, widely accepted microformats, we shouldn't build them into RDF/SPARQL. UoM is such a domain: there really isn't a universally accepted set of microformats for them like there is for, say, WKT.

VladimirAlexiev commented 4 years ago

@nicholascar The motivation for having the CRS in wktLiteral is that in this way it's self-describing. A literal that uses easting/northing is quite different from a literal that uses latitude/longitude.

However, I agree with you that keeping the CRS as a separate prop is better. (I thought you're arguing to break the literal down to points).

work out whether measurements across datasets are commensurate

But it's easy to connect UCUM and QUDT!

Split the UCUM quantity into decimal part and unit
- Or we could make it even easier by making the space mandatory, then split on first space
Use the QUDT field "UCUM code" by @dr-shorthair to lookup the QUDT unit
- This won't work for computed units like (1000m/60min) unless someone adds the common ones among them to QUDT as "UCUM alternative codes"
- You'd also have to strip the {...} annotations
- But it will work for the majority of common UCUM codes, which are maybe 10x more than as listed above by @JervenBolleman
Use the structured RDF unit data to your heart's content (for dimensionality analysis, exposing/linking datasets, etc)

We can pack this to a SPARQL function lindt:qudtUnit(cdt:ucum). Why should you have to do such lookup? Because it's a much easier task than doing quantity comparisons and arithmetics!
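The lookup steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the space between value and unit is assumed mandatory, and the three QUDT IRIs are an illustrative hand-written subset that should be checked against QUDT's own UCUM code property.

```python
import re
from typing import Optional, Tuple

# Illustrative subset only (assumption); a real table would be loaded from
# the UCUM codes recorded in QUDT.
UCUM_TO_QUDT = {
    "m": "http://qudt.org/vocab/unit/M",
    "Hz": "http://qudt.org/vocab/unit/HZ",
    "kg": "http://qudt.org/vocab/unit/KiloGM",
}

# UCUM annotations are {...} and cannot be nested.
ANNOTATION = re.compile(r"\{[^{}]*\}")


def split_quantity(literal: str) -> Tuple[str, str]:
    """Split '<number> <unit>' on the first space."""
    value, _, unit = literal.partition(" ")
    if not value or not unit:
        raise ValueError(f"expected '<number> <unit>', got {literal!r}")
    return value, unit


def qudt_unit(literal: str) -> Optional[str]:
    """Return the QUDT unit IRI for a quantity literal, or None if unknown."""
    _, unit = split_quantity(literal)
    bare = ANNOTATION.sub("", unit)  # strip annotations before the lookup
    return UCUM_TO_QUDT.get(bare)
```

Here `qudt_unit("100 Hz{frequency}")` finds the Hz entry after stripping the annotation, and `qudt_unit("5 km/h")` returns `None` because that computed unit is not in the table.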

FILTER (?value > "x" && ?value < "y")

Ah, but here is the key! What would you write for x and y to conform to <some-unit> that you fetched dynamically? Or how would you compare two commensurate quantities that you fetched dynamically?

Please note that the makers of QUDT (@HolgerKnublauch et al) agree the unit should go together with the literal. They only argue it should be in a datatype not in the string (eg "1"^^ucum:m instead of "1m"^^cdt:ucum or the more concrete "1m"^^cdt:length).

This would work just as well for comparison and arithmetic on common units (m, Hz,
- However, it doesn't work well for units that use special chars like mm[Hg] (pressure) or km/h (speed)
- And it doesn't work at all for annotated units or computed units, because UCUM defines an infinite number of units but QUDT (or any RDF) can only have a finite subset
To find the QUDT unit of a quantity, you'd also use a SPARQL function (albeit a standard one):
- bind(datatype(?quantity) as ?dt). ?qudtUnit qudt:datatype ?dt
- which means the fetch cannot use a SPO index (I don't know of any repos to index by datatype).
- This is fine for mapping quantity->QUDT
- It will be too slow for "find me all quantities expressed in meters", but I don't think this is a reasonable query, just like "find me all wktLiterals that have "3.14" as the latitude of their fourth corner"

UoM ontologies think all this through

Except comparisons and arithmetics

UoM is such a domain: there really isn't a universally accepted set of microformats

NLM says you're wrong, see https://github.com/w3c/sparql-12/issues/129#issuecomment-722440526 and eg https://specifications.openehr.org/releases/RM/Release-1.0.3/data_types.html#_dv_quantity_class

It is ironic that there isn't a universally accepted way to express the two parts of a quantity in RDF, as you illustrate above:

ex:hasUnit <some-unit> ; ex:hasNumericalValue ?value .

https://schema.org/QuantitativeValue and https://schema.org/PropertyValue specify something like that, but they are a bit of a mess of compromises:

almost the same but not quite
same unit must be used for minValue, maxValue, value
unitCode doesn't say use URLs but a mess of UNECE Rec2 and CURIE like strings like "qudt:m"

VladimirAlexiev commented 4 years ago

@JervenBolleman

UCUM annotations are easily discarded, but annotations do have semantics. {H.B.}/min is not the same as {drops}/min and just discarding is risky in a clinical setting.

Agreed! They have their meaning, but no computational semantics. The two above are comparable to Hz (with a conversion factor of 1/60), but that doesn't mean it is meaningful to compare them.
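A small Python sketch of that point, assuming a hand-written conversion table (the names `TO_HERTZ`, `strip_annotations` and `to_hertz` are hypothetical and not taken from any UCUM library): once the annotation is discarded, both codes reduce to /min and convert to Hz with the same factor.

```python
import re
from fractions import Fraction

# UCUM annotations are {...} and cannot be nested.
ANNOTATION = re.compile(r"\{[^{}]*\}")

# Conversion factors to hertz (1/s) for the few codes used here; illustrative only.
TO_HERTZ = {"Hz": Fraction(1), "/s": Fraction(1), "/min": Fraction(1, 60)}


def strip_annotations(code: str) -> str:
    """Discard every {...} annotation from a unit code."""
    return ANNOTATION.sub("", code)


def to_hertz(value: int, code: str) -> Fraction:
    """Convert a rate to hertz, ignoring annotations."""
    return Fraction(value) * TO_HERTZ[strip_annotations(code)]
```

Both `to_hertz(72, "{H.B.}/min")` and `to_hertz(72, "{drops}/min")` give `Fraction(6, 5)`, so the two values compare as equal even though comparing heartbeats with drops is not meaningful.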

To quote from https://ucum.org/ucum.html#para-6

Annotations do not contribute to the semantics of the unit but are meaningless by definition. Therefore, any fully conformant parser must discard all annotations. Curly braces are here because people want annotations and deeply believe that they need annotations. Especially in chemistry and biomedical sciences, there are traditional habits to write annotations at units or instead of units, such as “%vol.”, “RBC”, “CFU”, “kg(wet tis.)”, or “mL(total)”. These habits are hard to overcome. Any attempt of a coding scheme to restrict this percieved expressiveness will ultimately result in the coding scheme not being adopted, or just “half-way” adopted (which is as bad as not adopted). Two alternative responses to this reality exist: either give in to the bad habits and blow up of the code with dimension- and meaningless unit atoms, or canalize this habit so that it does no harm. The Unified Code for Units of Measure canalizes this habit using curly braces.

@HolgerKnublauch and @nicholascar Please heed the bold text in prev paragraph. As @dr-shorthair and I explained, UCUM describes infinitely many units.

Please don't try to straitjacket them into finitely many RDF units.
Please take the word of practitioners (eg in life sciences) who've dealt with more actual units in their life than we RDF people have.
(Nicholas, I know you've dealt with plenty of units in geoscience and environmental science, but I think softer sciences like healthcare and clinical science are more complex in that regard)
Or you can add some sort of grammar or constructs to RDF to generate all variations... As @dr-shorthair put it "the datatype reference must implicitly point to a member of an (unbounded) set of codes, composed of every possible 'legal' combination of the UCUM terminals". How would this work?

VladimirAlexiev commented 4 years ago

@kasei I found only this for Perl: http://perl.overmeer.net/geo/html/jump.cgi?Geo_EOP&375 . It's part of http://perl.overmeer.net/geo-eop/source/ (2015-07). http://perl.overmeer.net/geo/#versions claims it's released on CPAN but I can't find it. And it only supports angle, distance.

See https://github.com/lhncbc/ucum-lhc/tree/master/data for implementation resources, in particular ucumDefs.json and ucum-essence.xml

dr-shorthair commented 4 years ago

I suspect we are in almost violent agreement here folks!

There are a few genuine issues, but none that I would die in a ditch over. For example, I could live with either style of encoding. I currently prefer 273.1^^ucum:K because even I can write a simple SPARQL 1.1 query to find all the values with a specific unit, or to find the unit (expressed as a URI) for any value. But if a SPARQL "1.2" function was available to parse strings like "273.1 K" for me, and return the numeric part and the unit, then I could certainly live with that too.

(Best of all if there was also a function to convert scaled values to and from SI values.)

If you got the impression @nicholascar thought the gamut of units was finite, then that is just a miscommunication or misreading at some point. Nick knows a lot more than that. But as you detected, I like UCUM a lot because it supports all the combinations explicitly. And it is the product of long experience in a huge field. I dislike QUDT because the maintainers (currently) store way too many static representations, and I look forward to when it is a dynamic system. I very much like QUDT because it has explicit dimension vectors, and support for (semantic) QuantityKinds. Neither satisfies all requirements.

VladimirAlexiev commented 4 years ago

@JervenBolleman re https://ucum.org/ucum.html#section-Summary-of-Conflicts

I checked the first two:

conflict unit	metric	non-metric
Pa	Pascal - pressure	Peta-annum (peta-years)
Gb	Gilbert – magnetic tension	Giga-barn (action area)

These non-metric units are extremely unlikely, and as it says

there is only a conflict if the metric predicate is violated so that non-metric units are used with a prefix

I checked at https://ucum.nlm.nih.gov/ucum-lhc/demo.html, and this JS library resolves to the metric unit.

I think these few conflicts are not a serious concern.

JervenBolleman commented 4 years ago

@VladimirAlexiev there are more conflicts when using annotations (and that is my clinical experience): same annotation meaning different things in different systems. These are a serious concern for my day job.

Also, UCUM is limited in reach, e.g. it is missing Indian survey feet and many more units known outside of the US. Which is why I think the unit should be a datatype and not encoded in the literal. As we can always mint our own datatypes and be assured of no collisions.

The constraints that UCUM operates under (needs to be a case insensitive string in ASCII 7bit) are IMHO so tight as to make some options very difficult.

Requiring units to be fully computed for comparison to work is common in all other literals. e.g. we don't support FILTER("2+2"^^xsd:integer = 4) and that is natural. I don't see why we should support FILTER("(1000m)/(60min)"^^ucum:... = "1km/hr"^^ucum:... = "0.2777777778"^^ucum:m\/s).

This becomes important when dealing with historical data where measurements have been redefined but the same unit name has been used. This is outside the clinical world but should also be considered as this specification is about more than just clinical use.

I am not convinced that the temporary relief by including UCUM/LINDT into encoded literals for SPARQL engines is the way to go. I think we can get much more value from different approaches to custom datatypes and shareable functions.

Experience with xsd:duration shows that microformats are a real cost to implementers, and xsd:duration is much better supported in the wild across language ecosystems than UCUM is.

dr-shorthair commented 4 years ago

Which is why I think the unit should be a datatype and not encoded in the literal.

Yes - this is a strong argument. It is more immediately extensible.

The constraints that UCUM operates under (needs to be a case insensitive string in ASCII 7bit)

There is a case-sensitive option. I used the case-sensitive version when I added UCUM codes to QUDT.

VladimirAlexiev commented 4 years ago

@JervenBolleman

I looked at one of the Java implementations https://github.com/unitsofmeasurement/uom-systems/ and they mention more UoM systems (although I could not find a "Unicode CLDR Unit System").

Also, in issue https://github.com/unitsofmeasurement/uom-systems/issues/156 they state "UCUM development seems to have stalled since 2017".

Maybe we should read JSR 385. Here's its use cases section. I'm not sure whether it specifies particular UoM systems to support.

found it here: https://unitsofmeasurement.github.io/pages/references.html
JavaDocs https://unitsofmeasurement.github.io/unit-api/site/apidocs/overview-summary.html

we don't support FILTER("2+2"^^xsd:integer = 4) and that is natural

But we support FILTER("2"^^xsd:integer + 2 = 4). What is unnatural is that we don't support "1 m"^^cdt:ucum + "100 cm"^^cdt:ucum or "1"^^ucum:m + "100"^^ucum:cm.

same annotation meaning different things in different systems. These are a serious concern for my day job.

Then work on standardizing annotations. At least UCUM has made space for them, which no other UoM ontology or system has done.

Jerven and @HolgerKnublauch, do we all agree that we must have overloaded operators to handle comparisons of comparable units, and arithmetics (between all kinds of units, and with numbers)?

I'd be just as happy if that's implemented with special literals "1 m"^^cdt:ucum or with special datatypes "1"^^ucum:m, provided those special datatypes can be written in a reasonable way
But you still haven't proposed a good approach of how to map the vast variety of UCUM units into datatypes, especially those including punctuation. Your mapping table above https://github.com/w3c/sparql-12/issues/129#issuecomment-721215957 is just not enough
Extensibility is an important argument, indeed. But do you have a design whereby adding "Indian feet" with its conversion factor in RDF will automatically allow me to use it in comparisons and arithmetics? Will the SPARQL extension functions (operator overriding) read from the local repo?

ericprud commented 4 years ago

@VladimirAlexiev

@JervenBolleman

we don't support FILTER("2+2"^^xsd:integer = 4) and that is natural

But we support FILTER("2"^^xsd:integer + 2 = 4). What is unnatural is that we don't support "1 m"^^cdt:ucum + "100 cm"^^cdt:ucum or "1"^^ucum:m + "100"^^ucum:cm.

I think a more apt analogy for "1"^^ucum:m + "100"^^ucum:cm would be FILTER(2 + 2.0 = 4). The fact that "2"^^xsd:integer parses to the same internal representation as 2 is just a feature of the parser semantics. The ability to add a double and an integer and compare the result to an integer (in fact, the comparison substitutes the double 4.0) is orchestrated by XPath's numeric type promotion and type substitution. Extrapolating that to apply to units would give us that same functionality and some nice unit analysis as a side benefit. I can see a couple of ways to do that:

Canonical units

For every dimension we specify (length, charge, mass...), pick a canonical unit. (MKS would be practical and would add another attractor tugging the US forward to the 18th century.) Enumerate all of the compatible units with linear functions mapping them to the canonical:

ucum:m -> +0, *1 ucum:m
ucum:in -> +0, *.0254 ucum:m
ucum:f -> -32, /1.8 ucum:c

Any evaluation requiring the promotion of the left column to the right column applies the transformation and leaves you with the canonical units. Where the current operator table has entries like

Operator	Type(A)	Type(B)	Function	Result type
A + B	numeric	numeric	op:numeric-add(A, B)	numeric

we could add entries for the dimensions:

Operator	Type(A)	Type(B)	Function	Result type
A + B	length	length	op:numeric-add(A, B)	length

This is cool because the operator table prevents us from adding a length to a time. It's a little funny because everything gets metrified, e.g. BIND("1"^^ucum:ft + "1"^^ucum:in AS ?x) will give you ".3302"^^ucum:m.
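A rough Python sketch of the canonical-units idea, not a proposal for the actual operator table. The `CANONICAL` table and `Quantity` class are hypothetical, the unit names follow the examples in this thread rather than strict UCUM codes, and `Decimal` is used so that the decimal conversion factors stay exact.

```python
from dataclasses import dataclass
from decimal import Decimal

# unit -> (dimension, offset, factor, canonical unit)
# canonical value = (value + offset) * factor
CANONICAL = {
    "m": ("length", Decimal("0"), Decimal("1"), "m"),
    "in": ("length", Decimal("0"), Decimal("0.0254"), "m"),
    "ft": ("length", Decimal("0"), Decimal("0.3048"), "m"),
    "s": ("time", Decimal("0"), Decimal("1"), "s"),
    "min": ("time", Decimal("0"), Decimal("60"), "s"),
}


@dataclass(frozen=True)
class Quantity:
    value: Decimal
    unit: str

    def promote(self) -> "Quantity":
        """Convert to the canonical unit of this quantity's dimension."""
        _, offset, factor, canonical = CANONICAL[self.unit]
        return Quantity((self.value + offset) * factor, canonical)

    def __add__(self, other: "Quantity") -> "Quantity":
        # Plays the role of the operator table: only same-dimension operands match.
        if CANONICAL[self.unit][0] != CANONICAL[other.unit][0]:
            raise TypeError(f"cannot add {self.unit} to {other.unit}")
        a, b = self.promote(), other.promote()
        return Quantity(a.value + b.value, a.unit)
```

Adding `Quantity(Decimal("1"), "ft")` and `Quantity(Decimal("1"), "in")` yields `Quantity(Decimal("0.3302"), "m")`, and adding a length to a time raises `TypeError`.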

Unit ladder

We could ameliorate that a bit by grouping entries in the type promotion hierarchy so that known imperial units stay imperial and get promoted to the smallest imperial unit, so BIND("1"^^ucum:ft + "1"^^ucum:in AS ?x) will give you "13"^^ucum:in. Things that don't fit into one of those groups would still get metrified (yes, I made that word up), e.g. BIND("1"^^ucum:lightyear + "1"^^ucum:parsec AS ?x) will give you "4.0318165349E16"^^ucum:m.
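One way the ladder could be sketched in Python, for lengths only. The `LADDERS` grouping, the `TO_METRE` table and both function names are assumptions made for illustration; a real design would need ladders per dimension and per unit system.

```python
from decimal import Decimal
from typing import Tuple

# Factor to metres for each unit (illustrative subset).
TO_METRE = {
    "m": Decimal("1"),
    "cm": Decimal("0.01"),
    "in": Decimal("0.0254"),
    "ft": Decimal("0.3048"),
}

# Groups of units that should stay together, each ordered smallest first.
LADDERS = [["in", "ft"]]


def result_unit(unit_a: str, unit_b: str) -> str:
    """Pick the smallest unit of a shared ladder, else fall back to metres."""
    for ladder in LADDERS:
        if unit_a in ladder and unit_b in ladder:
            return ladder[min(ladder.index(unit_a), ladder.index(unit_b))]
    return "m"


def add_lengths(value_a: str, unit_a: str, value_b: str, unit_b: str) -> Tuple[Decimal, str]:
    """Add two lengths and express the sum in the unit chosen by the ladder."""
    target = result_unit(unit_a, unit_b)
    metres = Decimal(value_a) * TO_METRE[unit_a] + Decimal(value_b) * TO_METRE[unit_b]
    return metres / TO_METRE[target], target
```

`add_lengths("1", "ft", "1", "in")` stays imperial and gives 13 inches, while `add_lengths("1", "ft", "1", "cm")` has no shared ladder and is metrified to 0.3148 m.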

P.S.

It would be lovely to extend the grammar so we could write 1ft instead of "1"^^ucum:foot (which, as a parser feature, is orthogonal to the "1"^^ucum:foot vs. "1ft"^^ucum:length debate). I guess feasibility comes down to how crazy the lexical strings for the units are.

sa-bpelakh commented 4 years ago

@VladimirAlexiev

Canonical units

I like the design for canonical units, and the implementation is well defined. I definitely prefer "1"^^ucum:foot instead of "1ft"^^ucum:length, because the unit implies the dimension, and avoids a micro-grammar in the literal value.

I think the complexity of the unit ladder could be avoided if you allow casting conversions, e.g. bind(ucum:foot(?a + ?b + ?c) as ?length_in_feet), to guarantee a specific unit (and do dimension checking in the process).

kasei commented 4 years ago

@ericprud

It's a little funny because everything gets metrified, e.g. BIND("1"^^ucum:ft + "1"^^ucum:in AS ?x) will give you ".3302"^^ucum:m.

I would think this could be handled just like the XPath constructor functions:

ucum:in("1"^^ucum:ft + "1"^^ucum:in) => "13"^^ucum:in

(Though there might be some funny floating point error issues to consider.)
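A hedged Python sketch of such a constructor-style cast, for lengths only. The `cast` function and `TO_METRE` table are hypothetical; `Decimal` is used because binary floats may introduce the kind of rounding error mentioned above, whereas these decimal factors divide exactly.

```python
from decimal import Decimal
from typing import Tuple

# Factor to metres for each supported length unit (illustrative subset).
TO_METRE = {
    "m": Decimal("1"),
    "in": Decimal("0.0254"),
    "ft": Decimal("0.3048"),
}


def cast(target_unit: str, value: str, unit: str) -> Tuple[Decimal, str]:
    """Convert (value, unit) to target_unit, in the spirit of an XPath constructor."""
    if unit not in TO_METRE or target_unit not in TO_METRE:
        # Stands in for the dimension check: only known lengths are accepted.
        raise TypeError(f"cannot cast {unit} to {target_unit}")
    metres = Decimal(value) * TO_METRE[unit]
    return metres / TO_METRE[target_unit], target_unit
```

`cast("in", "0.3302", "m")` returns 13 inches, matching the `ucum:in(...)` example above, and an unknown unit such as `"s"` is rejected with `TypeError`.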

w3c / sparql-dev

LINDT units of measure #129
