w3c / sparql-dev

SPARQL dev Community Group
https://w3c.github.io/sparql-dev/

LINDT units of measure #129

Open VladimirAlexiev opened 3 years ago

VladimirAlexiev commented 3 years ago

Why?

It's hard to work with quantities (value + UoM) in RDF and SPARQL.

There are about 10 UoM ontologies:

Working with units in SPARQL is quite hard. Comparing compatible units or doing arithmetic on quantities is possible if you are working with one of the better ontologies, but it is difficult: you have to fetch the dimension vectors and conversion factors and work with them, and the queries become very complex.

SHACL's modest arithmetic capabilities (e.g. minInclusive to compare to a constant, lessThan to compare two props) borrow from SPARQL, so it's impossible to state "temperature should be between 0 and 10 degC"; see https://lists.w3.org/Archives/Public/public-shacl/2020Nov/0001.html

But there is one approach that solves these problems.

Previous work

LINDT is unique in that it encodes both value and unit in one literal, e.g. "1 m"^^cdt:ucum, "100 cm"^^cdt:ucum. This is economical, but more importantly you can compare such quantities, and you can also do arithmetic operations on them.
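
To make the idea concrete, here is a minimal Python sketch of what comparing and adding such literals involves. It is not LINDT's actual implementation: the `TO_METRE` table and the helpers `parse_quantity`, `to_metres`, `same_quantity` and `add` are hypothetical names, and only three length units are covered.

```python
from fractions import Fraction

# Hypothetical conversion table: factor from each unit to metres.
# A real UCUM processor covers far more units and dimensions.
TO_METRE = {"m": Fraction(1), "cm": Fraction(1, 100), "km": Fraction(1000)}


def parse_quantity(lexical: str) -> tuple[Fraction, str]:
    """Split a lexical form such as "100 cm" into (value, unit)."""
    value, unit = lexical.split(maxsplit=1)
    return Fraction(value), unit


def to_metres(lexical: str) -> Fraction:
    """Normalize a length literal to an exact number of metres."""
    value, unit = parse_quantity(lexical)
    return value * TO_METRE[unit]


def same_quantity(a: str, b: str) -> bool:
    """Two literals are equal if they denote the same length."""
    return to_metres(a) == to_metres(b)


def add(a: str, b: str) -> tuple[Fraction, str]:
    """Add two length literals, expressing the result in the unit of `a`."""
    _, unit = parse_quantity(a)
    total = (to_metres(a) + to_metres(b)) / TO_METRE[unit]
    return total, unit
```

With this sketch, `same_quantity("1 m", "100 cm")` holds and `add("1 m", "100 cm")` yields 2 in the unit `m`. Exact fractions are used instead of floats so that equality checks are not disturbed by rounding.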

This would be very useful for any sort of application in engineering, smart cities, semantic sensor networks, WoT, etc.

Features https://ci.mines-stetienne.fr/lindt/v2/custom_datatypes.html#on-apache-jena

LINDT is very ingenious and it's a pity that it hasn't found a wider following.

Proposed solution

Adopt LINDT as a best practice for representing units. Work with other communities (WoT, semantic sensors) to also adopt it.

Considerations for backward compatibility

No direct consequences because it uses custom datatype handlers to do its work. I.e. if you don't use the CDT datatypes (cdt:ucum, cdt:length, etc) you'll see no difference.

However, guidance and solution templates for migrating from other systems for representing units should be provided.

ericprud commented 3 years ago

@kasei , that makes sense to me. I think your example converts 1ft and 1in both to meters and then back to inches. If you knew the types (e.g. they weren't plucked from some heterogeneous attribute in the data), you could avoid that by narrowing the scope of the cast:

ucum:in("1"^^ucum:ft) + "1"^^ucum:in => "13"^^ucum:in

I mention this because an alternative would be that casting functions override the type promotion of their arguments by applying the cast to each of the contained atoms. This sounds terribly contrived but useful enough to have a moment of collective consideration.

dr-shorthair commented 3 years ago

All the libraries or catalogues that I've looked at record conversion factors to SI. So any comparison of non-SI scaled quantities would necessarily trip through a conversion to SI.
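
A minimal Python sketch of that round trip through SI, assuming a hypothetical `TO_SI` table keyed by the UCUM codes `[ft_i]` and `[in_i]` (international foot and inch, whose metre equivalents are exact by definition). The `convert` helper is illustrative and not taken from any of the libraries mentioned.

```python
from fractions import Fraction

# Factors to the SI base unit (metre). Hypothetical table for illustration;
# 0.3048 m per foot and 0.0254 m per inch are exact by definition.
TO_SI = {
    "m": Fraction(1),
    "[ft_i]": Fraction("0.3048"),
    "[in_i]": Fraction("0.0254"),
}


def convert(value, from_unit: str, to_unit: str) -> Fraction:
    """Convert by going from `from_unit` to SI and then from SI to `to_unit`."""
    si_value = Fraction(value) * TO_SI[from_unit]
    return si_value / TO_SI[to_unit]


# The earlier cast example: 1 ft expressed in inches, plus 1 in.
total_inches = convert(1, "[ft_i]", "[in_i]") + 1
```

Because the factors are kept as exact fractions, the trip through metres loses nothing here and `total_inches` equals 13; with floating-point factors the same route could introduce rounding error.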

ericprud commented 3 years ago

I think we can drop this notion of having a clairvoyant cast function that operates over the operands of any nested operators.

VladimirAlexiev commented 3 years ago

@ericprud and @sa-bpelakh

Several people have proposed to use per-kind units like "1"^^ucum:m instead of LINDT's approach, e.g. "1 m"^^cdt:ucum. But nobody has yet proposed how to handle the variety of "crazy" units.

dr-shorthair commented 3 years ago

I understand your concern about URL escapes. The following 'reserved' characters may appear in UCUM codes:

* ' ( ) + / [ ]

Of these, [ ] are commonly used and can't be easily worked around. ' appears in the codes for minutes and seconds, and in some qualified units like [in_i'H2O]. Parentheses ( ) can be used to group codes 'under the solidus' /, both of which can be avoided by using dots and negative exponents. + is only necessary for some power-of-ten factors.

I don't believe { } are reserved.
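
For a feel of what the escapes look like, the following sketch percent-encodes a few UCUM codes with Python's standard `urllib.parse.quote`. The `escape_ucum` wrapper is a hypothetical helper, and passing `safe=""` (escape everything except unreserved characters) is an assumption about how strict one wants to be.

```python
from urllib.parse import quote


def escape_ucum(code: str) -> str:
    """Percent-encode a UCUM code for use in a URL path segment or local name."""
    # safe="" means "/" is escaped too; letters, digits and "_.-~" stay as-is.
    return quote(code, safe="")


bracketed = escape_ucum("[in_i'H2O]")   # brackets and apostrophe get escaped
with_solidus = escape_ucum("m/s2")      # the solidus gets escaped
with_dots = escape_ucum("m.s-2")        # dots and negative exponents survive
```

Here `bracketed` is `%5Bin_i%27H2O%5D`, `with_solidus` is `m%2Fs2`, and `with_dots` stays `m.s-2`, which illustrates why the dot and negative-exponent spelling avoids escapes while square brackets cannot be worked around.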

I think QUDT is a separate issue at this point. Yes, it may be useful as it provides an RDF-based model for describing units. But I would expect that it would be invoked through a call like give me the QUDT description of the UOM with the UCUM symbol AAAaaaAAA or similar. The UCUM symbol is the key.

HolgerKnublauch commented 3 years ago

Vladimir, I still don't have a strong opinion against string-encoding. And I do agree that this flexible string encoding has some advantages, because it is more open-ended than having URIs and, as you point out, URL escapes can be ugly.

I do wonder though whether those complex compound units are important enough and whether they should dictate how the rest of the solution should work. Arguably the vast majority of use cases will be covered by a static set of predictable and well-established URIs for the commonly used units. Much will be gained if there is at least a solution for those. As long as there is a generic machinery to get from a Unit URI to the base units, conversion factors etc, even a URI mechanism would cover the more unusual cases. If units are URIs then these resources can hold additional metadata for this effect.

maximelefrancois86 commented 3 years ago

Dear all,

The BIPM (Bureau International des Poids et des Mesures - the intergovernmental organization through which Member States act together on matters related to measurement science and measurement standards) is organizing an on-line workshop Feb. 22-26 2021: The International System of Units (SI) in FAIR digital data

https://www.bipm.org/en/conference-centre/bipm-workshops/digital-si/

See a Draft - Grand Vision: Transforming the International System of Units for a Digital World

I was invited to present there. I aim to summarize the different approaches that have been discussed in the W3C groups I was involved in, and other approaches I am aware of in the SemWeb community, with the identified pros and cons.

You are welcome to attend this workshop too, the pre-registration form is here: https://form.jotform.com/BIPM/Workshop-SI-2021

VladimirAlexiev commented 3 years ago

@HolgerKnublauch and @maximelefrancois86 and @dr-shorthair I think we need both belt and suspenders:

BTW I'm now dealing with IEC and eClass units

dr-shorthair commented 3 years ago

QUDT has links to some of these

Note that QUDT is now quite responsive to requests and bug reports, information supplementation etc. Log an issue here - https://github.com/qudt/qudt-public-repo/issues Better still: fork and make a PR.

maximelefrancois86 commented 3 years ago

As a matter of fact, theoretically in https://ucum.org/ucum.html

From 2.1§3■1 UCUM atom characters are in the ASCII range 33-126, minus a few characters. The following UCUM atom characters are forbidden in IRIs: <>|^`\ or need to be escaped in IRI local names: ~!$&'*,;?#@%_

From 2.1§6■1 UCUM characters for annotation { } are forbidden characters for IRIs

From 2.1§7■1 characters for operators . / need to be escaped in IRI local names

So encoding UCUM units in datatype IRIs, one would end up:

nichtich commented 1 month ago

I can assure you that real-world RDF data with units of measure comes in at least these forms (with varying namespaces and ontologies):

@prefix cdt: <https://w3id.org/cdt/> .
@prefix om: <http://www.ontology-of-units-of-measure.org/resource/om-2/> .
@prefix my: <http://example.org/my#> . # placeholder namespace, assumed for this example

# 1. Custom plain string without any reference (most common, good luck)
_:x my:weight "10 KiloGram" .

# 2. Reference to a standard notation such as UCUM (better)
_:x my:weight "10 kg"^^cdt:ucum .

# 3. Value and data type from some standard vocabulary, e.g. OM (UCUM in RDF)
_:x my:weight "10"^^om:kilogram .

# 4. Measurement node with some custom or standard vocabulary
_:x my:weight [
  my:value 10 ;
  my:unit om:kilogram
] .

The last form has several variants. Here is an actual example from practice using the CRM ontology (slightly simplified, it's even more complex!):

@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .

_:m crm:P39_measured _:x ;
  a crm:E16_Measurement ;
  crm:P40_observed_dimension [
    a crm:E54_Dimension ;
    crm:P90_has_value 2.8 ;
    crm:P91_has_unit [ # this would map to an existing unit URI such as om:centimetre
      a crm:E58_Measurement_Unit ;
      crm:P3_has_note "cm"
    ]
  ] ;
  crm:P2_has_type [ # this would need another vocabulary with a definition of "height"
     crm:P3_has_note "Höhe"
  ] .

To handle and clean up these ways of modeling data with units of measure, I'd stick to:

  1. a standard to write down measures in string form and a corresponding RDF data type: UCUM and cdt:ucum look good!

  2. URIs for units of measurement such as kg, cm...: some have already been proposed and people will not stop creating new URIs and their own ontologies and lists for units with their own use cases. Any approach to collect all units in one single ontology is futile.

  3. An ontology to link units of measurement, e.g. to state that a unit my:RomanMile is 5000 times another unit my:RomanFeet. SPARQL does not need to know about actual units, just about how to process their conversion factors.
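
Point 3 can be sketched in Python as follows: the processor only sees conversion-factor links between opaque unit names and chains them. The `FACTS` list, `build_graph` and `conversion_factor` are hypothetical names; my:RomanMile and my:RomanFeet come from the text above, while my:RomanPace is an extra unit assumed here only to show chaining.

```python
from collections import deque
from fractions import Fraction

# (unit, factor, other) means: 1 unit = factor * other
FACTS = [
    ("my:RomanMile", Fraction(5000), "my:RomanFeet"),  # from the text above
    ("my:RomanPace", Fraction(5), "my:RomanFeet"),     # assumed for illustration
]


def build_graph(facts):
    """Index each link in both directions so factors can be chained either way."""
    graph = {}
    for unit, factor, other in facts:
        graph.setdefault(unit, []).append((other, factor))
        graph.setdefault(other, []).append((unit, 1 / factor))
    return graph


def conversion_factor(graph, source, target):
    """Return f such that 1 source = f * target, or None if the units are not linked."""
    queue = deque([(source, Fraction(1))])
    seen = {source}
    while queue:
        unit, factor = queue.popleft()
        if unit == target:
            return factor
        for neighbour, step in graph.get(unit, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, factor * step))
    return None


GRAPH = build_graph(FACTS)
mile_in_paces = conversion_factor(GRAPH, "my:RomanMile", "my:RomanPace")
```

The breadth-first search never needs to know what a mile or a pace is; it multiplies factors along the path (5000 feet per mile, then 1/5 pace per foot), so `mile_in_paces` is 1000. Units with no linking path return `None` rather than a guess.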

dr-shorthair commented 1 month ago

IMAO we should encourage use of pattern 3. as it provides the required information in the most usable form

  3. Value and data type from some standard vocabulary, e.g. OM (UCUM in RDF): _:x my:weight "10"^^om:kilogram .

Unlike patterns 1. and 2. this does not use a microformat in which a literal must be parsed and broken up into multiple items. Pattern 3. can be processed by un-modified and unsupplemented RDF libraries.

And unlike pattern 4. it does not bury a scalar inside a data structure.

Yes, pattern 3. hands off interpretation of the scale to another service, but all the proposed options appear to do that anyway.

kasei commented 1 month ago

IMAO we should encourage use of pattern 3. as it provides the required information in the most usable form

I think "most usable" is going to be use-case dependent here. The CRM modeling is the way it is for reasons important to cultural heritage use-cases. The very verbose modeling here stems mostly from using an upper ontology that can be used to address diverse use-cases (e.g. the units and/or type of value such as "weight of 10kg" are not fixed or prescribed by the ontology), and allows metadata to be added to almost any part of the data (e.g. provenance data that preserves the exact lexical form of the value that might differ from a normalized numeric value; or adding a citation to exactly where a dimension value came from). FWIW, RDF 1.2 (RDF-star) may provide some new options to address these modeling needs.

Additionally, the CRM modeling has the advantage that it actually uses numeric values that will sort naturally in SPARQL (and use optimized storage and retrieval in many systems) without any runtime casting or conversion. Encouraging best practices can be good, but to maintain these benefits you'd have to go beyond best practices and ensure LINDT datatypes were officially supported by SPARQL and underlying stores.

TallTed commented 1 month ago

Additionally, the CRM modeling has the advantage that it actually uses numeric values that will sort naturally in SPARQL (and use optimized storage and retrieval in many systems) without any runtime casting or conversion.

Of course, numeric values for mass of 1 kg, mass of 0.997 kg, and mass of 999 g, all of which are valid, will not sort as desired, unless all mass values are converted (or forced) to kg or g.
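
A small Python sketch of this sorting problem, using the three masses above. The `TO_KG` table and the variable names are assumptions made for the illustration.

```python
from fractions import Fraction

# Hypothetical factors to kilograms for the two units in the example.
TO_KG = {"kg": Fraction(1), "g": Fraction(1, 1000)}

# (numeric value, unit) pairs as they might be stored in separate properties.
masses = [("1", "kg"), ("0.997", "kg"), ("999", "g")]

# Sorting on the bare number ignores the unit entirely.
by_number = sorted(masses, key=lambda m: Fraction(m[0]))

# Sorting after normalizing every value to kilograms gives the physical order.
by_mass = sorted(masses, key=lambda m: Fraction(m[0]) * TO_KG[m[1]])
```

`by_number` puts 999 g last because 999 is the largest bare number, whereas `by_mass` orders the values as 0.997 kg, 999 g, 1 kg.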

(quibble: 10kg is a measure of mass, not weight, and is the same for the same object whether it's measured on Earth or the Moon. 10lbs is a measure of weight, not mass, and differs for the same object depending on whether it's measured on Earth or the Moon.)

ericprud commented 1 month ago

  • UCUM defines a countably infinite list of units. Any RDF approach is necessarily finite.

I don't know that it does have to be finite. What happens if we take UCUM verbatim and simply accept that there can be an infinite expression of datatypes, just as there can be an infinite expression of the values that they describe?

As a thought experiment, a more self-describing "datatype namespace" could define something like (borrowing from @TallTed's quibble):

# for some reason lbf is tied to Avoirdupois. whatever
"10"^^kind_n_type:massXdistanceYtimeYtime_lbf-av

kasei commented 1 month ago

Of course, numeric values for mass of 1 kg, mass of 0.997 kg, and mass of 999 g, all of which are valid, will not sort as desired, unless all mass values are converted (or forced) to kg or g.

Right. In the CIDOC case, you'd likely be restricting the query to a specific unit in the graph pattern, or be casting values with arbitrary units to a known unit via a SPARQL extension function (or client-side, which has its own set of challenges). I think that's somewhat orthogonal to the storage-level advantages of having real numeric types, but again this might be use-case dependent. FWIW, I think the Wikidata modeling has some similarities here, in that you can restrict to known units in the graph pattern by using the psn predicates for normalized values, and then on to a real quantityAmount numeric value.

nichtich commented 1 month ago

@kasei thanks for mentioning Wikidata. Its model of units of measure is documented with SPARQL queries here. The list of supported quantities is configured in a table, but this table could be given in RDF with a (hopefully simpler) subset of the QUDT Units Vocabulary.