w3c / sparql-dev

SPARQL dev Community Group
https://w3c.github.io/sparql-dev/

LINDT units of measure #129

Open VladimirAlexiev opened 4 years ago

VladimirAlexiev commented 4 years ago

Why?

It's hard to work with quantities (value + UoM) in RDF and SPARQL.

There are about 10 UoM ontologies:

Working with units in SPARQL is quite hard. Comparing compatible units or doing arithmetics on units is possible if you are working with one of the better ontologies, but difficult. You have to fetch the dimension vectors and conversion factors and work with them, and the queries become very complex.

SHACL's modest arithmetic capabilities (eg minInclusive to compare to constant, lessThan to compare two props) borrow from SPARQL, so it's impossible to state "temperature should be between 0 and 10 degC", see https://lists.w3.org/Archives/Public/public-shacl/2020Nov/0001.html

But there is one approach that solves these problems.

Previous work

LINDT is unique in that it encodes both value and unit in one literal, eg "1 m"^^cdt:ucum, "100 cm"^^cdt:ucum. This is economical, but more importantly you can compare such quantities, and you can also do arithmetic operations on quantities.

This would be very useful for any sort of application in engineering, smart cities, semantic sensor networks, WoT, etc.

Features https://ci.mines-stetienne.fr/lindt/v2/custom_datatypes.html#on-apache-jena

LINDT is very ingenious and it's a pity that it hasn't found a wider following.

Proposed solution

Adopt LINDT as a best practice for representing units. Work with other communities (WoT, semantic sensors) to also adopt it.

Considerations for backward compatibility

No direct consequences because it uses custom datatype handlers to do its work. I.e. if you don't use the CDT datatypes (cdt:ucum, cdt:length, etc) you'll see no difference.

However, guidance and solution templates for migrating from other systems for representing units should be provided

JervenBolleman commented 4 years ago

I have been thinking about this exact thing. However, I was thinking that all of UCUM might be too much work to demand, and it is currently awkward licensing-wise (section 7 of the license). So I was wondering whether the 7 base units of the International System of Units plus the 21 coherent derived named units might be a sufficient lower bound for implementation.

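| unit | proposed datatype | example of idea |
|---|---|---|
| second | unit:s | 60^^unit:s |
| meter | unit:m | 1.99^^unit:m |
| kilogram | unit:kg | 88^^unit:kg |
| ampere | unit:A | |
| kelvin | unit:K | 273.1^^unit:K |
| mole | unit:mol | |
| candela | unit:cd | |
| hertz | unit:Hz | |
| radian | unit:rad | |
| steradian | unit:sr | |
| newton | unit:N | |
| pascal | unit:Pa | |
| joule | unit:J | |
| watt | unit:W | |
| coulomb | unit:C | |
| volt | unit:V | |
| farad | unit:F | |
| ohm | unit:Ω | |
| siemens | unit:S | |
| weber | unit:Wb | |
| tesla | unit:T | |
| henry | unit:H | |
| lumen | unit:lm | |
| lux | unit:lx | |
| becquerel | unit:Bq | |
| gray | unit:Gy | |
| sievert | unit:Sv | |
| katal | unit:kat | |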

The implementation advantage is that these are simple numeric values: as long as the datatype is the same, cast to decimal and compare.

There are no prefixes; all values must be converted to the base unit, e.g. 600 km should be stored as "600000"^^unit:m. Derived units must always be in coherent form for this to work (i.e. also without prefixes). The advantage of scaling down to base units is simpler comparison functions and easier index generation.

The list misses Celsius, which is trivial to add, but would need to be comparable to Kelvin. Also, the ohm symbol Ω is outside of ASCII, so unit:Ohm could be an option.

Basically, the cost is in converting to base units at storage. However, consistent storage makes it easier to build indexes and ensure query results are correct.

This of course leaves out all the derived units, e.g. kg/m, m/s^2 etc., which I think would be very worthwhile to have but might be too much work to implement for all commonly used ones. Unless they follow a straightforward pattern in IRI encoding that can be decoded and generated by stores on the fly, e.g. optionally, division "1"^^unit:kg / "1"^^unit:m => "1"^^unit:kg-per-m and multiplication "2"^^unit:m * "2"^^unit:m => "4"^^unit:m2
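One way to picture the "decoded and generated on the fly" pattern is the minimal Python sketch below. It tracks a unit as a map from symbol to exponent and derives a datatype local name from that map. The naming scheme (`-per-` and trailing exponent digits) follows the two examples above; the function names and the sorting of symbols are assumptions made for illustration, not an existing store API.

```python
def multiply(a: dict, b: dict) -> dict:
    """Multiply two units by adding exponents; drop symbols that cancel out."""
    out = dict(a)
    for symbol, exp in b.items():
        out[symbol] = out.get(symbol, 0) + exp
    return {s: e for s, e in out.items() if e != 0}


def divide(a: dict, b: dict) -> dict:
    """Divide two units by subtracting the divisor's exponents."""
    return multiply(a, {s: -e for s, e in b.items()})


def local_name(unit: dict) -> str:
    """Build a datatype local name such as 'kg-per-m' or 'm2' (hypothetical scheme)."""
    def fmt(symbol, exp):
        return symbol if exp == 1 else f"{symbol}{exp}"

    numerator = [fmt(s, e) for s, e in sorted(unit.items()) if e > 0]
    denominator = [fmt(s, -e) for s, e in sorted(unit.items()) if e < 0]
    name = "-".join(numerator) or "1"
    return name + ("-per-" + "-".join(denominator) if denominator else "")


kg, m = {"kg": 1}, {"m": 1}
print(local_name(divide(kg, m)))   # kg-per-m
print(local_name(multiply(m, m)))  # m2
```

Because the name is computed from the exponent map, a store would not need a static list of derived datatypes, only an agreed rule for encoding and decoding the local name.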

Also, the use of datatypes avoids one of the issues with UCUM, which is that conflicts in coding exist and that coding for customary units is not straightforward. See Fluid Ounce, where a datatype pointing to the fluid ounce definition would be a clearer option, especially as the fluid ounce has been redefined several times for legal purposes. It's 30 ml or 23 1/3 grams of pure alcohol in some US food standards (TODO: find again an article showing the many redefinitions of US fluid ounce). UCUM is widely specified in clinical settings but in reality not always used (even where it was specified).

Side notes

VladimirAlexiev commented 4 years ago

@JervenBolleman what licensing problems do you see?

UCUM has customary units that are important in many disciplines. If you ask users to always store angstroms and light-years as meters, you are shifting burden to them.

And even "decorative" units like 123 {rbc} which is a count (dimensionless) but of "red blood cells".

LINDT is implemented in Jena, and @jeenbroekstra says it won't be too hard to port to rdf4j. So "downgrading" to your proposal will be more work for Java devs.

So that leaves other languages. The question is whether there are UCUM libraries in other languages.

HolgerKnublauch commented 4 years ago

I much prefer using explicit datatypes to encoding the unit into the string. E.g. "1"^^unit:m is better than "1 m"^^ucum:unit.

But a more general solution might be to offer a declarative extension point so that anyone can define custom datatypes and those datatypes actually can be used consistently. This could work similar to user-defined SHACL constraint components or SHACL-AF functions. Just some very quick thoughts, a datatype might need to be able to respond to questions "can I compare my value to another datatype" (e.g. yes for mm to m comparison), and then a normalize function that would bring all datatypes from a group to a common base unit, e.g. meter. Then things like < comparison in SPARQL can be automated. The actual business logic can probably be covered declaratively through a couple of properties that are attached to the units as done (comprehensively) in the QUDT vocabulary.
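The two building blocks mentioned above ("can I compare" and "normalize") can be sketched in a few lines of Python. The registry below stands in for declarative properties attached to datatypes; its contents, the `unit:` names and the function names are all assumptions for illustration and are not taken from QUDT, SHACL or any SPARQL implementation.

```python
from decimal import Decimal

# Hypothetical declarations: datatype -> (base datatype, multiplier to the base unit).
DATATYPES = {
    "unit:m": ("unit:m", Decimal("1")),
    "unit:mm": ("unit:m", Decimal("0.001")),
    "unit:s": ("unit:s", Decimal("1")),
}


def can_compare(dt1: str, dt2: str) -> bool:
    """Two datatypes are comparable when they normalize to the same base datatype."""
    return DATATYPES[dt1][0] == DATATYPES[dt2][0]


def normalize(lexical: str, datatype: str):
    """Return the value scaled to the base unit, together with the base datatype."""
    base, factor = DATATYPES[datatype]
    return Decimal(lexical) * factor, base


def less_than(a, b) -> bool:
    """Generic '<' over (lexical form, datatype) pairs, driven only by the registry."""
    if not can_compare(a[1], b[1]):
        raise TypeError(f"cannot compare {a[1]} with {b[1]}")
    return normalize(*a)[0] < normalize(*b)[0]


print(less_than(("999", "unit:mm"), ("1", "unit:m")))  # True
```

The comparison logic never mentions a specific unit, which is the point of the proposal: an engine implements the generic part once and the catalog of datatypes stays outside the spec.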

The advantage here is that a SPARQL 1.2 would only need to implement a few generic building blocks while the details of the specific datatypes are irrelevant, and we don't even need to discuss the specific catalog of datatypes that need to be implemented.

VladimirAlexiev commented 4 years ago

Connecting UCUM unit symbols to ontologies like QUDT or OM is important and useful because they expose as triples info that is within the UCUM library (eg the dimension vector of Newton and the conversion factors of Fahrenheit). Plus extra info, eg that Inch is an imperial unit, grouping of units by discipline, etc. I believe QUDT or OM already has UCUM codes, so that should not be hard.

So we could spec custom functions to parse out a unit from a quantity, and connect to a structured unit node in such an ontology.

VladimirAlexiev commented 4 years ago

@HolgerKnublauch

"1"^^unit:m is better than "1 m"^^ucum:unit

In addition to the generic cdt:ucum, LINDT also has datatypes cdt:length, cdt:mass etc that represent quantities with fixed/known dimension.

But I see some problems with having distinct datatypes for each unit:

offer a declarative extension point

LINDT does that:


BTW @maximelefrancois86 there are Broken links at https://ci.mines-stetienne.fr/lindt/v1/custom_datatypes:

In contrast, both of https://ci.mines-stetienne.fr/lindt/v1/custom_datatypes and https://ci.mines-stetienne.fr/lindt/v3/custom_datatypes exist.

namedgraph commented 4 years ago

"(1000m)/(60min)"^^cdt:ucumunit -- this is simply not a structured way of describing units, which goes against the RDF practice.

VladimirAlexiev commented 4 years ago

@namedgraph This unit is well structured and well defined according to https://ucum.org/ucum.html (and implemented in Java UCUM and consequently in LINDT). It's just not structured in RDF.

NOT everything needs or should be structured in RDF. Eg are you against GeoSPARQL literals (WKT and GML)?

maximelefrancois86 commented 4 years ago

Dear all,

@namedgraph, there is sometimes a good rationale for encoding complex values using literals instead of relying on RDF structures and basic datatypes. The OGC GeoSPARQL datatype geo:WKTLiteral is a great example.

"<http://www.opengis.net/def/crs/OGC/1.3/CRS84> Polygon((-83.6 34.1, -83.6 34.5, -83.2 34.5, -83.2 34.1, -83.6 34.1))"^^geo:WKTLiteral

I back up the thoughts of @VladimirAlexiev :

namedgraph commented 4 years ago

OK fine, WKT literals are a counter-example. But GeoSPARQL is an additional standard, not part of the SPARQL spec. And you seem to have done the same with units. So what's the problem? Why does it need to be in SPARQL 1.2 proper?

maximelefrancois86 commented 4 years ago

I am neutral on this. I also think it would be fine to have such a datatype specified in a separate document.

VladimirAlexiev commented 4 years ago

I don't see this becoming part of SPARQL 1.2. As I said "Adopt LINDT as a best practice" and work with other communities to adopt it.

@maximelefrancois86

maximelefrancois86 commented 4 years ago

Thank you @VladimirAlexiev , we are on the same page.

For complex numbers: we just had a very good first-year Master's student who worked on this during her 3-month internship this year: Yana Soares de Paula. https://www.linkedin.com/in/yanaspaula/ She did an excellent job in just three months, but more work would be needed to augment cdt:ucum with complex numbers. She would probably be happy to share her report with you if you wish.

About the broken links in lindt v1, I'll create an issue and check asap. Thanks for the notice.

VladimirAlexiev commented 4 years ago

There exist implementations of UCUM in common programming languages

Eg there seem to be 2 for JS:

It appears UCUM is the dominant UoM system in life sciences.

maximelefrancois86 commented 4 years ago

See also

Maybe there are more

dr-shorthair commented 4 years ago

Connecting UCUM unit symbols to ontologies like QUDT

@VladimirAlexiev Most QUDT units now have UCUM codes in their description, so correlation of these two systems is already available. (I did this work in the last few months.) The ones that are missing do not have equivalent UCUM codes, so there is nothing on that side to correlate with.

QUDT gives you explicit dimension-vectors, and conversion factors (and offsets, where appropriate).

I'm also on the UCUM Advisory Board, and the licensing issue is high on the agenda. Though the current UCUM Terms of Use look a bit fierce at first glance, I have been assured that the kind of usage that is envisaged here is totally fine, and the intention is to make this more clear in the license.

dr-shorthair commented 4 years ago

On the matter of style: I vote with @HolgerKnublauch in favour of

273^^ucum:K

compared with

"273 K"^^cdt:ucum

It does not require a string to be parsed, so basic SPARQL queries can be used, detecting the datatype, but without regexing strings.

dr-shorthair commented 4 years ago

@JervenBolleman I think your table matches the UCUM codes, except for Ohm for Ω (note case). That is no surprise as UCUM was designed to use the common codes as far as possible.

This XML representation is the reference for the UCUM terminals.

sa-bpelakh commented 4 years ago

It does seem that the best way to adhere to standard and industry-wide use, while avoiding mixing domain-specific aspects into the generic SPARQL standard, is for this to be an auxiliary standard, like GeoSPARQL. The complex cdt:ucum literals are quite similar to WKT in this aspect.

dr-shorthair commented 4 years ago

Scaling factors for scalar quantities are not domain specific at all.

In fact I'd argue it is a notable failure of almost all computer languages that this is not built-in. There are very few pure 'floating point' numbers, or 'decimals' that can be understood without knowing the unit-of-measure.

I'm totally fine with embedding coordinate sequences in a microformat, since they have no meaning considered independently. I was in the team that standardized GeoSPARQL and am very comfortable with the design choice. But scalar quantities are a very different matter, and much more simple.

HolgerKnublauch commented 4 years ago

I also think that units are a different topic than GeoSPARQL. Users should expect to perform comparisons using the built-in < and > operators, and possibly to do arithmetics such as + and * on unit'ed (is this a word?) values. This might of course just become a matter of enough implementations agreeing on a de-facto standard, but it shouldn't be too hard to agree on a mechanism at least for the most common units in a SPARQL 1.2. Once it's in SPARQL then related standards such as SHACL would automatically "inherit" these features, e.g. for sh:minInclusive.

JervenBolleman commented 4 years ago

@dr-shorthair I expanded my comment, changed to unit:Ohm, whose casing is inconsistent over standards.

@HolgerKnublauch I also think easier support by stores for custom datatypes would be very nice. And would make implementing this feature cheaper for everyone. Let's open a separate issue for easier custom datatypes. (Also easier sharing of custom function definitions).

@VladimirAlexiev I would love full UCUM support for some projects I am involved in. I am just worried that it would be too large a code base for independent smaller SPARQL communities to implement. Also, I think we end up with a downstream licensing issue with UCUM until their license is changed, which might take a long time.

kasei commented 4 years ago

@JervenBolleman I could see standardization of service description vocabulary terms for describing which custom datatypes are supported. Beyond that, though, wouldn't "easier custom datatypes" be an issue for individual implementations (and not something the spec can/should concern itself with)? What would spec involvement in this area look like?

VladimirAlexiev commented 4 years ago

@dr-shorthair where do you have mapping tables QUDT-UCUM showing in particular the gaps on either side?

As I wrote above, it's useful to have in RDF (QUDT) what UCUM libraries provide in code.

Please comment on how you would represent the variety of UCUM strings (including annotations in curlies) as datatype URLs. You picked the easiest case K.

@sa-bpelakh Agreed! As I said above, this can only be a recommended best practice, can't be part of the SPARQL spec.

@HolgerKnublauch

perform comparisons using the built-in < and > operators, and possibly to do arithmetics such as + and * on unit'ed (is this a word?)

For comparison, + and - you need Commensurate quantities (having same dimensionality). You can apply * and / to any quantities, and also between quantities and simple numbers.
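These rules can be made concrete with a minimal Python sketch. It is not LINDT's implementation: the `Quantity` class, the three-component dimension vector (length, mass, time) and the choice to store magnitudes in coherent SI units are all assumptions made for illustration.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Quantity:
    value: Fraction  # magnitude expressed in coherent SI units
    dims: tuple      # exponents of (length, mass, time); hypothetical layout

    def _require_commensurate(self, other):
        if self.dims != other.dims:
            raise TypeError("quantities are not commensurate")

    def __add__(self, other):
        self._require_commensurate(other)
        return Quantity(self.value + other.value, self.dims)

    def __sub__(self, other):
        self._require_commensurate(other)
        return Quantity(self.value - other.value, self.dims)

    def __lt__(self, other):
        self._require_commensurate(other)
        return self.value < other.value

    def __mul__(self, other):
        # Multiplication works for any pair of quantities and for plain numbers.
        if isinstance(other, Quantity):
            dims = tuple(a + b for a, b in zip(self.dims, other.dims))
            return Quantity(self.value * other.value, dims)
        return Quantity(self.value * other, self.dims)

    def __truediv__(self, other):
        # Division likewise; exponents are subtracted.
        if isinstance(other, Quantity):
            dims = tuple(a - b for a, b in zip(self.dims, other.dims))
            return Quantity(self.value / other.value, dims)
        return Quantity(self.value / other, self.dims)


one_m = Quantity(Fraction(1), (1, 0, 0))
hundred_cm = Quantity(Fraction(100) * Fraction(1, 100), (1, 0, 0))
ten_s = Quantity(Fraction(10), (0, 0, 1))

print(one_m == hundred_cm)     # True
print((one_m / ten_s).dims)    # (1, 0, -1)
```

Adding `one_m + ten_s` raises `TypeError` in this sketch, while `one_m / ten_s` and `one_m * 3` succeed, mirroring the distinction drawn above.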

LINDT does all that.

sh:minInclusive

Yes! And sh:lessThan

@JervenBolleman

too large a code base for independent smaller SPARQL communities to implement

UCUM has implementations in many languages, they should leverage such implementations. LINDT uses UCUM Java and hooks up into Jena datatype handlers to override SPARQL operators.

kasei commented 4 years ago

too large a code base for independent smaller SPARQL communities to implement

UCUM has implementations in many languages, they should leverage such implementations. LINDT uses UCUM Java and hooks up into Jena datatype handlers to override SPARQL operators.

As an implementor of several SPARQL systems in less popular languages, I join @JervenBolleman in concern at the implementation burden. Just because something like this has implementations in several languages does not mean there wouldn't be a real cost added to many existing (and possibly future!) systems.

ashleysommer commented 4 years ago

I'm a maintainer of the Python RDFLib (including its SPARQL executor) and developer of PySHACL.

I agree with @VladimirAlexiev on this one. After reading the UCUM Spec I don't see how individual 273^^ucum:K could work for all of the possible combinations of units of measurement allowed by UCUM.

You'd need a string representation like "273 K"^^cdt:ucum, a simple example is 10 millimeters of mercury (for pressure measurement) "10 mm[Hg]"^^cdt:ucum and for a more extreme example "ventricular stroke work" in "gramforce-meter per heartbeat per square meter" "4 gf.m/({hb}.m2)"^^cdt:ucum.
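To illustrate what handling such a literal involves, here is a minimal Python sketch that splits the lexical form into a number and a UCUM code. It assumes the value and the unit are separated by the first space, as in the examples above; the actual lexical grammar accepted by LINDT may differ, and `parse_quantity` is a hypothetical helper, not part of any library.

```python
from decimal import Decimal


def parse_quantity(lexical: str):
    """Split '<number> <UCUM code>' at the first space into (Decimal, unit string)."""
    number, _, unit = lexical.strip().partition(" ")
    unit = unit.strip()
    if not unit:
        raise ValueError(f"no unit found in {lexical!r}")
    return Decimal(number), unit


print(parse_quantity("273 K"))
print(parse_quantity("10 mm[Hg]"))
print(parse_quantity("4 gf.m/({hb}.m2)"))
```

The unit part is returned untouched; validating it and converting it still requires a UCUM library.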

While the set of units defined in UCUM is closed, the microformat is created in such a way that adding new units (in a subsequent version) is easy and predictable. If every unit in the current spec was pulled out into a discrete datatype in the ucum ontology, that would need to be updated whenever a new unit is added to UCUM.

VladimirAlexiev commented 4 years ago

Here is the offending license clause:

Subject to Section 1 and the other restrictions hereof, users may incorporate portions of the UCUM table and definitions into another master term dictionary (e.g. laboratory test definition database), or software program for distribution outside of the user's corporation or organization, provided that any such master term dictionary or software program includes the following fields reproduced in their entirety from the UCUM table: UCUM code, definition value and unit. Every copy of the UCUM table incorporated into or distributed in conjunction with another database or software program must include the following notice:

“This product includes all or a portion of the UCUM table, UCUM codes, and UCUM definitions or is derived from it, subject to a license from Regenstrief Institute, Inc. and The UCUM Organization. Your use of the UCUM table, UCUM codes, UCUM definitions also is subject to this license, a copy of which is available at http://unisofmeasure.org. The current complete UCUM table, UCUM Specification are available for download at http://unitsofmeasure.org. The UCUM table and UCUM codes are copyright © 1995-2013, Regenstrief Institute, Inc. and the Unified Codes for Units of Measures (UCUM) Organization. All rights reserved.

THE UCUM TABLE (IN ALL FORMATS), UCUM DEFINITIONS, AND SPECIFICATION ARE PROVIDED "AS IS." ANY EXPRESS OR IMPLIED WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.”

If the master term dictionary or software program containing the UCUM table, UCUM definitions and/or UCUM specification is distributed with a printed license, this statement must appear in the printed license. Where the master term dictionary or software program containing the UCUM table, UCUM definitions, and/or UCUM specification is distributed on a fixed storage medium, a text file containing this information also must be stored on the storage medium in a file called "UCUM_short_license.txt". Where the master term dictionary or software program containing the UCUM table, UCUM definitions, and/or UCUM specification is distributed via the Internet, this information must be accessible on the same Internet page from which the product is available for download.


HOWEVER, see the comment above that the UCUM committee says it's ok to use UCUM as described in this issue. I.e. not to get too hung up on this legalese (which is indeed pretty bad, compared to modern open licences)

dr-shorthair commented 4 years ago

@VladimirAlexiev

where do you have mapping tables QUDT-UCUM showing in particular the gaps on either side?

  1. get https://github.com/qudt/qudt-public-repo/blob/master/vocab/unit/VOCAB_QUDT-UNITS-ALL-v2.1.ttl
  2. Run
    PREFIX qudt: <http://qudt.org/schema/qudt/>
    SELECT *
    WHERE {
    ?p a qudt:Unit .
    MINUS { ?p a qudt:CurrencyUnit . }
    OPTIONAL { ?p qudt:ucumCode ?u . }
    }
  3. I don't think there are any gaps on the QUDT side relative to the UCUM terminals, but since UCUM does not define a closed set (novel combinations are always possible) there will not be a member in the QUDT catalogue for every arbitrary UCUM code.

dr-shorthair commented 4 years ago

Regarding the license, remember that (notwithstanding the use of XML for the reference data) UCUM was developed in a pre-linked-data, and pre-CC world. And since UCUM is widely built in to medical and clinical software, there was a concern to ensure that there not be any muddling of libraries and conversions. That would be a real problem. As I am in contact with both the QUDT and UCUM maintainers, clarifying the license is a priority to clear up any issues about the appearance of UCUM codes alongside a separately derived set of conversion factors. But this should definitely NOT impede mentioning UCUM codes in RDF and SPARQL.

Note that the US National Library of Medicine is now providing the main support for UCUM, including an API and a Javascript library.

dr-shorthair commented 4 years ago

@ashleysommer

I agree with @VladimirAlexiev on this one. After reading the UCUM Spec I don't see how individual 273^^ucum:K could work for all of the possible combinations of units of measurement allowed by UCUM.

Indeed, we do need to think this through.

For the pattern that @HolgerKnublauch and I advocate, the ucum:XXXX datatype reference must implicitly point to a member of an (unbounded) set of codes, composed of every possible 'legal' combination of the UCUM terminals. I don't think I've seen a datatype reference used in that way before.

Also, I haven't checked what the possible escaping implications of the UCUM grammar are for a token appearing in that position in an RDF graph. / . ( ) [ ] { } ' " - + are all used in UCUM codes. / ( ) can mostly be avoided by using the option of negative exponents and dots, but there might be difficulty with some of the others.

JervenBolleman commented 4 years ago

I thought it useful to describe my gut feeling regarding UCUM in the literals. In RDF we have always gone for things, not strings. Here I feel we are regressing somewhat. Knowing already that there are known confusions in UCUM strings, I am personally hesitant to recommend it in a linked data setting. Specifically, the way that UCUM specifies behaviour towards annotations {} is ripe for issues in implementations/comparisons (if I read it wrong, let me know).

dr-shorthair commented 4 years ago

The messiness around {annotations} is certainly a thing.

However, in my opinion the primary challenge is that the actual gamut of units-of-measure used in practice is effectively infinite. And that is not a bug, it is a feature. So any approach to units of measure has to address that.

The good news is that it is not intractable. You can define an algorithm to build the symbol, using a finite set of atoms (hundreds, but not thousands), so that it looks exactly like the common symbol for common units. This is all laid out in ISO 80000 and adapted for the ASCII keyboard by UCUM. There is an unambiguous mapping from any UCUM code to its semantics, though the codes for many derived units are not unique (to take a simple case, metres-per-second can be written either m/s or m.s-1).
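The non-uniqueness of codes can be handled by reducing each code to a map of exponents before comparing. The Python sketch below does this for a deliberately tiny subset of UCUM: letter-only atoms, integer exponents, and the `.` and `/` operators applied left to right, with no parentheses, brackets or annotations. It treats each atom as an opaque symbol (no prefix handling), and `exponents` is a hypothetical helper rather than part of a UCUM library.

```python
import re

# One term: optional operator, atom made of letters, optional signed exponent.
TERM = re.compile(r"([./]?)([A-Za-z]+)([+-]?[0-9]+)?")
# Whole code: terms joined by '.' or '/', optionally starting with '/'.
VALID = re.compile(
    r"/?[A-Za-z]+(?:[+-]?[0-9]+)?(?:[./][A-Za-z]+(?:[+-]?[0-9]+)?)*"
)


def exponents(code: str) -> dict:
    """Map each atom of a simple UCUM code to its net exponent."""
    if not VALID.fullmatch(code):
        raise ValueError(f"outside the supported subset: {code!r}")
    result = {}
    for op, atom, exp in TERM.findall(code):
        power = int(exp) if exp else 1
        if op == "/":
            power = -power  # a term directly after '/' is a divisor
        result[atom] = result.get(atom, 0) + power
    return {a: p for a, p in result.items() if p != 0}


print(exponents("m/s") == exponents("m.s-1"))        # True
print(exponents("kg.m/s2") == exponents("kg.m.s-2")) # True
```

A full implementation would also need prefixes, parentheses and the bracketed atoms, which is where an existing UCUM library comes in.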

I understand your concern about 'literals vs things', but I fear that it follows from the actual requirement - which I believe is to have a general way of representing the scale (i.e. unit of measure) for any quantity in RDF. If you can see another solution that satisfies the requirement and also meets the reasonable expectations of people who would be typing or selecting these things (e.g. scientists and engineers can already read or write m/s or m.s-1), then that would be great.

HolgerKnublauch commented 4 years ago

Just as input here: Assuming we would use units as datatypes and not encoded in strings. It becomes possible to use rdfs:range, owl:allValuesFrom or sh:datatype statements, e.g.

ex:Thing-width a sh:PropertyShape ; sh:path ex:width ; sh:datatype ucum:m .

ex:SomeThing ex:width "42"^^ucum:m .

versus

ex:SomeThing ex:width "42 m"^^ucum:unit .

The former appears less redundant, e.g. on user input widgets: why repeat yourself? But more importantly, with the latter someone would need to invent new ways of stating what kinds of values are permitted as values, e.g. a new SHACL constraint component for the kind of unit (length vs temperature, for example). This might still be needed in cases where users can choose between multiple units, but not every use case is like that.

If there is a well-defined mapping from unit variations to strings then I guess these strings can also be used for the local name of the datatype URIs. Once the datatypes have URIs, it arguably becomes also easier to attach additional machine-processable information to them such as the conversion factor to the base unit. With strings all this becomes rather hidden in non-declarative algorithms.

dr-shorthair commented 4 years ago

@ashleysommer

If every unit in the current spec was pulled out into a discrete datatype in the ucum ontology, that would need to be updated whenever a new unit is added to UCUM.

I don't think it is feasible to generate and maintain static semantic representations of the unit denoted by every UCUM code ever cited.

However, it is possible to dynamically (a) verify that a code is valid UCUM, and (b) give the conversion multiplier to the SI unit with the same dimension vector. The services at NLM do the checking and conversion, and QUDT can provide the dimension vector and the SI units (with a bit of help from SPARQL).

VladimirAlexiev commented 4 years ago

@JervenBolleman

known confusions in UCUM strings

Please give some pointers or examples

behaviour towards annotations {} is ripe for issues in implementations/comparisons

Annotations in particular are easy to handle: they don't change the meaning so you just discard them. And they cannot be nested.
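A minimal Python sketch of discarding annotations is shown below. It relies on the fact stated above that annotations cannot be nested, and it assumes that an annotation standing alone as a term is equivalent to the unity `1`, while an annotation attached to a unit is simply dropped. `strip_annotations` is a hypothetical helper, not a function from a UCUM library.

```python
import re

# An annotation that forms a whole term: at the start, or right after '.', '/' or '('.
STANDALONE = re.compile(r"(^|[./(])\{[^{}]*\}")
# Any remaining annotation is attached to a preceding unit symbol.
ATTACHED = re.compile(r"\{[^{}]*\}")


def strip_annotations(code: str) -> str:
    """Remove UCUM annotations, replacing standalone ones with the unity '1'."""
    code = STANDALONE.sub(r"\g<1>1", code)
    return ATTACHED.sub("", code)


print(strip_annotations("{H.B.}/min"))       # 1/min
print(strip_annotations("{drops}/min"))      # 1/min
print(strip_annotations("gf.m/({hb}.m2)"))   # gf.m/(1.m2)
print(strip_annotations("1/s{frequency}"))   # 1/s
```

Note that the first two codes become identical after stripping, so whatever distinction the annotations carried is no longer visible to a comparison.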

More importantly, handling UCUM (parsing, conversions, etc) is a solved problem in major languages.

@HolgerKnublauch

ex:SomeThing ex:width "42 m"^^ucum:unit

LINDT includes the basic dimensional units eg cdt:length, and there's a function to check quantity conformance.

JervenBolleman commented 4 years ago

@ashleysommer @dr-shorthair I don't even think we would ever need to generate a file with all possible combinations. As long as we can make them linked data (there is of course the prior art of an infinite set of numbers being available as linked open data ;) ).

There is potential for a grammar to generate the datatype IRIs. The '{}' brackets are currently not escapable (not part of the set PN_LOCAL_ESC).

So today we can do, in valid SPARQL 1.1:

SELECT *
WHERE
   { ?measurement1 ex:value "1"^^unit:kg\/mass .
     ?measurement2 ex:value "2"^^unit:\% .
 }

We can't do

SELECT *
WHERE  { 
   ?measurement1 ex:value "1"^^unit:kg\/mass\{person\} .
   ?measurement2 ex:value "2"^^unit:\%\{vol\} .
 }

The {} brackets are probably not in the escape set because they are not allowed in IRIs per RFC 3987; however, the WHATWG URL specification (HTML5) allows them. For SPARQL I feel we can follow HTML5.
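The escaping limitation can be checked mechanically. The Python sketch below escapes a UCUM code for use as the local part of a prefixed name, using the character set of the SPARQL 1.1 `PN_LOCAL_ESC` production, and rejects characters that cannot be escaped. It is a simplification: it escapes every listed character even where the grammar would allow it unescaped, and `escape_local_name` is a hypothetical helper.

```python
# Characters that SPARQL 1.1 allows after a backslash in a local name (PN_LOCAL_ESC).
PN_LOCAL_ESC = set("_~.-!$&'()*+,;=/?#@%")


def escape_local_name(code: str) -> str:
    """Escape a code for use as a prefixed-name local part, or fail if impossible."""
    out = []
    for ch in code:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ch in PN_LOCAL_ESC:
            out.append("\\" + ch)
        else:
            raise ValueError(f"{ch!r} cannot be escaped in a local name")
    return "".join(out)


print(escape_local_name("kg/mass"))  # kg\/mass
print(escape_local_name("%"))        # \%
```

Calling `escape_local_name("%{vol}")` raises `ValueError` in this sketch, which corresponds to the second query above being invalid; square brackets such as those in `mm[Hg]` fail in the same way.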

JervenBolleman commented 4 years ago

@VladimirAlexiev See UCUM itself

UCUM annotations are easily discarded, but annotations do have semantics. {H.B.}/min is not the same as {drops}/min and just discarding is risky in a clinical setting.

Lots of things are solved in existing languages with known libraries. But in my opinion, that does not mean they should be copy-pasted into the SPARQL ecosystem. But don't let my opinion stop you from writing a SEP.

nicholascar commented 4 years ago

the microformat is created in such a way that adding new units (in a subsequent version) is easy and predictable

microformats in RDF === bad

I'm struggling with microformats for WKT etc. in GeoSPARQL 1.1. Please avoid microformats and use the main RDF methods of dealing with the meaning of data.

A user of RDF/SPARQL shouldn't have to resort to a micro format parser to understand any of the content in databases / queries, only graph walking and Linked Data dereferencing.

VladimirAlexiev commented 4 years ago

@nicholascar There are many successful microformats in RDF:

Just because you can represent many things in RDF doesn't mean you should. IMHO you should represent in RDF only "high-value data":

Normal people may say:

:said "WTF?"@zh-Hant-Latn-pinyin;
:on "2020-11-10T12:05:01"^^xsd:dateTime;
:repeat "100 1/s{frequency}"^^cdt:ucum

Moderate RDF dogmatists that "only" want to replace the first 2 bullets above may have to say:

:said [
  a :Sentence;
  :kind :interrogative;
  :value [:chars ("W" "T" "F")];
  :language [
    a :LanguageSpecification;
    :lang iana_lang:zh;
    :dialect iana_lang:Hant;
    :script iana_lang:Latn;
    :latinization iana_lang:pinyin
  ]];
:on [
  a :DateTime;
  :year 2020;
  :month 11;
  :day 10;
  :hour 12;
  :minutes 05;
  :seconds 01];
:repeat [
  a :Quantity;
  :value [
    a :Integer;
    :literal "100"];
  :unit ucum:1\/s\{frequency\}];

Just two more examples:

Ontotext is a repository vendor so I have a vested interest to deal with more and more triples. Nevertheless, I believe that dogmatic approaches harm RDF: adding more triples and complexity turns people away from RDF.

There is https://github.com/w3c/EasierRDF ... but what you are arguing here is for harder RDF.

nicholascar commented 4 years ago

@VladimirAlexiev your comments both overstate my interests and understate my understanding. I don't mean we should break everything down into triples, and I have argued for years to preserve some microformats, or non-RDF formats, for certain purposes. I would not suggest turning all literals into chained nodes.

The issues in GeoSPARQL 1.1 are about whether Coordinate Reference Systems need to be encoded in microformats, like GML/WKT, or should be broken out into other properties within Geometry elements.

I do think that units of measure are one of the scenarios that should be represented with RDF and not microformats. Conversion vectors, notation, etc. all fine to be microformats for UoM but the core ideas of numerical quantities, scales and so on should be in RDF. The reasoning is that we are talking about multiple orthogonal concerns within units, scales and so on of measure, and to be able to work out whether measurements across datasets are commensurate, i.e. can be used together properly, we need to be able to join on one or more of these orthogonal dimensions of measurements. If we don't call out those dimensions as first class things in RDF, we can't easily match them with graph pattern matching like SPARQL.

SELECT ?measurement
WHERE {
  ?measurement 
    ex:hasUnit <some-unit> ;
    ex:ofSomeProperty <some-property> ;
    ex:hasNumericalValue ?value .

  FILTER (?value > "x" && ?value < "y")
}

Multiple dimensions are being matched above, and there could be others. UoM ontologies think all this through...

Regarding GeoSPARQL, I do think that Coordinate Reference Systems should be recorded outside coordinate literals (i.e. a property to indicate WGS84 rather than "crs:wgs84 POLYGON (...)"), but for historical reasons and for the whole non-RDF world, we already have this CRS recording taking place within the literal, so we have to continue to handle it in future GeoSPARQLs.

But for literals in domains for which there aren't strong and widely accepted microformats, we shouldn't build them in to RDF/SPARQL. UoM is such a domain: there really isn't a universally accepted set of microformats for them like there is for, say, WKT.

VladimirAlexiev commented 4 years ago

@nicholascar The motivation for having the CRS in wktLiteral is that in this way it's self-describing. A literal that uses easting/northing is quite different from a literal that uses latitude/longitude.

However, I agree with you that keeping the CRS as a separate prop is better. (I thought you're arguing to break the literal down to points).

work out whether measurements across datasets are commensurate

But it's easy to connect UCUM and QUDT!

We can pack this to a SPARQL function lindt:qudtUnit(cdt:ucum). Why should you have to do such lookup? Because it's a much easier task than doing quantity comparisons and arithmetics!

FILTER (?value > "x" && ?value < "y")

Ah, but here is the key! What would you write for x and y to conform to <some-unit> that you fetched dynamically? Or how would you compare two commensurate quantities that you fetched dynamically?
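To make the difficulty concrete, here is a minimal Python sketch of what a client has to do when value and unit are stored separately: look up a conversion factor for each dynamically fetched unit and normalise everything before comparing. The `TO_METRE` table and the `to_metres` and `in_range` helpers are hypothetical names for illustration only; they are not part of QUDT, UCUM or LINDT.

```python
# Hypothetical conversion table: factor that maps each length unit to metres.
TO_METRE = {"m": 1.0, "cm": 0.01, "km": 1000.0}


def to_metres(value: float, unit: str) -> float:
    """Normalise a (value, unit) pair to metres; raises KeyError for unknown units."""
    return value * TO_METRE[unit]


def in_range(value, unit, low, low_unit, high, high_unit) -> bool:
    """True if low < value < high once all three are normalised to metres."""
    v = to_metres(value, unit)
    return to_metres(low, low_unit) < v < to_metres(high, high_unit)


# A measurement in cm compared against bounds given in m.
print(in_range(150, "cm", 1, "m", 2, "m"))
```

With a quantity-aware datatype this normalisation would happen inside the comparison operator instead of in every query or client.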

Please note that the makers of QUDT (@HolgerKnublauch et al) agree the unit should go together with the literal. They only argue it should be in a datatype not in the string (eg "1"^^ucum:m instead of "1m"^^cdt:ucum or the more concrete "1m"^^cdt:length).

UoM ontologies think all this through

Except comparisons and arithmetics

UoM is such a domain: there really isn't a universally accepted set of microformats

NLM says you're wrong, see https://github.com/w3c/sparql-12/issues/129#issuecomment-722440526 and eg https://specifications.openehr.org/releases/RM/Release-1.0.3/data_types.html#_dv_quantity_class

It is ironic that there isn't a universally accepted way to express the two parts of a quantity in RDF, as you illustrate above:

ex:hasUnit <some-unit> ; ex:hasNumericalValue ?value .

https://schema.org/QuantitativeValue and https://schema.org/PropertyValue specify something like that, but they are a bit of a mess of compromises:

VladimirAlexiev commented 4 years ago

@JervenBolleman

UCUM annotations are easily discarded, but annotations do have semantics. {H.B.}/min is not the same as {drops}/min and just discarding is risky in a clinical setting.

Agreed! They have their meaning, but no computational semantics. The two above are comparable to Hz (with a conversion factor of 1/60), but that doesn't mean it is meaningful to compare them.
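As a rough illustration of "no computational semantics", the following Python sketch drops curly-brace annotations before any unit analysis, so both examples reduce to the same computable unit. The `strip_annotations` helper and its normalisation of a leading "/" to "1/" are assumptions made for this sketch, not behaviour taken from any UCUM library.

```python
import re

# UCUM annotations are text in curly braces, e.g. "{drops}/min".
ANNOTATION = re.compile(r"\{[^{}]*\}")


def strip_annotations(unit: str) -> str:
    """Remove annotations; a bare annotation is treated as the unity "1"."""
    stripped = ANNOTATION.sub("", unit)
    # "{drops}/min" becomes "/min"; make the implicit numerator explicit.
    if stripped == "" or stripped.startswith("/"):
        stripped = "1" + stripped
    return stripped


for code in ("{H.B.}/min", "{drops}/min", "mL{total}"):
    print(code, "->", strip_annotations(code))
```

Both per-minute codes come out identical, which is exactly why they are comparable by machine even though comparing them may not be clinically meaningful.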

To quote from https://ucum.org/ucum.html#para-6

Annotations do not contribute to the semantics of the unit but are meaningless by definition. Therefore, any fully conformant parser must discard all annotations. Curly braces are here because people want annotations and deeply believe that they need annotations. Especially in chemistry and biomedical sciences, there are traditional habits to write annotations at units or instead of units, such as “%vol.”, “RBC”, “CFU”, “kg(wet tis.)”, or “mL(total)”. These habits are hard to overcome. Any attempt of a coding scheme to restrict this perceived expressiveness will ultimately result in the coding scheme not being adopted, or just “half-way” adopted (which is as bad as not adopted). Two alternative responses to this reality exist: either give in to the bad habits and blow up of the code with dimension- and meaningless unit atoms, or canalize this habit so that it does no harm. The Unified Code for Units of Measure canalizes this habit using curly braces.

@HolgerKnublauch and @nicholascar Please heed the bold text in the previous paragraph. As @dr-shorthair and I explained, UCUM describes infinitely many units.

VladimirAlexiev commented 4 years ago

@kasei I found only this for Perl: http://perl.overmeer.net/geo/html/jump.cgi?Geo_EOP&375 . It's part of http://perl.overmeer.net/geo-eop/source/ (2015-07). http://perl.overmeer.net/geo/#versions claims it's released on CPAN but I can't find it. And it only supports angle, distance.

See https://github.com/lhncbc/ucum-lhc/tree/master/data for implementation resources, in particular ucumDefs.json and ucum-essence.xml

dr-shorthair commented 4 years ago

I suspect we are in almost violent agreement here folks!

There are a few genuine issues, but none that I would die in a ditch over. For example, I could live with either style of encoding. I currently prefer "273.1"^^ucum:K because even I can write a simple SPARQL 1.1 query to find all the values with a specific unit, or to find the unit (expressed as a URI) for any value. But if a SPARQL "1.2" function was available to parse strings like "273.1 K" for me, and return the numeric part and the unit, then I could certainly live with that too.

(Best of all if there was also a function to convert scaled values to and from SI values.)
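A minimal Python sketch of what such a parsing function might do is shown below. It splits a lexical form like "273.1 K" into its numeric part and its unit string. The `Quantity` type, the `parse_quantity` name and the regular expression are assumptions made for illustration; the real LINDT lexical space may differ.

```python
import re
from typing import NamedTuple


class Quantity(NamedTuple):
    value: float
    unit: str


# Leading decimal number (optional sign and exponent), then the unit string.
LEXICAL = re.compile(r"^\s*([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*(\S.*?)\s*$")


def parse_quantity(lexical: str) -> Quantity:
    """Split a 'value unit' lexical form into a number and a unit string."""
    match = LEXICAL.match(lexical)
    if match is None:
        raise ValueError(f"not a quantity literal: {lexical!r}")
    return Quantity(float(match.group(1)), match.group(2))


print(parse_quantity("273.1 K"))
print(parse_quantity("100 cm"))
```

Once the two parts are separated, finding all values with a given unit is a plain equality test on the `unit` field, much as it would be with a per-unit datatype.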

If you got the impression @nicholascar thought the gamut of units was finite, then that is just a miscommunication or misreading at some point. Nick knows a lot more than that. But as you detected, I like UCUM a lot because it supports all the combinations explicitly. And it is the product of long experience in a huge field. I dislike QUDT because the maintainers (currently) store way too many static representations, and I look forward to when it is a dynamic system. I very much like QUDT because it has explicit dimension vectors, and support for (semantic) QuantityKinds. Neither satisfies all requirements.

VladimirAlexiev commented 4 years ago

@JervenBolleman re https://ucum.org/ucum.html#section-Summary-of-Conflicts

I checked the first two:

| Conflict unit | Metric | Non-metric |
| --- | --- | --- |
| Pa | Pascal (pressure) | Peta-annum (peta-years) |
| Gb | Gilbert (magnetic tension) | Giga-barn (action area) |

These non-metric units are extremely unlikely, and as it says

there is only a conflict if the metric predicate is violated so that non-metric units are used with a prefix

I checked at https://ucum.nlm.nih.gov/ucum-lhc/demo.html, and this JS library resolves to the metric unit.
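The rule quoted above can be sketched in a few lines of Python: an exact unit atom wins, and a prefix is only tried on atoms flagged as metric. The tiny `UNITS` and `PREFIXES` tables and the `resolve` function are hypothetical illustrations that follow the premise of this comment (annum and barn treated as non-metric); they are not the data or API of ucum-lhc.

```python
# Hypothetical excerpt of a unit table: symbol -> (name, is_metric).
UNITS = {
    "Pa": ("pascal", True),
    "a": ("annum", False),
    "Gb": ("gilbert", True),
    "b": ("barn", False),
    "m": ("metre", True),
}
PREFIXES = {"P": "peta", "G": "giga", "k": "kilo"}


def resolve(symbol: str) -> str:
    """Resolve a unit symbol, allowing prefixes only on metric atoms."""
    # 1. An exact match on a unit atom always wins.
    if symbol in UNITS:
        return UNITS[symbol][0]
    # 2. Otherwise try prefix + atom, but only for atoms flagged metric.
    for prefix, prefix_name in PREFIXES.items():
        atom = symbol[len(prefix):]
        if symbol.startswith(prefix) and atom in UNITS and UNITS[atom][1]:
            return prefix_name + UNITS[atom][0]
    raise ValueError(f"unknown unit: {symbol}")


print(resolve("Pa"), resolve("Gb"), resolve("km"))
```

Under these assumptions "Pa" and "Gb" can only mean pascal and gilbert, and a prefixed non-metric atom such as "Pb" is rejected rather than silently accepted.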

I think these few conflicts are not a serious concern.

JervenBolleman commented 4 years ago

@VladimirAlexiev there are more conflicts when using annotations (and that is my clinical experience): the same annotation meaning different things in different systems. These are a serious concern for my day job.

Also, UCUM is limited in reach, e.g. it is missing Indian survey feet and many more units known outside of the US. Which is why I think the unit should be a datatype and not encoded in the literal. That way we can always mint our own datatypes and be assured of no collisions.

The constraints that UCUM operates under (needs to be a case insensitive string in ASCII 7bit) are IMHO so tight as to make some options very difficult.

Requiring units to be fully computed for comparison to work is common in all other literals. e.g. we don't support FILTER("2+2"^^xsd:integer = 4) and that is natural. I don't see why we should support FILTER("(1000m)/(60min)"^^ucum:... = "1km/hr"^^ucum:... = "16.6666666667"^^ucum:m\/min).

This becomes important when dealing with historical data where measurements have been redefined but the same unit name has been used. This is outside the clinical world but should also be considered as this specification is about more than just clinical use.

I am not convinced that the temporary relief by including UCUM/LINDT into encoded literals for sparql engines is the way to go. I think we can get much more value from different approaches to custom datatypes and shareable functions.

Experience with xsd:duration shows that microformats are a real cost to implementers, and xsd:duration is much better supported in the wild across language ecosystems than UCUM is.

dr-shorthair commented 4 years ago

Which is why I think the unit should be a datatype and not encoded in the literal.

Yes - this is a strong argument. It is more immediately extensible.

The constraints that UCUM operates under (needs to be a case insensitive string in ASCII 7bit)

There is a case-sensitive option. I used the case-sensitive version when I added UCUM codes to QUDT.

VladimirAlexiev commented 4 years ago

@JervenBolleman

I looked at one of the Java implementations https://github.com/unitsofmeasurement/uom-systems/ and they mention more UoM systems (although I could not find a "Unicode CLDR Unit System").

Also, in issue https://github.com/unitsofmeasurement/uom-systems/issues/156 they state "UCUM development seems to have stalled since 2017".

Maybe we should read JSR 385. Here's its use cases section. I'm not sure whether it specifies particular UoM systems to support.

we don't support FILTER("2+2"^^xsd:integer = 4) and that is natural

But we support FILTER("2"^^xsd:integer + 2 = 4). What is unnatural is that we don't support "1 m"^^cdt:ucum + "100 cm"^^cdt:ucum or "1"^^ucum:m + "100"^^ucum:cm.

same annotation meaning different things in different systems. These are a serious concern for my day job.

Then work on standardizing annotations. At least UCUM has made space for them, which no other UoM ontology or system has done.


Jerven and @HolgerKnublauch, do we all agree that we must have overloaded operators to handle comparisons of comparable units, and arithmetics (between all kinds of units, and with numbers)?
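To show what "overloaded operators" would mean in practice, here is a minimal Python sketch in which `+` and `<` work directly on quantities and refuse incommensurable operands. The `Quantity` class and the `UNIT_INFO` table are assumptions made for illustration; they do not reproduce the LINDT implementation.

```python
from dataclasses import dataclass

# Hypothetical table: unit -> (dimension, factor to the dimension's base unit).
UNIT_INFO = {
    "m": ("length", 1.0),
    "cm": ("length", 0.01),
    "s": ("time", 1.0),
    "min": ("time", 60.0),
}


@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str

    def _base_values(self, other):
        """Return both values in the shared base unit, or fail on a dimension mismatch."""
        dim_a, factor_a = UNIT_INFO[self.unit]
        dim_b, factor_b = UNIT_INFO[other.unit]
        if dim_a != dim_b:
            raise TypeError(f"cannot combine {dim_a} with {dim_b}")
        return self.value * factor_a, other.value * factor_b

    def __add__(self, other):
        a, b = self._base_values(other)
        # Express the sum in the unit of the left operand.
        return Quantity((a + b) / UNIT_INFO[self.unit][1], self.unit)

    def __lt__(self, other):
        a, b = self._base_values(other)
        return a < b


print(Quantity(1, "m") + Quantity(100, "cm"))
print(Quantity(50, "cm") < Quantity(1, "m"))
```

Returning the sum in the left operand's unit is only one possible convention; the canonical-unit and unit-ladder proposals later in the thread choose the result unit differently.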

ericprud commented 4 years ago

@VladimirAlexiev

@JervenBolleman

we don't support FILTER("2+2"^^xsd:integer = 4) and that is natural

But we support FILTER("2"^^xsd:integer + 2 = 4). What is unnatural is that we don't support "1 m"^^cdt:ucum + "100 cm"^^cdt:ucum or "1"^^ucum:m + "100"^^ucum:cm.

I think a more apt analogy for "1"^^ucum:m + "100"^^ucum:cm would be FILTER(2 + 2.0 = 4). The fact that "2"^^xsd:integer parses to the same internal representation as 2 is just a feature of the parser semantics. The ability to add a double and an integer and compare the result to an integer (in fact, the comparison substitutes the double 4.0) is orchestrated by XPath's numeric type promotion and type substitution. Extrapolating that to apply to units would give us that same functionality and some nice unit analysis as a side benefit. I can see a couple ways to do that:

Canonical units

For every dimension we specify (length, charge, mass...), pick a canonical unit. (MKS would be practical and would add another attractor tugging the US forward to the 18th century.) Enumerate all of the compatible units with linear functions mapping them to the canonical:

| Unit | Add | Then multiply by | Canonical unit |
| --- | --- | --- | --- |
| ucum:m | +0 | 1 | ucum:m |
| ucum:in | +0 | 0.0254 | ucum:m |
| ucum:f | -32 | 1/1.8 | ucum:c |

Any evaluation requiring the promotion of the left column to the right column applies the transformation and leaves you with the canonical units. Where the current operator table has entries like

| Operator | Type(A) | Type(B) | Function | Result type |
| --- | --- | --- | --- | --- |
| A + B | numeric | numeric | op:numeric-add(A, B) | numeric |

we could add entries for the dimensions:

| Operator | Type(A) | Type(B) | Function | Result type |
| --- | --- | --- | --- | --- |
| A + B | length | length | op:numeric-add(A, B) | length |

This is cool because the operator table prevents us from adding a length to a time. It's a little funny because everything gets metrified, e.g. BIND("1"^^ucum:ft + "1"^^ucum:in AS ?x) will give you ".3302"^^ucum:m.
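The canonical-unit idea can be sketched in Python as a lookup table of linear mappings plus an addition that only accepts operands with the same canonical unit. The `CANONICAL` table, `to_canonical` and `add` are hypothetical names; the "ft" row is an addition for the example, and the Fahrenheit factor is written as 1/1.8 on the assumption that the mapping is (F - 32) / 1.8.

```python
# Unit -> (offset added first, factor applied second, canonical unit).
# Hypothetical table for illustration; "ft" is added for the example below.
CANONICAL = {
    "m": (0.0, 1.0, "m"),
    "in": (0.0, 0.0254, "m"),
    "ft": (0.0, 0.3048, "m"),
    "f": (-32.0, 1 / 1.8, "c"),
}


def to_canonical(value, unit):
    """Apply the linear mapping and return (canonical value, canonical unit)."""
    offset, factor, canonical = CANONICAL[unit]
    return (value + offset) * factor, canonical


def add(a, a_unit, b, b_unit):
    """Add two quantities; the result is always in the canonical unit."""
    a_val, a_canon = to_canonical(a, a_unit)
    b_val, b_canon = to_canonical(b, b_unit)
    if a_canon != b_canon:
        raise TypeError(f"cannot add {a_unit} to {b_unit}")
    return a_val + b_val, a_canon


print(add(1, "ft", 1, "in"))
```

The result unit here is always the canonical one, which is the "metrified" behaviour described above.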

Unit ladder

We could ameliorate that a bit by grouping entries in the type promotion hierarchy so that known imperial units stay imperial and get promoted to the smallest imperial unit, so BIND("1"^^ucum:ft + "1"^^ucum:in AS ?x) will give you "13"^^ucum:in. Things that don't fit into one of those groups would still get metrified (yes, I made that word up), e.g. BIND("1"^^ucum:lightyear + "1"^^ucum:parsec AS ?x) will give you "4.0318165349E16"^^ucum:m.
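A possible reading of the unit ladder, sketched in Python: if both operands belong to the same group (here, imperial lengths), the result stays in the smaller of the two operand units; otherwise it falls back to metres. The `IMPERIAL` and `METRIC` tables and the `add_lengths` function are assumptions made for this sketch, and "smallest" is interpreted as the smaller of the two operand units.

```python
from fractions import Fraction

# Hypothetical ladders. Imperial sizes are in inches, metric sizes in metres.
IMPERIAL = {"in": 1, "ft": 12, "yd": 36, "mi": 63360}
METRIC = {"mm": Fraction(1, 1000), "cm": Fraction(1, 100), "m": Fraction(1)}
METRES_PER_INCH = Fraction("0.0254")


def metres(unit):
    """Exact size of one unit in metres."""
    if unit in IMPERIAL:
        return IMPERIAL[unit] * METRES_PER_INCH
    return METRIC[unit]


def add_lengths(a_value, a_unit, b_value, b_unit):
    """Add two lengths; stay imperial when both operands are imperial."""
    total = Fraction(a_value) * metres(a_unit) + Fraction(b_value) * metres(b_unit)
    if a_unit in IMPERIAL and b_unit in IMPERIAL:
        # Promote to the smaller of the two operand units.
        target = min((a_unit, b_unit), key=IMPERIAL.__getitem__)
    else:
        target = "m"
    return total / metres(target), target


print(add_lengths(1, "ft", 1, "in"))
print(add_lengths(1, "ft", 1, "cm"))
```

Exact rational arithmetic is used so that the imperial case lands on a whole number of inches rather than a rounded float.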

P.S.

It would be lovely to extend the grammar so we could write 1ft instead of "1"^^ucum:foot (which, as a parser feature, is orthogonal to the "1"^^ucum:foot vs. "1ft"^^ucum:length debate). I guess feasibility comes down to how crazy the lexical strings for the units are.

sa-bpelakh commented 4 years ago

@VladimirAlexiev

Canonical units

I like the design for canonical units, and the implementation is well defined. I definitely prefer "1"^^ucum:foot instead of "1ft"^^ucum:length, because the unit implies the dimension, and avoids a micro-grammar in the literal value.

I think the complexity of the unit ladder could be avoided if you allow casting conversions, e.g. BIND(ucum:foot(?a + ?b + ?c) AS ?length_in_feet), to guarantee a specific unit (and do dimension checking in the process).

kasei commented 4 years ago

@ericprud

It's a little funny because everything gets metrified, e.g. BIND("1"^^ucum:ft + "1"^^ucum:in AS ?x) will give you ".3302"^^ucum:m.

I would think this could be handled just like the XPath constructor functions:

ucum:in("1"^^ucum:ft + "1"^^ucum:in) => "13"^^ucum:in

(Though there might be some funny floating point error issues to consider.)
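One way to sidestep the floating-point concern is to do the cast in exact rational arithmetic. The Python sketch below imitates a constructor-style cast such as `ucum:in(...)`; the `METRES_PER` table and the `cast` function are hypothetical names for illustration, and an engine could equally use decimals.

```python
from fractions import Fraction

# Hypothetical exact conversion factors to metres.
METRES_PER = {
    "m": Fraction(1),
    "ft": Fraction("0.3048"),
    "in": Fraction("0.0254"),
}


def cast(target, value, unit):
    """Constructor-style cast: re-express `value unit` in the `target` unit."""
    return Fraction(value) * METRES_PER[unit] / METRES_PER[target]


# "1 ft + 1 in" evaluated in the canonical unit, then cast back to inches.
metrified = cast("m", 1, "ft") + cast("m", 1, "in")
print(metrified, cast("in", metrified, "m"))
```

With binary floats the same round trip is not guaranteed to land exactly on a whole number, which is the kind of issue the remark above anticipates.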