nvs-vocabs / P06

A controlled vocabulary for units of measurement

Add UCUM codes #12

Open dr-shorthair opened 3 years ago

dr-shorthair commented 3 years ago

Suggest adding the UCUM codes, as a skos:notation, maybe with a datatype. UCUM is a very sound set of unit symbols generally matching the ones already used.

UCUM spec is here: https://ucum.org/ucum.html

And there is a UCUM-based quantity conversion service API here: https://ucum.nlm.nih.gov/ucum-service.html (and a UI here: https://ucum.nlm.nih.gov/ucum-lhc/demo.html )

gwemon commented 3 years ago

Thank you @dr-shorthair . I'll add it as a possible enhancement. It does not look particularly straightforward to me. Maybe we can discuss this next time we meet.

dr-shorthair commented 3 years ago

Here you are.

P06-ucum.ttl

gwemon commented 3 years ago

👍 Thank you @dr-shorthair

dr-shorthair commented 3 years ago

Added SQUM P06-ucum.ttl

gwemon commented 3 years ago

Update on this: adding UCUM codes as an extra SKOS element would require us to extend our schema, but we think it would be a good move to replace our existing altLabel with one (our preferred) UCUM symbol for each P06 unit. We will circulate this proposal to key users of the P06 vocab. This would enable us to harmonise against the impressive work done by the UCUM team.

dr-shorthair commented 3 years ago

When you say 'extend our schema' do you mean (a) the Oracle schema used for maintenance and point-of-truth, or (b) the RDF output schema? If the latter, then I don't think an extension is needed: skos:notation would be appropriate.

I see you are already using skos:notation for the SDN identifier (duplicated in dc:identifier and dce:identifier). This should not be a problem: skos:notation can be repeated. However, to make it useable I'd recommend adding an rdfs:Datatype for UCUM - e.g. see https://github.com/qudt/qudt-public-repo/blob/master/schema/SCHEMA_QUDT-v2.1.ttl#L1577 which is used in https://github.com/qudt/qudt-public-repo/blob/master/vocab/unit/VOCAB_QUDT-UNITS-ALL-v2.1.ttl

dr-shorthair commented 3 years ago

Fix a few errors P06-ucum.ttl

gwemon commented 3 years ago

> When you say 'extend our schema' do you mean (a) the Oracle schema used for maintenance and point-of-truth, or (b) the RDF output schema? If the latter, then I don't think an extension is needed: skos:notation would be appropriate.
>
> I see you are already using skos:notation for the SDN identifier (duplicated in dc:identifier and dce:identifier). This should not be a problem: skos:notation can be repeated. However, to make it useable I'd recommend adding an rdfs:Datatype for UCUM - e.g. see https://github.com/qudt/qudt-public-repo/blob/master/schema/SCHEMA_QUDT-v2.1.ttl#L1577 which is used in https://github.com/qudt/qudt-public-repo/blob/master/vocab/unit/VOCAB_QUDT-UNITS-ALL-v2.1.ttl

@dr-shorthair I meant (a) because, unless we use one of the existing fields (in this case altLabel is an option), we need to add a new field to store this information somewhere in the Oracle schema before outputting it into the RDF schema. We will look at options.

dr-shorthair commented 3 years ago

There is now some work going on in OBO to re-boot UO as a cross-walk for existing units vocabs. They are using UCUM codes as the key, so it would be helpful if the UCUM codes were incorporated into P06 so that it could be added to the OBO mappings.

kaiiam commented 3 years ago

Thanks @dr-shorthair! That work is being supported (on my end) as part of BCODMO's revamped data management vocabulary. As we have alignment/mapping with NERC in mind (cc @ashepherd @jaclynsaunders @DanieK), I'd love to see this happen. The new OBO unit vocab effort has already managed to leverage @dr-shorthair's extensive work making UCUM mappings to bridge to QUDT and OM. I'd love to see the same happen with NERC P06.

gwemon commented 3 years ago

Hi @kaiiam, thank you for your enthusiasm and support! This is definitely the plan and the update is ready to go; we are just waiting for a quiet window of time to push it through to production. With regard to alignments between BCODMO and NVS vocabs, we submitted an abstract a year ago for the IMDIS conference that was postponed by 6 months due to COVID (see https://imdis.seadatanet.org/files/IMDIS2021_143_abstract.pdf). I am putting the slides together at the moment. It'd be great to have some examples from you too.

kaiiam commented 3 years ago

@gwemon thanks. Maybe once we have the new unit system ready to show, and we've imported the NERC P06 - UCUM mappings, I could prepare a slide showing the new OBO unit vocab linking off to NERC P06, QUDT, and OM.

gwemon commented 3 years ago

@dr-shorthair We've hit an issue with the proposal to replace P06 alternative labels with the UCUM notation: I was made aware of the fact that thousands of our ODV and netCDF files use the P06 alternative label to refer to the units (a legacy issue). There is therefore a risk in implementing what I was proposing, because this field would change for two thirds of the units held in P06. So, instead, we are proposing to capture the UCUM notation in structured XML in the definition field by adding <skos:notation>UCUMcode</skos:notation>. Would that work? If required, this could be a first step before we update the code that creates the RDF so that it outputs the notation as a separate field. I will use this for the recent P06 terms you submitted, as examples. If it is an acceptable option then I'll update the rest of the P06 definition fields with the UCUM notation.

roy-lowry commented 3 years ago

@gwemon You can't embed XML like that into the definition element of the XML output because it will break the schema. That is why the XML in the structured description fields in bodccodes is translated into JSON by the NVS software.

Have a look at http://vocab.nerc.ac.uk/collection/C19/current/UKMDN025/ You will see that the definition element contains the following JSON:

{"Spatial_Coverage": { "Southernmost_latitude": "53.245121", "Northernmost_latitude": "53.455889", "Westernmost_longitude": "-3.068628", "Easternmost_longitude": "-2.680371" }}

However, if you select description from bodccodes where codval='UKMDN025' you will see that is an XML snippet, not JSON.

When I suggested the solution of using a structured definition element I was envisioning it containing the existing text plus the UCUM code encoded in JSON along the lines of:

{"notes": "existing_text", "ucum": "ucum_code"}

This would be produced by setting bodccodes.description to:

[notes]existing_text[/notes][ucum]ucum_code[/ucum] where '[' is an opening chevron and ']' is a closing chevron: I can't work out how to escape chevrons so they don't upset GitHub.

(if description is currently null then there would be no notes element).

Of course, you could simplify the whole process by encoding bodccodes.description in JSON.
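
As a rough illustration of the translation described above, here is a minimal Python sketch that turns a description snippet with notes and ucum elements into the JSON form. It is not the NVS software: the function name `description_to_json` and the temporary wrapper element are assumptions made only for this example.

```python
import json
import xml.etree.ElementTree as ET


def description_to_json(description):
    """Convert a snippet such as <notes>..</notes><ucum>..</ucum> to a JSON string.

    Hypothetical helper for illustration; not the actual NVS translation code.
    """
    if not description:
        # A null or empty description yields an empty JSON object.
        return json.dumps({})
    # The snippet has no single root element, so wrap it in one before parsing.
    root = ET.fromstring("<wrapper>" + description + "</wrapper>")
    # Each child element becomes a key; its text becomes the value.
    fields = {child.tag: (child.text or "") for child in root}
    return json.dumps(fields)


snippet = "<notes>existing_text</notes><ucum>ucum_code</ucum>"
print(description_to_json(snippet))
```

A description holding only a ucum element would produce a JSON object with no notes key, matching the null-description case mentioned above.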

kaiiam commented 3 years ago

Regarding mappings to the new unit interchange system our goal is to make valid ttl IDs from UCUM strings. E.g.,

image

We certainly don't want to interfere with how the NERC P06 is built or functions. To be compatible with our system, all that would be required is a basic mapping between P06 IRIs and UCUM strings, in whatever format (CSV, JSON, TTL, etc.) we could parse.

Something like:

| NERC IRI | UCUM |
| --- | --- |
| http://vocab.nerc.ac.uk/collection/P06/current/AMPB/ | A |
| http://vocab.nerc.ac.uk/collection/P06/current/BQ11/ | Bq |

In principle this mapping could even just be pulled from @dr-shorthair's P06-ucum.ttl, assuming it's correct, up to date, etc.
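
To make the "basic mapping" idea concrete, here is a small Python sketch that reads such a mapping from CSV into a dictionary. The column names `nerc_iri` and `ucum` and the function `load_mapping` are assumptions for illustration; only the two example rows listed above are used as data.

```python
import csv
import io

# The two example rows from above; a real file would cover every P06 term.
MAPPING_CSV = """nerc_iri,ucum
http://vocab.nerc.ac.uk/collection/P06/current/AMPB/,A
http://vocab.nerc.ac.uk/collection/P06/current/BQ11/,Bq
"""


def load_mapping(text):
    """Return a dict from P06 IRI to UCUM string (column names are assumed)."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["nerc_iri"]: row["ucum"] for row in reader}


mapping = load_mapping(MAPPING_CSV)
print(mapping["http://vocab.nerc.ac.uk/collection/P06/current/AMPB/"])
```

Keying on the full IRI rather than the short code keeps the mapping unambiguous if other NVS collections are added later.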

roy-lowry commented 3 years ago

@gwemon Am I missing something or could the necessary be achieved through the URLMAP_EXT mechanism? To explain to @kaiiam this is an easy to implement mechanism that causes the UCUM code encoded as a URI (I assume that this is possible) to be included as a mapping in the RDF output in the P06 code.

To see what I mean, look at:

http://vocab.nerc.ac.uk/collection/P07/current/EHEBBEHE/

The URL

http://mmisw.org/ont/cf/parameter/sea_water_preformed_salinity

in the output is delivered from the back office through the URLMAP_EXT mechanism.

gwemon commented 3 years ago

@roy-lowry UCUM notations are not available as URIs so we cannot do this.

roy-lowry commented 3 years ago

@gwemon Thought it was too good to be true. So, back to structured definitions....

kaiiam commented 3 years ago

> @roy-lowry UCUM notations are not available as URIs so we cannot do this.

Yes, exactly, hence the whole point of our new units interchange system: to make resolvable IRIs based on the SI and UCUM. Some UCUM strings, e.g. {#}.g-1, can't just be cast to an IRI ID and be valid ttl syntax, so we're working on workarounds for this.
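
To show why a string like {#}.g-1 is awkward, here is a Python sketch of one possible workaround, percent-encoding, using only the standard library. This is not necessarily the approach the units interchange system will take; the base IRI `https://example.org/unit/` and both function names are hypothetical.

```python
from urllib.parse import quote, unquote

# Hypothetical base IRI, used only for this example.
BASE = "https://example.org/unit/"


def ucum_to_iri(code):
    """Percent-encode every reserved character in a UCUM string and append it to BASE."""
    return BASE + quote(code, safe="")


def iri_to_ucum(iri):
    """Recover the original UCUM string from an IRI built by ucum_to_iri."""
    return unquote(iri[len(BASE):])


iri = ucum_to_iri("{#}.g-1")
print(iri)               # https://example.org/unit/%7B%23%7D.g-1
print(iri_to_ucum(iri))  # {#}.g-1
```

The encoding is reversible, so the original UCUM code can always be recovered from the IRI, at the cost of less readable identifiers.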

For now I'll parse @dr-shorthair's P06-ucum.ttl file (like I did with his work on QUDT and OM) so that the new system can have an initial mapping to P06. Later, once this is resolved, I can update the mapping to pull from something the NERC team is producing and/or updating.

> So, back to structured definitions....

Another goal of our system is to auto-produce labels, definitions and mappings based on the input UCUM/SI codes e.g. given the input code mA the script creates the following ttl:

image

Although BCODMO is supporting this work (thanks again @ashepherd) the idea is to make it general enough for anyone to use and it will be under something like a CC0 license (anyone can use for any purpose). Happy to have this be a collaboration point with NERC if there is interest. We've still got lots to do with this new unit system.

gwemon commented 3 years ago

Thanks for the comments @kaiiam @roy-lowry. @dr-shorthair, do you have a dedicated datatype for UCUM units?

dr-shorthair commented 3 years ago

qudt:UCUMcs
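
For illustration, a P06 concept carrying a UCUM code as a typed skos:notation could be serialised roughly as in the Python sketch below. The helper `notation_triple` is hypothetical, and the AMPB to "A" pairing is taken from the example mapping earlier in this thread; how NVS would actually emit this is still open.

```python
# Namespace IRIs for SKOS and the QUDT schema.
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix qudt: <http://qudt.org/schema/qudt/> .\n"
)


def notation_triple(concept_iri, ucum_code):
    """Return one Turtle statement giving a concept a UCUM-typed skos:notation."""
    # Escape backslashes and double quotes so the literal stays valid Turtle.
    escaped = ucum_code.replace("\\", "\\\\").replace('"', '\\"')
    return '<{}> skos:notation "{}"^^qudt:UCUMcs .'.format(concept_iri, escaped)


print(PREFIXES)
print(notation_triple("http://vocab.nerc.ac.uk/collection/P06/current/AMPB/", "A"))
```

Because the literal has its own datatype, it can sit alongside the existing SDN skos:notation and still be told apart by consumers.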

kaiiam commented 3 years ago

I've implemented the mapping from @dr-shorthair's P06-ucum.ttl in the beta units interchange system.

image

gwemon commented 3 years ago

Great. Thank you @kaiiam. @alko-k and I are looking at our options from the NVS side.

kaiiam commented 3 years ago

Great @gwemon, no pressure from us. I just wanted to make sure to leverage @dr-shorthair's extensive mapping work so that the new system actually serves as a mapping between as many unit systems as possible, including NERC P06.