microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Create plan for dealing with units of measure within NMDC schema #1032

Closed aclum closed 1 year ago

aclum commented 1 year ago

Creating a ticket for this per the metadata meeting last week

Currently the range for has_unit

We want has_unit to have constraints or a controlled vocabulary that is programmatically enforced, so that it is populated consistently.

For example, has_unit on slot 'depth' currently has the following values: meter, metre, meters, null. The UM ontology was suggested, although this may not be expansive enough. @turbomam mentioned some built-in options from LinkML.
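To make the scope of the problem concrete, here is a rough sketch of how the distinct has_unit spellings per slot could be audited. The record layout and the function names are assumptions made up for illustration, not an actual NMDC export format or API.

```python
from collections import defaultdict

# Hypothetical records: the keys mirror the names used in this thread,
# not a real NMDC export format.
records = [
    {"slot": "depth", "has_unit": "meter"},
    {"slot": "depth", "has_unit": "metre"},
    {"slot": "depth", "has_unit": "meters"},
    {"slot": "depth", "has_unit": None},
]


def unit_variants(records):
    """Collect the distinct has_unit values seen for each slot."""
    variants = defaultdict(set)
    for record in records:
        variants[record["slot"]].add(record["has_unit"])
    return dict(variants)


def inconsistent_slots(records):
    """Return the slots that carry more than one distinct unit value.

    A missing unit (None) is counted as its own variant.
    """
    return {
        slot: units
        for slot, units in unit_variants(records).items()
        if len(units) > 1
    }
```

Running `inconsistent_slots(records)` on the sample above flags 'depth' with all four variants, which is the kind of report that would tell us which slots need cleanup first.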

@mslarae13 @cmungall @turbomam

aclum commented 1 year ago

There is guidance from the GSC on this, per the MIxS v6 Excel doc: "Units - Except a few cases, strict units are not defined for items in the MIGS/MIMS/MIENS checklists, wherever applicable the unit of choice should accompany the value of an item. The units should be in accordance with the International System of Units (SI)."

turbomam commented 1 year ago

I am really opposed to taking a 'please see' approach to this, but your research is a good starting point, @aclum.

There are lots of historical solutions for this situation, as well as some LinkML specific solutions:

  1. Saving separate value and unit slots for all of the MIxS quantity/measurement terms. Lots more slots, totally decoupled, etc. but with really fine grained control.
  2. Just storing a pre-composed string that includes the value and the unit, like MIxS illustrates. We could use validation patterns, even including a list of acceptable units, but that would be hard to maintain consistently.
  3. Forcing all MIxS quantity/measurement terms to take a numeric value only and asserting in the name, title or description what the one expected unit is. We already do that to some degree in the submission-schema.
  4. Storing the value and unit into an inlined class, like NMDC's QuantityValues. That's what we do for most of these MIxS terms in the nmdc-schema. We could also put a pattern constraint on the unit slot. We could also make subclasses like 'MassQuantityValue' and 'VolumeQuantityValue'. Note that @cmungall generally discourages proliferation of classes.
  5. Using the LinkML mechanism of associating a formally defined/referenced unit with a slot. The unit slot links a SlotDefinition to a single UnitOfMeasure, which has a variety of slots to reference internal and external definitions.
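As an illustration of option 4 with a constraint on the unit, here is a minimal sketch in plain Python rather than LinkML. The class and field names echo NMDC's QuantityValue but are assumptions here, and the allowed-unit list is a made-up example, not an agreed vocabulary.

```python
import re
from dataclasses import dataclass

# Hypothetical allowlist, used only for this sketch.
ALLOWED_UNITS = ("m", "cm", "g", "L")

# Build a pattern that matches exactly one of the allowed unit strings.
UNIT_PATTERN = re.compile("|".join(re.escape(unit) for unit in ALLOWED_UNITS))


@dataclass(frozen=True)
class QuantityValue:
    """Inlined value-plus-unit pair whose unit is checked on construction."""

    has_numeric_value: float
    has_unit: str

    def __post_init__(self):
        # Reject missing units and anything outside the allowlist.
        if not isinstance(self.has_unit, str) or not UNIT_PATTERN.fullmatch(self.has_unit):
            raise ValueError(f"unit {self.has_unit!r} is not in the allowed list")
```

With this in place, `QuantityValue(1.5, "m")` is accepted while `QuantityValue(1.5, "meters")` raises a `ValueError`. The maintenance cost mentioned under option 2 applies here as well, since someone still has to curate the list.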

I will be pushing for GSC to use the last solution, so I would be very happy to see us commit to it.

aclum commented 1 year ago

Based on @mslarae13's input, there are some slots where we aren't able to define a single unit of measure. If memory serves, some of the same slots would have a different unit of measurement for a solid vs. a liquid sample, for example. Montana, is that correct?

cmungall commented 1 year ago

Let's separate these two issues:

  1. What is our standard
  2. How should we implement

For 1, ideally we would just refer to MIxS. However, the MIxS guidance is unclear and underspecified. For example, as per the guidance that @aclum quotes, the units should be "in accordance" with SI. But what does this mean?

I think most people would understand that using SI would mean using symbols, e.g. m. However, MIxS implicitly favors spelled out names not symbols, like meter. Using names rather than symbols is a bad idea due to the different forms. Formally, SI uses metre as the name, but the NIST page uses meter since that is US-preferred. None of this would be a problem if symbols were used rather than names, but for reasons MIxS uses the names.

Regardless of names vs symbols, there are many ambiguities with derived units.

MIxS is also very ambiguous when it comes to pluralization. We would hope that singular forms are mandated to avoid further confusion, but this isn't the case.

We can see examples like:

There is also no guidance on how to do non-number of cells per gram

There is a standard that solves all of these issues, UCUM https://ucum.org/. UCUM is the standard used in all health related data models and standards that I am aware of. UCUM provides a completely unambiguous system, and as far as I am aware every unit that could possibly be required in MIxS could be represented in UCUM. It's very easy to use.

For example, micrograms per cubic meter is ug/m3.

UCUM provides standard validators, and a completely computable system.
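A sketch of what a migration from spelled-out names to UCUM codes might look like. This is a hand-written lookup, not a UCUM validator. The table and the `to_ucum` helper are hypothetical, and only a handful of units are covered.

```python
# Hypothetical lookup from free-text unit names to UCUM codes.
# Only a few entries are shown; a real table would need curation.
NAME_TO_UCUM = {
    "meter": "m",
    "metre": "m",
    "meters": "m",
    "metres": "m",
    "gram": "g",
    "grams": "g",
    "micrograms per cubic meter": "ug/m3",
}


def to_ucum(raw_unit):
    """Map a free-text unit name to a UCUM code.

    Returns None when the unit is missing or not in the lookup table,
    so that unrecognized values can be reviewed by hand.
    """
    if raw_unit is None:
        return None
    # Lowercase and collapse whitespace before the lookup.
    key = " ".join(raw_unit.lower().split())
    return NAME_TO_UCUM.get(key)
```

Here `to_ucum("Metre")` and `to_ucum("meters")` both give `"m"`, and anything unknown comes back as `None` rather than being guessed at.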

There is a proposal for MIxS to adopt UCUM here:

My preference is that NMDC mandates UCUM, and we lobby for MIxS to follow suit. Strictly speaking we will be using different units than many of the "preferred units" in MIxS, but this is no more inconsistent than anything else.

aclum commented 1 year ago

This sounds good to me. As long as what we are doing is clear, other groups can interoperate as needed.

aclum commented 1 year ago

@cmungall, will you be at the metadata meeting on Wednesday?

turbomam commented 1 year ago

see also

which has the following slot:

aclum commented 1 year ago

I believe we agreed at the metadata meeting last week to implement UCUM so I will close this ticket and open a new one for implementation.