microbiomedata / sheets_and_friends

Enhance a LinkML model with imported and optionally modified slots

Update water content validation #143

Closed mslarae13 closed 1 year ago

mslarae13 commented 1 year ago

Water content and water content method are 2 MIxS fields used in the NMDC submission template.

Currently, MIxS says water content method

water content

Water content (in soils and sediment) can be measured in a variety of ways, hence the water content method fields. You can use

All are slightly different formatting.

How do we validate this?

mslarae13 commented 1 year ago

@turbomam FYI

turbomam commented 1 year ago

Thanks for the examples.

MIxS specifies that values for water_content (aka "water content", aka vMIXS:0000185) should be measurement values. I don't think measurement value is actually defined in the MIxS Sheets, but @cmungall's team has equated them with NMDC's quantity values, which have the following sub-attributes

DataHarmonizer takes input that's flattened, not structured, so we have translated the MIxS Value syntax of {float} {unit} into a requirement for

  1. a floating point number
  2. followed by exactly one whitespace
  3. followed by a unit string that doesn't include any whitespaces

Informally speaking

That allows us to parse the flattened string from DataHarmonizer into the quantity value structure described above, which should make searching (and possibly even unit conversion) more fruitful.
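As a rough illustration of the three-part rule above, here is a minimal sketch of a strict check that also splits a flattened string into a value and a unit. The regular expression, the function name `parse_strict`, and the dictionary keys are all hypothetical, chosen for illustration; they are not the pattern or attribute names actually used in the schema, and the number pattern deliberately ignores exponents.

```python
import re

# Hypothetical pattern: a plain decimal number (no exponent support),
# exactly one space, then a unit string with no whitespace in it.
NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)"
STRICT = re.compile(rf"(?P<value>{NUMBER}) (?P<unit>\S+)")


def parse_strict(raw):
    """Return a dict with value and unit, or None if the string is rejected."""
    match = STRICT.fullmatch(raw)
    if match is None:
        return None
    # The key names below are illustrative only.
    return {
        "raw_value": raw,
        "numeric_value": float(match.group("value")),
        "unit": match.group("unit"),
    }


for candidate in ["75 %", "5 g/g", "75%", "5 cc per cc"]:
    print(repr(candidate), "->", parse_strict(candidate))
```

Under this sketch, "75 %" and "5 g/g" are accepted, while "75%" (no space) and "5 cc per cc" (whitespace inside the unit) are rejected.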

But it's not compatible with values that you and other scientists use!

I think you are suggesting that we turn all validation off, allowing any string. That would be a quick fix, but it would lead to worse search and unit conversion results.

I would prefer to globally revise quantity value to allow whitespace in the unit portion and/or even allow zero or more whitespaces between the value and the unit. Do you think allowing that flexibility would have a bad impact on any of the other fields/columns/slots?
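To make the proposed relaxation concrete, here is a minimal sketch of a looser check, assuming a unit is still required and that it may not begin with a digit or a period (so a bare number such as ".75" is not split into a number and a fake unit). The pattern and the name `parse_relaxed` are hypothetical, not the validation that was eventually adopted, and exponent notation is not handled.

```python
import re

# Hypothetical relaxed pattern: a plain decimal number, zero or more
# whitespace characters, then a unit that may contain internal whitespace.
NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)"
RELAXED = re.compile(rf"(?P<value>{NUMBER})\s*(?P<unit>[^\s\d.](?:.*\S)?)")


def parse_relaxed(raw):
    """Return (numeric value, unit string), or None if the string is rejected."""
    match = RELAXED.fullmatch(raw.strip())
    if match is None:
        return None
    return float(match.group("value")), match.group("unit")


for candidate in ["75%", "75 %", "5 g water / g dry soil", "60% WFPS", ".75"]:
    print(repr(candidate), "->", parse_relaxed(candidate))
```

This only checks the shape of the string. It accepts every example in this thread except the unitless ".75", and it says nothing about whether the unit text is a meaningful unit.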

Here's what your examples (plus one of my own) get parsed into if we send them directly to the quantulum3 parser without any additional validation. Most but not all of them can be parsed faithfully into values and units.

from quantulum3 import parser

examples = [
    "75%",
    "75 %",
    ".75",
    "5 g water / g dry soil",
    "5 cc per cc",
    "5 cc/cc",
    ".75",
    "75% water",
    ".75 g water per g soil WHC",
    "60% WFPS",
    "5 g/g",
]

# Parse each example with quantulum3 and print the quantities it extracts.
for ex in examples:
    ex_parsed = parser.parse(ex)
    print(f'"{ex}" is parsed into {ex_parsed}')

PS depth is a quantity value, too. That's why we can retire depth2 now, after converting the current MongoDB contents. I think we may already have some emails or GH issues on that, and I will follow up there.

turbomam commented 1 year ago

Would also address https://github.com/microbiomedata/sheets_and_friends/issues/140 from @pvangay

mslarae13 commented 1 year ago

@turbomam I'm good with that proposed solution, as long as it will validate. We don't need to make it an open string. @cmungall do you have an opinion?

pvangay commented 1 year ago

@turbomam catching up on this issue. I think your proposed solution here makes a ton of sense.

allow whitespace in the unit portion and/or even allow zero or more whitespaces between the value and the unit

Am I reading this correctly that these two would incorrectly parse? Is there a way to address this?

"5 g water / g dry soil" is parsed into [Quantity(5, "Unit(name="gram", entity=Entity("mass"), uri=Gram)")]

".75 g water per g soil WHC" is parsed into [Quantity(0.75, "Unit(name="gram", entity=Entity("mass"), uri=Gram)")]

Agree that this would definitely also fix https://github.com/microbiomedata/sheets_and_friends/issues/140

turbomam commented 1 year ago

I'll make a regexr test page containing my proposed validation and you can try some values that you think should pass and values that you think shouldn't pass. Even better, you could make a list of three or four of each in advance.

The two parsing results you provided are the real output from the value/unit parser we use, quantulum3. Getting those compound units to parse out would require us to write our own custom NMDC value/unit parser, or to retrain the quantulum3 parser.

Note that unit parsing and value/unit validation are two different things.

ssarrafan commented 1 year ago

Based on discussion at Infrastructure sync meeting, adding to the August sprint

mslarae13 commented 1 year ago

Will update the submission portal schema to allow validation to pass. However, the chosen solution makes it difficult to parse the results and will need to be revisited. Marking this as the interim fix.

See https://github.com/microbiomedata/sheets_and_friends/issues/148 for next step in correcting this.

ssarrafan commented 1 year ago

@mslarae13 is the interim fix done? Can this issue be closed?

mslarae13 commented 1 year ago

yes. water content validates now