Closed mslarae13 closed 1 year ago
@turbomam FYI
Thanks for the examples.
MIxS specifies that values for water_content (aka "water content", aka vMIXS:0000185) should be measurement values. I don't think measurement value is actually defined in the MIxS Sheets, but @cmungall's team has equated them with NMDC's quantity values, which have the following sub-attributes:
quantity value➞has unit 0..1 Description: The unit of the quantity Range: Unit
quantity value➞has numeric value 0..1 Description: The number part of the quantity Range: Double
has minimum numeric value 0..1 Description: The minimum value part, expressed as number, of the quantity value when the value covers a range. Range: Float
has maximum numeric value 0..1 Description: The maximum value part, expressed as number, of the quantity value when the value covers a range. Range: Float
quantity value➞has raw value 0..1 Description: Unnormalized atomic string representation, should in syntax {number} {unit} Range: String
DataHarmonizer takes input that's flattened, not structured, so we have translated the MIxS Value syntax of {float} {unit} into a requirement for
Informally speaking
That allows us to parse the flattened string from DataHarmonizer into the quantity value structure described above, which should make searching (and possibly even unit conversion) more fruitful.
But it's not compatible with values that you and other scientists use!
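To illustrate the incompatibility, here is a minimal sketch of how a strict {number} {unit} string could be mapped onto the quantity value sub-attributes. The pattern `STRICT_QUANTITY` and the helper `to_quantity_value` are hypothetical stand-ins, not the actual NMDC or DataHarmonizer validation; the sketch assumes exactly one whitespace character between the number and a unit that contains no whitespace.

```python
import re
from typing import Optional

# Hypothetical strict pattern (an assumption, not the real NMDC regex):
# a number, exactly one whitespace character, then a unit with no whitespace.
STRICT_QUANTITY = re.compile(
    r"(?P<number>[-+]?(?:\d+\.?\d*|\.\d+))\s(?P<unit>\S+)"
)


def to_quantity_value(raw: str) -> Optional[dict]:
    """Map a flattened string onto quantity value sub-attributes, or return None."""
    match = STRICT_QUANTITY.fullmatch(raw)
    if match is None:
        return None
    return {
        "has_raw_value": raw,
        "has_numeric_value": float(match.group("number")),
        "has_unit": match.group("unit"),
    }


for example in ["5 g/g", "75 %", "75%", "5 g water / g dry soil"]:
    print(f'"{example}" -> {to_quantity_value(example)}')
```

Under these assumptions "5 g/g" and "75 %" are accepted, while "75%" and "5 g water / g dry soil" are rejected, even though they are reasonable ways to report water content.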
I think you are suggesting that we turn all validation off, allowing any string. That would be a quick fix, but it would lead to worse search and unit conversion results.
I would prefer to globally revise quantity value to allow whitespace in the unit portion and/or even allow zero or more whitespaces between the value and the unit. Do you think allowing that flexibility would have a bad impact on any of the other fields/columns/slots?
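As a rough sketch of what that relaxed validation might look like: the pattern below allows zero or more whitespace characters between the value and the unit, and allows whitespace inside the unit. `RELAXED_QUANTITY` and `split_quantity` are hypothetical names, and requiring a unit that does not start with a digit or a period is an assumption of this sketch, not a decided rule.

```python
import re
from typing import Optional, Tuple

# Hypothetical relaxed pattern (an assumption, not the adopted NMDC regex):
# number, optional whitespace, then a unit that may contain internal spaces.
# The unit must not start with a digit or "." so it cannot absorb part of the number.
RELAXED_QUANTITY = re.compile(
    r"(?P<number>[-+]?(?:\d+\.?\d*|\.\d+))\s*(?P<unit>[^\s\d.](?:.*\S)?)"
)


def split_quantity(raw: str) -> Optional[Tuple[float, str]]:
    """Return (numeric value, unit string) if raw validates, else None."""
    match = RELAXED_QUANTITY.fullmatch(raw)
    if match is None:
        return None
    return float(match.group("number")), match.group("unit")


for example in ["75%", "5 g water / g dry soil", "60% WFPS", ".75"]:
    print(f'"{example}" -> {split_quantity(example)}')
```

This only splits the string into a number and an opaque unit string. It does not interpret compound units, and a bare number such as ".75" still fails because no unit is present.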
Here's what your examples (plus one of my own) get parsed into if we send them directly to the quantulum3 parser without any additional validation. Most but not all of them can be parsed faithfully into values and units.
from quantulum3 import parser

# Water content strings as submitters write them
examples = [
    "75%",
    "75 %",
    ".75",
    "5 g water / g dry soil",
    "5 cc per cc",
    "5 cc/cc",
    ".75",
    "75% water",
    ".75 g water per g soil WHC",
    "60% WFPS",
    "5 g/g",
]

# Parse each string with no prior validation and show the result
for ex in examples:
    ex_parsed = parser.parse(ex)
    print(f'"{ex}" is parsed into {ex_parsed}')
PS depth is a quantity value, too. That's why we can retire depth2 now, after converting the current MongoDB contents. I think we may already have some emails or GH issues on that, and I will follow up there.
Would also address https://github.com/microbiomedata/sheets_and_friends/issues/140 from @pvangay
@turbomam I'm good with that proposed solution. As long as it will validate. We don't need to make it open string. @cmungall do you have an opinion?
@turbomam catching up on this issue. I think your proposed solution here makes a ton of sense.
allow whitespace in the unit portion and/or even allow zero or more whitespaces between the value and the unit
Am I reading this correctly that these two would incorrectly parse? Is there a way to address this?
"5 g water / g dry soil" is parsed into [Quantity(5, "Unit(name="gram", entity=Entity("mass"), uri=Gram)")]
".75 g water per g soil WHC" is parsed into [Quantity(0.75, "Unit(name="gram", entity=Entity("mass"), uri=Gram)")]
Agree that this would definitely also fix https://github.com/microbiomedata/sheets_and_friends/issues/140
I'll make a regexr test page containing my proposed validation and you can try some values that you think should pass and values that you think shouldn't pass. Even better, you could make a list of three or four of each in advance.
The two parsing results you provided are the real output from the value/unit parser we use, quantulum3. Getting those compound units to parse out would require us to write our own custom NMDC value/unit parser, or to retrain the quantulum3 parser.
Note that unit parsing and value/unit validation are two different things.
Based on discussion at Infrastructure sync meeting, adding to the August sprint
Will update the submission portal schema to allow validation to pass. However, the chosen solution makes it difficult to parse the results and will need to be revisited. Marking this as the interim fix.
See https://github.com/microbiomedata/sheets_and_friends/issues/148 for next step in correcting this.
@mslarae13 is the interim fix done? Can this issue be closed?
yes. water content validates now
Water content and water content method are 2 MIxS fields used in the NMDC submission template.
Currently, MIxS says water content method
water content
Water content (in soils and sediment) can be measured in a variety of ways, hence the water content method fields. You can use
All are formatted slightly differently.
How do we validate this?