Closed aclum closed 5 months ago
I agree. See comment in https://github.com/microbiomedata/nmdc-schema/issues/1108#issuecomment-1728584547
We are missing Bulk Soil, which is the `specific_ecosystem` that Hugh wants to list the NEON samples as.
@turbomam do you have time to update the GOLD pathway enumerations? Current values in GOLD can be found here: https://gold.jgi.doe.gov/ecosystem_classification
Where can I find a textual representation of the GOLD pathway elements?
Maybe here? GOLD's 5-Level Ecosystem Classification Paths Excel Last generated: 11 Jan, 2024
Clicking the link downloaded this file: GOLDs5levelEcosystemClassificationPaths.xlsx
This should be noted in the schema
Are we adding all values from all five categories into the enums? Here's a list of all five, ranked by the number of paths they appear in. I could report it some other way if you want.
Deleting long list for now. Will post somewhere else soon.
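For anyone who wants to reproduce that kind of ranking, here is a minimal sketch. It assumes the spreadsheet has already been read into 5-tuples, one per path; the rows and the level names below are illustrative placeholders, not real GOLD data.

```python
from collections import Counter

# Assumed level names, one per column of a GOLD path (hypothetical labels).
LEVELS = (
    "ecosystem",
    "ecosystem_category",
    "ecosystem_type",
    "ecosystem_subtype",
    "specific_ecosystem",
)

# Illustrative rows only; real rows would come from the GOLD paths spreadsheet.
PATHS = [
    ("Environmental", "Terrestrial", "Soil", "Unclassified", "Bulk soil"),
    ("Environmental", "Terrestrial", "Soil", "Wetlands", "Peat"),
    ("Environmental", "Aquatic", "Freshwater", "River", "Sediment"),
]


def rank_values(paths):
    """For each level, list (value, path_count) pairs, most frequent first."""
    counts = {level: Counter() for level in LEVELS}
    for path in paths:
        for level, value in zip(LEVELS, path):
            counts[level][value] += 1
    return {level: counter.most_common() for level, counter in counts.items()}


if __name__ == "__main__":
    for level, ranked in rank_values(PATHS).items():
        print(level, ranked)
```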
@turbomam yes please
Here's the definition of SpecificEcosystemEnum in the compiled submission schema
And the other four enums, which are contiguous at this point in time.
An example value for EcosystemSubtypeEnum is Floodplain and is currently modeled in this style
Floodplain:
  text: Floodplain
  description: placeholder PV descr
Floodplain doesn't appear anywhere in the nmdc-schema
I think these enumerations originate in https://github.com/microbiomedata/submission-schema/blob/main/schemasheets/tsv_in/enums.tsv which has been hand-curated up until now.
@pkalita-lbl can you please help me think about the GOLD path enum lifecycle?
schemasheets/tsv_in/enums.tsv has the following columns:
For all practical purposes, we're just asserting the enum name and the permissible value name in DH pulldown column and DH pulldown option. I have been asserting 'placeholder PV descr' as the description for some black-magic reason that I can't remember.
assets/GOLDs5levelEcosystemClassificationPaths.xlsx:
	curl -o $@ 'https://gold.jgi.doe.gov/download?mode=ecosystempaths'
GOLD's source file calls the path elements
We are calling the enums
I started working on this out of nmdc-schema. We can move this later if it does what you want.
Right, submission-schema has an enum for each of the GOLD pathway levels. They are definitely not complete. Like, the first three levels only allow one permissible value each (EcosystemEnum, EcosystemCategoryEnum, EcosystemTypeEnum). The other two offer more options, but again definitely not complete (EcosystemSubtypeEnum, SpecificEcosystemEnum). I assume the incompleteness was done on purpose because there sure are a lot of options, but that decision predates my time on this project.
There is also custom code in the submission portal that alters the behavior of those five columns so that you only get suggestions for valid paths. The logic is driven in part by this file: https://gold.jgi.doe.gov/download?mode=biosampleEcosystemsJson (we bake a copy into the submission portal code; we don't constantly re-fetch it). So for example, when you go to fill in the specific_ecosystem column, the options that get presented to you are determined by what the GOLD JSON file says are valid values based on the values in the 4 previous columns, and then we subset that by what's permissible according to SpecificEcosystemEnum.
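To make the dropdown behavior above concrete, here is a rough sketch of the two-step filter. It is not the portal's actual code: it assumes the valid paths have been flattened into 5-tuples (the real GOLD JSON layout may differ), and `dropdown_options`, the sample rows, and the sample enum values are all hypothetical.

```python
# Illustrative rows only; the real data would come from the GOLD JSON file.
PATHS = [
    ("Environmental", "Terrestrial", "Soil", "Unclassified", "Bulk soil"),
    ("Environmental", "Terrestrial", "Soil", "Wetlands", "Peat"),
    ("Environmental", "Terrestrial", "Soil", "Wetlands", "Sediment"),
    ("Environmental", "Aquatic", "Freshwater", "River", "Sediment"),
]


def dropdown_options(paths, selected, permissible):
    """Options for the next column, given the values chosen so far.

    paths:       iterable of 5-tuples describing valid paths
    selected:    values already entered in the previous columns (0 to 4 items)
    permissible: values allowed by the schema enum for the next column
    """
    selected = tuple(selected)
    depth = len(selected)
    # Step 1: values that continue a valid path from the current selection.
    valid_next = {p[depth] for p in paths if tuple(p[:depth]) == selected}
    # Step 2: keep only what the schema enum permits.
    return sorted(valid_next & set(permissible))


if __name__ == "__main__":
    chosen = ("Environmental", "Terrestrial", "Soil", "Wetlands")
    # Hypothetical subset of permissible values for the fifth column.
    print(dropdown_options(PATHS, chosen, {"Peat", "Bulk soil"}))
```

With a string range instead of an enum, step 2 would simply go away and the options would be whatever the GOLD file allows.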
I see two options going forward:

1. Remove the enums from submission-schema and make the range of the 5 slots string (mimicking what we do in nmdc-schema). Then the logic for the dropdowns in the submission portal would only be driven by the GOLD JSON file. That makes updating to get the latest GOLD terms easy; it's just that one file. The potential downside is that you lose the ability to exclude GOLD terms that we deem irrelevant.
2. Generate the enums with a script in submission-schema. The update process would then be: run that script and commit the changes to submission-schema, update the GOLD JSON file in nmdc-server. Depending on how sophisticated we make that script, we could potentially exclude certain GOLD paths if that's desired.

Also, a long time ago I tried generating a LinkML schema that encoded the valid pathways as rules (code here https://github.com/pkalita-lbl/gold-ecosystems-linkml). The result was so unwieldy that it was basically unusable. So no one suggest doing that!
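As a rough illustration of what a generator script for option 2 might look like, the sketch below emits one enum in the same text/description style shown earlier in this thread. The function name `enum_yaml`, the `exclude` parameter, and the sample values are hypothetical, and it assumes every value is a plain string that needs no YAML quoting.

```python
def enum_yaml(enum_name, values, exclude=(), description="placeholder PV descr"):
    """Render one LinkML-style enum as YAML text.

    values:  permissible value names gathered from the GOLD paths
    exclude: values we deliberately leave out (hypothetical curation hook)
    Assumes values are plain strings that are safe as unquoted YAML keys.
    """
    kept = sorted(set(values) - set(exclude))
    lines = [f"{enum_name}:", "  permissible_values:"]
    for value in kept:
        lines.append(f"    {value}:")
        lines.append(f"      text: {value}")
        lines.append(f"      description: {description}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative values only, not a real GOLD export.
    print(enum_yaml("EcosystemSubtypeEnum", ["Floodplain", "Wetlands"], exclude=["Wetlands"]))
```

Keeping the exclusions as an explicit argument is one way to preserve the ability to drop GOLD terms we deem irrelevant while still regenerating everything else from the source file.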
Thanks @pkalita-lbl !
I have implemented at least half of option 2. from above as
I don't mind if you decide to go with option 1. instead
@pkalita-lbl (or anyone): Are subsets of the GOLD paths being created for the different environmental contexts like soil or water?
> Are subsets of the GOLD paths being created for the different environmental contexts like soil or water?
I'm not sure but I think that's another thing that will influence how we implement the long-term process for keeping us in sync with GOLD. So I'm not sure we're ready to jump into implementing anything quite yet.
@pkalita-lbl
> They are definitely not complete. Like, the first three levels only allow one permissible value each (EcosystemEnum, EcosystemCategoryEnum, EcosystemTypeEnum).
We did intentionally limit this. That said, how it's limited will vary from sample type to sample type (environmental extension to extension)
> The other two offer more options, but again definitely not complete (EcosystemSubtypeEnum, SpecificEcosystemEnum).
The missing 'lower level' ecosystem terms are cuz GOLD updated and we didn't get the updates.
> So for example, when you go to fill in the specific_ecosystem column the options that get presented to you are determined by what the GOLD JSON file says are valid values based on the values in the 4 previous columns and then we subset that by what's permissible according to SpecificEcosystemEnum.
Yes, we don't want to lose this because it should build the same way the GOLD ecosystem tree does: https://gold.jgi.doe.gov/ecosystemtree
> Are subsets of the GOLD paths being created for the different environmental contexts like soil or water?
@turbomam pretty sure that's a yes. But we haven't done it. It's really just identifying where in the tree we would limit. So, for water, in https://gold.jgi.doe.gov/ecosystemtree it would be Environmental > Aquatic (then the other 3 are any sub of that). @aclum please confirm.
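A small sketch of that idea, limiting by a point in the tree: given a prefix such as Environmental > Aquatic, collect the values allowed at each level underneath it. The rows, level names, and the `allowed_values` helper are hypothetical and only meant to show the shape of the subsetting.

```python
# Assumed level names, one per column of a GOLD path (hypothetical labels).
LEVELS = (
    "ecosystem",
    "ecosystem_category",
    "ecosystem_type",
    "ecosystem_subtype",
    "specific_ecosystem",
)

# Illustrative rows only, not real GOLD data.
PATHS = [
    ("Environmental", "Terrestrial", "Soil", "Wetlands", "Peat"),
    ("Environmental", "Aquatic", "Freshwater", "River", "Sediment"),
    ("Environmental", "Aquatic", "Marine", "Coastal", "Sediment"),
]


def allowed_values(paths, prefix):
    """Per-level permissible values for all paths that start with `prefix`."""
    prefix = tuple(prefix)
    matching = [p for p in paths if tuple(p[:len(prefix)]) == prefix]
    return {
        level: sorted({p[i] for p in matching})
        for i, level in enumerate(LEVELS)
    }


if __name__ == "__main__":
    # Example: the subset that a water extension might use.
    print(allowed_values(PATHS, ("Environmental", "Aquatic")))
```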
I'm not sure when this was last updated, but GOLD's last release of ecosystem pathways was in Sept 2023. I noticed this because the value Peat for the column specific ecosystem does not validate, and I confirmed it is not listed in the enumeration SpecificEcosystemEnum. We should
@turbomam @pkalita-lbl @mslarae13 @shreddd