microbiomedata / submission-schema

https://microbiomedata.github.io/submission-schema/
MIT License

GOLD ecosystem pathway enumerations are out of date #154

Closed aclum closed 5 months ago

aclum commented 9 months ago

I'm not sure when this was last updated, but GOLD's last release of ecosystem pathways was in Sept 2023. I noticed this because the value Peat in the specific ecosystem column does not validate, and I confirmed that it is not listed in the SpecificEcosystemEnum enumeration.

We should

@turbomam @pkalita-lbl @mslarae13 @shreddd

mslarae13 commented 9 months ago

I agree. See comment in https://github.com/microbiomedata/nmdc-schema/issues/1108#issuecomment-1728584547

mslarae13 commented 5 months ago

We are missing Bulk Soil, which is the `specific_ecosystem` that Hugh wants to list the NEON samples as.

aclum commented 5 months ago

@turbomam do you have time to update the GOLD pathway enumerations? The current values in GOLD can be found here: https://gold.jgi.doe.gov/ecosystem_classification

turbomam commented 5 months ago

Where can I find a textual representation of the GOLD pathway elements?

turbomam commented 5 months ago

Maybe here? GOLD's 5-Level Ecosystem Classification Paths Excel Last generated: 11 Jan, 2024

Clicking the link downloaded this file: GOLDs5levelEcosystemClassificationPaths.xlsx

This should be noted in the schema

turbomam commented 5 months ago

Are we adding all values from all five categories into the enums? Here's a list of all five, ranked by the number of paths they appear in. I could report it some other way if you want.

Deleting long list for now. Will post somewhere else soon.
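
For reference, a ranking like the one described above could be produced with a few lines of Python. This is only a sketch: the rows below are illustrative placeholders rather than data copied from GOLD, and the level names and the `rank_terms` helper are assumptions made for this example.

```python
from collections import Counter

# Assumed names for the five path levels, in order.
LEVELS = (
    "ecosystem",
    "ecosystem_category",
    "ecosystem_type",
    "ecosystem_subtype",
    "specific_ecosystem",
)

# Illustrative placeholder rows, NOT copied from the GOLD spreadsheet.
paths = [
    ("Environmental", "Terrestrial", "Soil", "Wetlands", "Peat"),
    ("Environmental", "Terrestrial", "Soil", "Unclassified", "Bulk soil"),
    ("Environmental", "Aquatic", "Freshwater", "Lake", "Sediment"),
]


def rank_terms(paths):
    """Count, per level, how many paths each term appears in (most frequent first)."""
    counts = {level: Counter() for level in LEVELS}
    for path in paths:
        for level, term in zip(LEVELS, path):
            counts[level][term] += 1
    return {level: counter.most_common() for level, counter in counts.items()}


ranked = rank_terms(paths)
for level, terms in ranked.items():
    print(level, terms)
```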

aclum commented 5 months ago

@turbomam yes please

turbomam commented 5 months ago

How are the GOLD path elements modeled in the nmdc-schema and the submission schema?

Here's the definition of SpecificEcosystemEnum in the compiled submission schema

And the other four enums, which are contiguous at this point in time.

An example value for EcosystemSubtypeEnum is Floodplain and is currently modeled in this style

      Floodplain:
        text: Floodplain
        description: placeholder PV descr

Floodplain doesn't appear anywhere in the nmdc-schema

I think these enumerations originate in https://github.com/microbiomedata/submission-schema/blob/main/schemasheets/tsv_in/enums.tsv which has been hand-curated up until now.

turbomam commented 5 months ago

@pkalita-lbl can you please help me think about the GOLD path enum lifecycle?

turbomam commented 5 months ago

schemasheets/tsv_in/enums.tsv has the following columns:

For all practical purposes, we're just asserting the enum name and the permissible value name in DH pulldown column and DH pulldown option. I have been asserting 'placeholder PV descr' as the description for some black-magic reason that I can't remember

turbomam commented 5 months ago

Fetching ecosystem path data from GOLD

# The URL is quoted so the shell does not treat '?' as a glob character
assets/GOLDs5levelEcosystemClassificationPaths.xlsx:
    curl -o $@ "https://gold.jgi.doe.gov/download?mode=ecosystempaths"

GOLD's source file calls the path elements

We are calling the enums

turbomam commented 5 months ago

I started working on this out of nmdc-schema. We can move this later if it does what you want.

pkalita-lbl commented 5 months ago

Right, submission-schema has an enum for each of the GOLD pathway levels. They are definitely not complete. Like, the first three levels only allow one permissible value each (EcosystemEnum, EcosystemCategoryEnum, EcosystemTypeEnum). The other two offer more options, but again definitely not complete (EcosystemSubtypeEnum, SpecificEcosystemEnum). I assume the incompleteness was done on purpose because there sure are a lot of options, but that decision predates my time on this project.

There is also custom code in the submission portal that alters the behavior of those five columns so that you only get suggestions for valid paths. The logic is driven in part by this file: https://gold.jgi.doe.gov/download?mode=biosampleEcosystemsJson (we bake a copy into the submission portal code; we don't constantly re-fetch it). So for example, when you go to fill in the specific_ecosystem column, the options that get presented to you are determined by what the GOLD JSON file says are valid values based on the values in the 4 previous columns, and then we subset that by what's permissible according to SpecificEcosystemEnum.
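
To make that two-step filtering concrete, here is a rough Python sketch. It is not the submission portal's actual code, and it assumes the GOLD paths have already been flattened into 5-tuples; the real layout of the GOLD JSON file may well differ. The function name and the sample rows are made up for illustration.

```python
def suggest_specific_ecosystems(valid_paths, selected, permitted):
    """Return specific_ecosystem suggestions for the four values already chosen.

    valid_paths: iterable of 5-tuples (assumed flattened form of the GOLD data)
    selected:    the values in the four previous columns
    permitted:   permissible values of SpecificEcosystemEnum
    """
    selected = tuple(selected)
    # Step 1: values GOLD considers valid after the four selected levels.
    candidates = {path[4] for path in valid_paths if tuple(path[:4]) == selected}
    # Step 2: subset by what the enum allows.
    return sorted(candidates & set(permitted))


# Illustrative placeholder rows, NOT copied from GOLD.
valid_paths = [
    ("Environmental", "Terrestrial", "Soil", "Wetlands", "Peat"),
    ("Environmental", "Terrestrial", "Soil", "Wetlands", "Sediment"),
    ("Environmental", "Aquatic", "Freshwater", "Lake", "Sediment"),
]
permitted = {"Sediment", "Bulk soil"}

suggestions = suggest_specific_ecosystems(
    valid_paths, ("Environmental", "Terrestrial", "Soil", "Wetlands"), permitted
)
print(suggestions)
```

In this toy data Peat is a valid GOLD continuation but is missing from the enum, so it is never suggested, which is the kind of gap this issue is about.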

I see two options going forward:

  1. We could get rid of the 5 enums in submission-schema and make the range of the 5 slots string (mimicking what we do in nmdc-schema). Then the logic for the dropdowns in the submission portal would only be driven by the GOLD JSON file. That makes updating to get the latest GOLD terms easy; it's just that one file. The potential downside is that you lose the ability to exclude GOLD terms that we deem irrelevant.
  2. We write some kind of script to inject all of the GOLD terms into the corresponding enums in submission-schema. The update process would then be: run that script and commit the changes to submission-schema, update the GOLD JSON file in nmdc-server. Depending on how sophisticated we make that script, we could potentially exclude certain GOLD paths if that's desired.

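
A minimal sketch of what the script in option 2 might do, assuming the GOLD paths are available as 5-tuples. The `build_enums` function, the `exclude` hook, and the sample rows are hypothetical; only the five enum names and the `text` key come from the existing schema.

```python
# The five enum names used in submission-schema, in path order.
ENUM_NAMES = (
    "EcosystemEnum",
    "EcosystemCategoryEnum",
    "EcosystemTypeEnum",
    "EcosystemSubtypeEnum",
    "SpecificEcosystemEnum",
)


def build_enums(paths, exclude=lambda path: False):
    """Collect permissible values per enum, skipping paths for which exclude(path) is true."""
    enums = {name: {"permissible_values": {}} for name in ENUM_NAMES}
    for path in paths:
        if exclude(path):
            continue
        for name, term in zip(ENUM_NAMES, path):
            enums[name]["permissible_values"][term] = {"text": term}
    return enums


# Illustrative placeholder rows, NOT copied from GOLD.
paths = [
    ("Environmental", "Terrestrial", "Soil", "Wetlands", "Peat"),
    ("Environmental", "Terrestrial", "Soil", "Unclassified", "Bulk soil"),
    ("Environmental", "Aquatic", "Freshwater", "Lake", "Sediment"),
]

# Hypothetical exclusion rule: drop every path under Aquatic.
enums = build_enums(paths, exclude=lambda path: path[1] == "Aquatic")
print(sorted(enums["SpecificEcosystemEnum"]["permissible_values"]))
```

The resulting dictionary could then be serialized into whatever form the schema build expects; that step is left out here because it depends on how the enums end up being maintained.
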
pkalita-lbl commented 5 months ago

Also a long time ago I tried generating a LinkML schema that encoded the valid pathways as rules (code here https://github.com/pkalita-lbl/gold-ecosystems-linkml). The result was so unwieldy that it was basically unusable. So no one suggest doing that!

turbomam commented 5 months ago

Thanks @pkalita-lbl !

I have implemented at least half of option 2. from above as

I don't mind if you decide to go with option 1. instead

turbomam commented 5 months ago

@pkalita-lbl (or anyone): Are subsets of the GOLD paths being created for the different environmental contexts like soil or water?

pkalita-lbl commented 5 months ago

Are subsets of the GOLD paths being created for the different environmental contexts like soil or water?

I'm not sure but I think that's another thing that will influence how we implement the long-term process for keeping us in sync with GOLD. So I'm not sure we're ready to jump into implementing anything quite yet.

mslarae13 commented 5 months ago

@pkalita-lbl

They are definitely not complete. Like, the first three levels only allow one permissible value each (EcosystemEnum, EcosystemCategoryEnum, EcosystemTypeEnum).

We did intentionally limit this. That said, how it's limited will vary from sample type to sample type (environmental extension to extension)

The other two offer more options, but again definitely not complete (EcosystemSubtypeEnum, SpecificEcosystemEnum).

The missing 'lower level' ecosystem terms are cuz GOLD updated and we didn't get the updates.

So for example, when you go to fill in the specific_ecosystem column, the options that get presented to you are determined by what the GOLD JSON file says are valid values based on the values in the 4 previous columns, and then we subset that by what's permissible according to SpecificEcosystemEnum.

Yes, we don't want to lose this because it should build the same way the GOLD ecosystem tree does: https://gold.jgi.doe.gov/ecosystemtree

Are subsets of the GOLD paths being created for the different environmental contexts like soil or water?

@turbomam pretty sure that's a yes, but we haven't done it. It's really just identifying where in the tree we would limit. So, for water, it would be https://gold.jgi.doe.gov/ecosystemtree Environmental > Aquatic (then the other 3 are any sub of that). @aclum please confirm.
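
Sketching that idea in Python, limiting by a point in the tree amounts to a prefix match on the path. The `EXTENSION_PREFIXES` mapping and the function name are hypothetical, the water entry just restates the example above, and the sample rows are placeholders rather than GOLD data.

```python
# Hypothetical mapping from environmental extension to the point in the
# GOLD tree where its paths would be limited.
EXTENSION_PREFIXES = {
    "water": ("Environmental", "Aquatic"),
}


def paths_for_extension(paths, extension):
    """Keep only the paths that sit under the extension's prefix in the tree."""
    prefix = EXTENSION_PREFIXES[extension]
    return [tuple(p) for p in paths if tuple(p[: len(prefix)]) == prefix]


# Illustrative placeholder rows, NOT copied from GOLD.
paths = [
    ("Environmental", "Terrestrial", "Soil", "Wetlands", "Peat"),
    ("Environmental", "Aquatic", "Freshwater", "Lake", "Sediment"),
]

water_paths = paths_for_extension(paths, "water")
print(water_paths)
```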

pkalita-lbl commented 5 months ago

Done with: