isamplesorg / isamples_inabox

Provides functionality intermediate to a collection and central

GEOME's has_context_categories used to have taxon data in them, how to handle now that keywords are managed with a schema? #312

Open dannymandel opened 9 months ago

dannymandel commented 9 months ago

The previous implementation of the GEOME has_context_categories method looked like this:

    def has_context_categories(self) -> typing.List[str]:
        if self._session is not None:
            ranks = ["kingdom", "phylum", "genus"]
            ranks_to_check = []
            for rank in ranks:
                value = self._source_record_main_record().get(rank)
                if value is not None and value != "unidentified":
                    ranks_to_check.append(value)
            for rank in ranks_to_check:
                kingdom = kingdom_for_taxonomy_name(self._session, rank)
                if kingdom is not None:
                    return [kingdom]
        # Didn't find one, return empty
        return []

It looks like the examples all now have marinewaterbody set, e.g. https://github.com/isamplesorg/metadata/blob/a59d9b35062643928f868f85da5b32bb02a6b357/examples/GEOME/test1.0Valid/ark-21547-AvL2C02_201705281001-v1.json#L9

Should this just be hardcoded to marinewaterbody? FWIW, the taxon ranks are now included in the keywords, so we haven't lost this information.

smrgeoinfo commented 9 months ago

I cheated some when I was making those example instances and did some inferencing to get the environment that the organism inhabits and used that as the sampled feature. In this case the sample is a coral, so the inference is 'marine environment'. I don't seem to have saved the original raw JSON for this record, so I'm not sure what all was there. I probably also inferred 'coral reef' as the sampled feature. I was (am) hoping that we could use the machine learning tools Sarah S. is working on to train this kind of inferencing.

In the meantime... It looks like in the code above, the has_context_category key, which takes a value from the sampledfeature vocabulary, was using the taxon rank ("kingdom", "phylum", "genus")?
The sampledfeature vocabulary has "Biological entity", with definition "Sampled feature is an organism. Use for samples that represent some species of organism as the proximate sampled feature for which the focus is not the environment that the organism inhabits." This might well apply to many GEOME samples, and the simple default for now might be to use that if we can't figure out the environment sampled.

There is a biologicalEntityExtension vocabulary (https://github.com/isamplesorg/vocabularies/blob/main/src/extensions/biologicEntityExtension.ttl) that has the kingdom-level subclasses of biological entity. The next level would be to get the kingdom name from the GEOME record and match that to the extension vocabulary and add that as a has_context_category value.
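A minimal sketch of that two-step approach, to make the proposal concrete: try to match the record's kingdom against the extension vocabulary, and fall back to "Biological entity" otherwise. The function name `context_categories_for_record`, the mapping `KINGDOM_TO_CONTEXT_CATEGORY`, and the kingdom labels in it are all hypothetical placeholders; the real labels would have to be read from biologicEntityExtension.ttl, and the record is assumed to be a plain dict with an optional `kingdom` key, as in the earlier implementation.

```python
import typing

# Label quoted in this thread from the sampledfeature vocabulary.
DEFAULT_CONTEXT_CATEGORY = "Biological entity"

# HYPOTHETICAL mapping from lowercased GEOME kingdom values to labels in the
# biological entity extension vocabulary. The labels below are placeholders,
# not the actual vocabulary terms.
KINGDOM_TO_CONTEXT_CATEGORY: typing.Dict[str, str] = {
    "animalia": "Animalia",
    "plantae": "Plantae",
    "fungi": "Fungi",
}


def context_categories_for_record(
    record: typing.Mapping[str, typing.Any]
) -> typing.List[str]:
    """Return a kingdom-level category if one matches, else the default."""
    kingdom = record.get("kingdom")
    if isinstance(kingdom, str):
        key = kingdom.strip().lower()
        # Skip empty values and the "unidentified" marker that the
        # earlier implementation also ignored.
        if key and key != "unidentified":
            category = KINGDOM_TO_CONTEXT_CATEGORY.get(key)
            if category is not None:
                return [category]
    # No usable kingdom: use the simple default suggested above.
    return [DEFAULT_CONTEXT_CATEGORY]
```

Unlike the previous method, this sketch needs no database session, since the lookup is a static table; whether that is acceptable depends on how the kingdom vocabulary ends up being loaded.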

dannymandel commented 9 months ago

@datadavev this was the issue we were discussing in this morning's standup around the kingdom vocabularies. Whatever you implement should support this use case as well.