microbiomedata / nmdc-server

Data portal client and server for NMDC.
https://data.microbiomedata.org

data portal - environmental local scale search issue #1173

Open aclum opened 4 months ago

aclum commented 4 months ago

The data portal is showing counts for a parent class without showing the child classes those counts belong to. Yesterday I was using the data portal to find the environmental local context terms for NEON soil samples from Colorado (search filter). When I then try to navigate to the term value, I see that astronomical body part has a count of 517 but none of its children have any counts. Compare this with fiat object part, which has a child, environmental zone, that can be used to navigate down to the actual terms. One of the expected values, several levels down, is 'area of gramanoid or herbaceous vegetation'. See the image for a view of the missing counts for the children of astronomical body part.

@turbomam confirmed that when he looks at the ontology JSON independently, the terms can be navigated to via astronomical body part. Based on this, we believe the issue is on the nmdc-server side rather than in the nmdco-classes.json file.

Mark's comments: the Ontology Access Kit can be used to check nmdco-classes.json independently of the data portal:

runoak --input pronto:nmdco-classes.json tree 'area of gramanoid or herbaceous vegetation'

jeffbaumes commented 4 months ago

I believe what is happening is that the repeated subtree starting at "ENVO:01001199 ! terrestrial environmental zone" is (arbitrarily) only placed under "ENVO:01000408 ! environmental zone" and not "ENVO:01000813 ! astronomical body part". The current implementation never repeats subtrees. The real data structure for the ontology is a complex inter-dependent directed acyclic graph and simplifying assumptions were made for the UI.

This is the relevant issue describing the complexity and tradeoffs.

This notebook shows the general approach we decided to take to "treeify" a complex directed acyclic graph. Note that it states:

The first step is to make the directed acyclic graph into a tree. We do this by arbitrarily taking the first parent node from each node as the true parent, and discarding the rest of the parent links
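To make the quoted step concrete, here is a minimal Python sketch of the "first parent wins" idea. It is not the nmdc-server code: the `PARENTS` mapping, the order of parents in it, and the `treeify` helper are assumptions for illustration, and the edges are simplified (the real leaf sits several levels below terrestrial environmental zone).

```python
# Illustrative child -> ordered-parents mapping. The edges are simplified
# and the parent order is an assumption, not taken from nmdco-classes.json.
PARENTS = {
    "fiat object part": [],
    "astronomical body part": [],
    "environmental zone": ["fiat object part"],
    "terrestrial environmental zone": [
        "environmental zone",
        "astronomical body part",
    ],
    "area of gramanoid or herbaceous vegetation": [
        "terrestrial environmental zone",
    ],
}


def treeify(parents):
    """Keep only the first parent of each node; return parent -> children."""
    children = {node: [] for node in parents}
    for node, node_parents in parents.items():
        if node_parents:
            # Every parent link after the first one is discarded.
            children[node_parents[0]].append(node)
    return children


TREE = treeify(PARENTS)
print(TREE["environmental zone"])      # ['terrestrial environmental zone']
print(TREE["astronomical body part"])  # []
```

Under these assumptions, astronomical body part ends up with no children in the tree, even though samples annotated below terrestrial environmental zone would still roll up to it in the full graph. That matches the symptom reported above.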

Note the comment from @cmungall in the linked issue:

ontology group defines initial exclusion sets (e.g. astronomical body part)

This makes me think that "ENVO:01000813 ! astronomical body part" should not be part of the tree presented to the user, perhaps for exactly this reason: it would lead to a duplicated ontology subtree. That would be my proposed solution to this particular inconsistency.
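A rough Python sketch of that proposal, assuming a hypothetical `drop_excluded` helper and a simplified, illustrative parent mapping (neither is taken from the nmdc-server code): the excluded terms are removed, along with any parent links pointing at them, before the tree is built.

```python
# Illustrative child -> parents mapping; edges are simplified assumptions.
PARENTS = {
    "fiat object part": [],
    "astronomical body part": [],
    "environmental zone": ["fiat object part"],
    "terrestrial environmental zone": [
        "environmental zone",
        "astronomical body part",
    ],
}

# Hypothetical exclusion set, as would be supplied by the ontology group.
EXCLUDED = {"astronomical body part"}


def drop_excluded(parents, excluded):
    """Remove excluded nodes and every parent link that points at them."""
    return {
        node: [p for p in node_parents if p not in excluded]
        for node, node_parents in parents.items()
        if node not in excluded
    }


PRUNED = drop_excluded(PARENTS, EXCLUDED)
print(PRUNED["terrestrial environmental zone"])  # ['environmental zone']
```

In this toy example the shared subtree is left with a single parent, so there is no longer an arbitrary choice to make and no counted node without browsable children.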

aclum commented 4 months ago

This is the relevant nmdc-server code: https://github.com/microbiomedata/nmdc-server/blob/main/nmdc_server/ingest/envo.py

turbomam commented 4 months ago

Good comments about DAGs vs trees, @jeffbaumes . I misspoke in today's meeting.

turbomam commented 4 months ago

I think this should be driven by user stories. What are the searching or browsing patterns we expect?

Would anybody ever search for an intermediate node that has been left out of the exposed hierarchy? If so, then maybe we should omit intermediate nodes.

Would anybody skip the searching step and just browse through the subclasses? If so, then we should probably include all paths to a leaf.
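For the browsing case, "all paths to a leaf" could look roughly like the Python sketch below. The `paths_to` helper and the simplified parent mapping are assumptions for illustration only, not the portal's actual data or code.

```python
# Illustrative child -> parents mapping; edges are simplified assumptions.
PARENTS = {
    "fiat object part": [],
    "astronomical body part": [],
    "environmental zone": ["fiat object part"],
    "terrestrial environmental zone": [
        "environmental zone",
        "astronomical body part",
    ],
    "area of gramanoid or herbaceous vegetation": [
        "terrestrial environmental zone",
    ],
}


def paths_to(node, parents):
    """Return every root-to-node path in the DAG, following all parents."""
    node_parents = parents.get(node, [])
    if not node_parents:
        return [[node]]  # a root: the path is just the node itself
    return [
        path + [node]
        for parent in node_parents
        for path in paths_to(parent, parents)
    ]


for path in paths_to("area of gramanoid or herbaceous vegetation", PARENTS):
    print(" > ".join(path))
```

With this mapping the leaf is reachable along two paths, one through fiat object part and one through astronomical body part. The tradeoff is the duplicated subtree discussed earlier in the thread.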

The OBO Foundry community is very active in developing tools for this kind of thing, and the obo-community Slack workspace is busy. If none of you want to join it, I can pass along any questions or requests you share with me.

aclum commented 4 months ago

My use case from this week was browsing through sub-classes.