hubmapconsortium / search-api

HuBMAP search service and associated pieces to create an index
https://search.api.hubmapconsortium.org
MIT License

Fix `add_partonomy.py` #420

Open mccalluc opened 2 years ago

mccalluc commented 2 years ago

For now, just revert to flat organ facets:

Then:

bherr2 commented 2 years ago

I was coming here to say that the ccf-partonomy.jsonld is a tree, but now I see that my code has an edge case that could have prevented that! We'll be updating the ccf with new data in the next two weeks, so I'll try to get that piece fixed in the process!

Also thanks for resurfacing this.

mccalluc commented 2 years ago

Questions from @computationdoc :

... and my understanding from Nils's problem description is that Portal UI is already built assuming the anatomy is a proper tree. If Portal UI does allow polyhierarchies and still counts and searches correctly, we can use ‘all shortest paths’ rather than ‘single shortest path’ in neo4j. The difference in code is trivial for these tools, so no big rewrite is needed to allow polyhierarchies if we choose to.

For hierarchical faceting, the portal documents need to include anatomy_# fields like this:

{...
 'anatomy_0': ['body'],
 'anatomy_1': ['large intestine'],
 'anatomy_2': ['transverse colon'],
...}

Can you point me to a current JSON example of the format Portal UI is currently receiving or transforming to? Is it a simple tree of ccf_annotations in JSON, and are you separately indexing the datasets from search to each distinct ccf_annotation to ‘know’ the facet counts?

Currently, the code downloads https://cdn.jsdelivr.net/gh/hubmapconsortium/hubmap-ontology@1.6.0/ccf-partonomy.jsonld. Code in add_partonomy.py parses and traverses this file, but I have no particular attachment to it: Feel free to revise, or throw it away, if more of the heavy lifting can be done in the API.
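To make the traversal step concrete, here is a minimal sketch of walking from one term up to 'body'. It does not reproduce the real add_partonomy.py or the actual structure of ccf-partonomy.jsonld: the child-to-parent dict `parent_of` and the function `path_to_root` are hypothetical names, and the sketch assumes each term has a single parent (a proper tree).

```python
def path_to_root(term, parent_of, root="body"):
    """Return the path [root, ..., term] by following parent links upward.

    parent_of is a hypothetical {child: parent} mapping; it stands in for
    whatever structure is parsed out of the partonomy file.
    """
    path = [term]
    seen = {term}
    while path[-1] != root:
        parent = parent_of.get(path[-1])
        if parent is None:
            raise ValueError(f"{path[-1]!r} has no path to {root!r}")
        if parent in seen:
            raise ValueError(f"cycle detected at {parent!r}")
        path.append(parent)
        seen.add(parent)
    # The walk went leaf-to-root; callers want root-to-leaf.
    return path[::-1]


# Toy data taken from the example documents in this thread.
parent_of = {
    "large intestine": "body",
    "transverse colon": "large intestine",
    "ascending colon": "large intestine",
}
transverse_path = path_to_root("transverse colon", parent_of)
```

With the toy mapping above, `transverse_path` is the three-element path from 'body' down to 'transverse colon'. If a term could have more than one parent, a single `parent_of` dict would no longer be enough.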

I have one question: is there any downside to making everything single-valued, i.e. having multiple paths, each identical except for a different lowest value (rather than multi-valued at the lowest level)? Obviously one could combine them post hoc, but I’d just as soon skip that step if it’s irrelevant, since the process I’m envisioning creates a set of single-value paths from each unique value in the data to ‘body’ anyway.

If I understand correctly, you are asking if instead of

{...
 'anatomy_0': ['body'],
 'anatomy_1': ['large intestine'],
 'anatomy_2': ['transverse colon', 'ascending colon'],
...}

it could be something like:

{...
  'anatomy': [
    ['body', 'large intestine', 'transverse colon'],
    ['body', 'large intestine', 'ascending colon']
  ],
...}

If I'm misunderstanding, please provide an example of what you have in mind.

If my understanding is correct: This would be a great response from the API, and a great intermediate representation, but for the index, we need anatomy_0, anatomy_1, etc. To make the hierarchy useful in search, the levels need to be cleanly separated, so we can do a high-level search by constraining just anatomy_0 and _1... or progressively more constrained searches by adding _2, _3, etc.
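As a rough illustration of how the list-of-paths form could be turned into the per-level fields, here is a minimal sketch. The helper names `paths_to_level_fields` and `anatomy_filter` are hypothetical, and the query shape assumes an Elasticsearch-style index where the anatomy_# fields support exact-match `term` filters; neither is taken from the existing code.

```python
def paths_to_level_fields(paths, prefix="anatomy"):
    """Collapse root-to-leaf paths into one multi-valued field per level."""
    fields = {}
    for path in paths:
        for depth, term in enumerate(path):
            values = fields.setdefault(f"{prefix}_{depth}", [])
            if term not in values:  # keep first-seen order, drop duplicates
                values.append(term)
    return fields


def anatomy_filter(levels, prefix="anatomy"):
    """Build a query body constraining the first len(levels) levels.

    Assumes (hypothetically) that the index accepts bool/filter/term queries
    on the anatomy_# fields.
    """
    clauses = [
        {"term": {f"{prefix}_{depth}": term}}
        for depth, term in enumerate(levels)
    ]
    return {"query": {"bool": {"filter": clauses}}}


paths = [
    ["body", "large intestine", "transverse colon"],
    ["body", "large intestine", "ascending colon"],
]
fields = paths_to_level_fields(paths)
high_level_query = anatomy_filter(["body", "large intestine"])
```

Here `fields` matches the anatomy_0 / anatomy_1 / anatomy_2 document shown earlier in this comment, and passing a longer list to `anatomy_filter` gives a progressively more constrained search.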

computationdoc commented 2 years ago

Good news. 1. We were able to verify yesterday that the new ASCT+B ingests properly into KG, so we can take advantage of better trees in the infrastructure. 2. We found the algorithm needed in the neo4j documentation to essentially produce a tree with a guarantee that each node appears at only one level, so we don't have to create the algorithm (it's embedded in this documentation, I think in the "sequence" section if I recall correctly): https://neo4j-contrib.github.io/neo4j-apoc-procedures/index33.html#_expand_paths

So, bottom line, we can now fairly simply build a process that takes a list of codes/concepts (e.g., the set of AS from RUI data) from PROV and uses those to assemble the right "sub-graph" from KG without new algorithm work per se, just queries. We'll still need to develop the proper spec and schedule the work.
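To illustrate what "each node at only one level" means, here is a small in-memory sketch using a plain breadth-first search. It is not the APOC procedure from the linked documentation, and the `children` adjacency dict and the placeholder node names are hypothetical. BFS from 'body' gives every reachable node exactly one level (its shortest distance from the root); recording every parent that lies on a shortest path corresponds to the "all shortest paths" option, while keeping only the first corresponds to "single shortest path".

```python
from collections import deque


def bfs_levels(children, root="body"):
    """Return ({node: level}, {node: [parents on a shortest path]})."""
    level = {root: 0}
    parents = {root: []}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in level:
                # First visit fixes the node's one and only level.
                level[child] = level[node] + 1
                parents[child] = [node]
                queue.append(child)
            elif level[child] == level[node] + 1:
                # Another parent that is equally close to the root.
                parents[child].append(node)
            # Otherwise this edge is a longer route and is ignored.
    return level, parents


# Hypothetical polyhierarchy with placeholder names (not real anatomy).
children = {
    "body": ["region_a", "region_b", "shortcut_part"],
    "region_a": ["shared_part", "shortcut_part"],
    "region_b": ["shared_part"],
}
level, parents = bfs_levels(children)
single_parent = {node: ps[:1] for node, ps in parents.items()}
```

In this toy graph, `shared_part` lands on level 2 with both regions as shortest-path parents, while `shortcut_part` stays on level 1 because the longer route through `region_a` is ignored.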

mccalluc commented 2 years ago

Update from Jesse:

Spoke to JS; as of now he is principally responsible. If we deliver it without using ASCTB, it can be done quickly. With it, it's still not certain. A message will be sent out to figure out the best course of action.

Reaching out to key members and will be able to provide an answer by end of the week.