ranking-agent / strider

A TRAPI-compliant component of ARAGORN that queries distributed KPs and assembles answers to user questions.
MIT License

Filter on information content? #406

Closed cbizon closed 1 year ago

cbizon commented 1 year ago

Here is a Cypher query that strider is sending to uberongraph:

MATCH (`h`)-[`edge_1`:`biolink:has_phenotype`]->(`on`) WHERE  ( (`edge_1`.`biolink:primary_knowledge_source` IS NOT NULL) )  MATCH (`on`)-[`on_subclass_edge`:`biolink:subclass_of`*0..1]->(`on_superclass`:`biolink:Disease` {}) USING INDEX `on_superclass`:`biolink:Disease`(id) WHERE  ( (`on_superclass`.id in ["MONDO:0001982", "MONDO:0018982"]) )  MATCH (`h`)-[`h_subclass_edge`:`biolink:subclass_of`*0..1]->(`h_superclass`:`biolink:Disease` {}) USING INDEX `h_superclass`:`biolink:Disease`(id) WHERE  ( (`h_superclass`.id in ["MONDO:0005144", "MONDO:0007256", "MONDO:0001156", "MONDO:0005550", "MONDO:0017376", "MONDO:0007263", "MONDO:0003019", "MONDO:0002025", "MONDO:0005277", "MONDO:0008433", "MONDO:0004967", "MONDO:0002050", "MONDO:0005798", "MONDO:0001609", "MONDO:0005027", "MONDO:0008947", "MONDO:0019338", "MONDO:0007972", "MONDO:0000598", "MONDO:0006730", "UMLS:C0008066", "MONDO:0005101", "MONDO:0009061", "MONDO:0004058", "MONDO:0001071", "MONDO:0003240", "MONDO:0005364", "MONDO:0002254", "MONDO:0002806", "MONDO:0004992", "MONDO:0007254", "MONDO:0004988", "MONDO:0005071", "MONDO:0015129", "MONDO:0004355", "MONDO:0004609", "MONDO:0005546", "MONDO:0009690", "MONDO:0005059", "MONDO:0005139", "MONDO:0007182", "MONDO:0044903", "MONDO:0007264", "MONDO:0001158", "MONDO:0015667", "MONDO:0007603", "MONDO:0001673", "MONDO:0000249", "MONDO:0002909", "MONDO:0005485", "MONDO:0007863", "MONDO:0005180", "MONDO:0002049", "MONDO:0002009", "MONDO:0001866", "MONDO:0006585", "MONDO:0005260", "MONDO:0007661", "MONDO:0005015", "MONDO:0020320", "MONDO:0042485", "MONDO:0004375", "MONDO:0001475", "MONDO:0005352", "MONDO:0019496", "MONDO:0004979", "MONDO:0015447", "MONDO:0005770", "MONDO:0003785", "MONDO:0005147", "MONDO:0001252", "MONDO:0004425", "MONDO:0002032", "MONDO:0005146", "MONDO:0017181", "MONDO:0009820", "MONDO:0018935", "MONDO:0018874", "MONDO:0005258", "MONDO:0100096", "MONDO:0005070", "MONDO:0004985", "MONDO:0005420", "MONDO:0005492", "MONDO:0004975", "MONDO:0006745", 
"MONDO:0005618", "MONDO:0021042", "MONDO:0006166", "MONDO:0019091", "MONDO:0000693", "MONDO:0000437", "MONDO:0006664", "MONDO:0005155", "MONDO:0005203", "MONDO:0021178", "MONDO:0002623", "MONDO:0001627", "MONDO:0011918", "UMLS:C2931853", "MONDO:0020128", "MONDO:0005145", "MONDO:0021063", "MONDO:0001153", "MONDO:0005363", "MONDO:0001741", "MONDO:0011122", "MONDO:0003634", "MONDO:0001484", "MONDO:0005466", "MONDO:0010096", "MONDO:0004617", "MONDO:0005371", "MONDO:0005351", "MONDO:0010138", "MONDO:0007739", "MONDO:0010383", "MONDO:0019960", "MONDO:0005570", "MONDO:0002280", "MONDO:0005148", "MONDO:0005354", "MONDO:0004976", "MONDO:0015909", "MONDO:0007452", "MONDO:0004996", "MONDO:0001442", "MONDO:0009807", "UMLS:C0001723", "MONDO:0005395", "MONDO:0005140", "MONDO:0004471", "MONDO:0005090", "MONDO:0016383", "MONDO:0001185", "MONDO:0013600", "MONDO:0017178", "MONDO:0005021", "MONDO:0001566", "MONDO:0005046", "MONDO:0018088", "MONDO:0005072", "MONDO:0005084"]) )

This is a 1-hop query connecting 2 pinned nodes. But because of the subclass expansion on both ends, far more edges are searched than you might expect by looking at MxN.

Expanding the bigger set of ~100 MONDO identifiers finds about 50,000 subclasses. They're heavily skewed towards a few very high-level nodes:

[image: list of the nodes with the most subclasses]

I increasingly think we can save time and reduce false positives by filtering these chunky nodes inside strider. I also suspect that information content (IC) would work well as a proxy for how broad a node is. The first couple in that list have an IC < 35 as reported by nodenorm.

My proposal is that, for non-pinned nodes, strider takes the IC from its node normalizer (NN) calls and filters on it, using a cutoff that can be specified at query time. The cutoff could default to 0, or we could start it higher, like 35 or 40.
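
To make the proposal concrete, here is a minimal sketch of the filtering step. It is not strider's actual code: the function name, the `ic_by_curie` mapping (assumed to have been built from the node normalizer responses), and the choice to keep nodes that have no reported IC are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Set


def filter_by_information_content(
    curies: Iterable[str],
    ic_by_curie: Dict[str, Optional[float]],
    pinned: Set[str],
    min_ic: float = 0.0,
) -> List[str]:
    """Drop non-pinned nodes whose information content is below min_ic.

    ic_by_curie is a hypothetical mapping from CURIE to the IC value
    reported by the node normalizer (None or missing if no IC is known).
    """
    kept: List[str] = []
    for curie in curies:
        # Pinned nodes were asked for explicitly, so they are never filtered.
        if curie in pinned:
            kept.append(curie)
            continue
        ic = ic_by_curie.get(curie)
        # Assumption: a node with no reported IC is kept rather than dropped.
        if ic is None or ic >= min_ic:
            kept.append(curie)
    return kept
```

With the default cutoff of 0 nothing is removed, so the filter only takes effect when a higher value such as 35 or 40 is requested.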

maximusunc commented 1 year ago

Here's what I'm thinking for this:

I think this is a good place to start, and we can tweak as needed once the filtering is in place and we see how much it helps.

cbizon commented 1 year ago

Could we make it a parameter on the query? Maybe an attribute in the TRAPI somewhere? Then we can easily experiment without fiddling with anything.
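
As a rough sketch of what a query-time parameter could look like: the snippet below reads a cutoff from the incoming query and falls back to a default. The `parameters` block and the `information_content_threshold` key are hypothetical names chosen for this example, not something defined by TRAPI or confirmed in this thread.

```python
from typing import Any, Dict

DEFAULT_MIN_IC = 0.0  # a default of 0 means no filtering


def get_ic_cutoff(query: Dict[str, Any], default: float = DEFAULT_MIN_IC) -> float:
    """Read the IC cutoff from a query dict, falling back to the default.

    Both "parameters" and "information_content_threshold" are hypothetical
    names used only for illustration.
    """
    params = query.get("parameters") or {}
    value = params.get("information_content_threshold", default)
    try:
        cutoff = float(value)
    except (TypeError, ValueError):
        # Unparseable values fall back to the default instead of failing the query.
        return default
    # Negative (or NaN) cutoffs make no sense for IC, so fall back as well.
    return cutoff if cutoff >= 0 else default
```

Reading the value per query means different cutoffs can be tried without redeploying anything.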

maximusunc commented 1 year ago

Added in #416