Filter on information content?

cbizon commented 1 year ago

Here is a cypher query that is ending up in uberongraph coming from strider:

MATCH (`h`)-[`edge_1`:`biolink:has_phenotype`]->(`on`) WHERE  ( (`edge_1`.`biolink:primary_knowledge_source` IS NOT NULL) )  MATCH (`on`)-[`on_subclass_edge`:`biolink:subclass_of`*0..1]->(`on_superclass`:`biolink:Disease` {}) USING INDEX `on_superclass`:`biolink:Disease`(id) WHERE  ( (`on_superclass`.id in ["MONDO:0001982", "MONDO:0018982"]) )  MATCH (`h`)-[`h_subclass_edge`:`biolink:subclass_of`*0..1]->(`h_superclass`:`biolink:Disease` {}) USING INDEX `h_superclass`:`biolink:Disease`(id) WHERE  ( (`h_superclass`.id in ["MONDO:0005144", "MONDO:0007256", "MONDO:0001156", "MONDO:0005550", "MONDO:0017376", "MONDO:0007263", "MONDO:0003019", "MONDO:0002025", "MONDO:0005277", "MONDO:0008433", "MONDO:0004967", "MONDO:0002050", "MONDO:0005798", "MONDO:0001609", "MONDO:0005027", "MONDO:0008947", "MONDO:0019338", "MONDO:0007972", "MONDO:0000598", "MONDO:0006730", "UMLS:C0008066", "MONDO:0005101", "MONDO:0009061", "MONDO:0004058", "MONDO:0001071", "MONDO:0003240", "MONDO:0005364", "MONDO:0002254", "MONDO:0002806", "MONDO:0004992", "MONDO:0007254", "MONDO:0004988", "MONDO:0005071", "MONDO:0015129", "MONDO:0004355", "MONDO:0004609", "MONDO:0005546", "MONDO:0009690", "MONDO:0005059", "MONDO:0005139", "MONDO:0007182", "MONDO:0044903", "MONDO:0007264", "MONDO:0001158", "MONDO:0015667", "MONDO:0007603", "MONDO:0001673", "MONDO:0000249", "MONDO:0002909", "MONDO:0005485", "MONDO:0007863", "MONDO:0005180", "MONDO:0002049", "MONDO:0002009", "MONDO:0001866", "MONDO:0006585", "MONDO:0005260", "MONDO:0007661", "MONDO:0005015", "MONDO:0020320", "MONDO:0042485", "MONDO:0004375", "MONDO:0001475", "MONDO:0005352", "MONDO:0019496", "MONDO:0004979", "MONDO:0015447", "MONDO:0005770", "MONDO:0003785", "MONDO:0005147", "MONDO:0001252", "MONDO:0004425", "MONDO:0002032", "MONDO:0005146", "MONDO:0017181", "MONDO:0009820", "MONDO:0018935", "MONDO:0018874", "MONDO:0005258", "MONDO:0100096", "MONDO:0005070", "MONDO:0004985", "MONDO:0005420", "MONDO:0005492", "MONDO:0004975", "MONDO:0006745", 
"MONDO:0005618", "MONDO:0021042", "MONDO:0006166", "MONDO:0019091", "MONDO:0000693", "MONDO:0000437", "MONDO:0006664", "MONDO:0005155", "MONDO:0005203", "MONDO:0021178", "MONDO:0002623", "MONDO:0001627", "MONDO:0011918", "UMLS:C2931853", "MONDO:0020128", "MONDO:0005145", "MONDO:0021063", "MONDO:0001153", "MONDO:0005363", "MONDO:0001741", "MONDO:0011122", "MONDO:0003634", "MONDO:0001484", "MONDO:0005466", "MONDO:0010096", "MONDO:0004617", "MONDO:0005371", "MONDO:0005351", "MONDO:0010138", "MONDO:0007739", "MONDO:0010383", "MONDO:0019960", "MONDO:0005570", "MONDO:0002280", "MONDO:0005148", "MONDO:0005354", "MONDO:0004976", "MONDO:0015909", "MONDO:0007452", "MONDO:0004996", "MONDO:0001442", "MONDO:0009807", "UMLS:C0001723", "MONDO:0005395", "MONDO:0005140", "MONDO:0004471", "MONDO:0005090", "MONDO:0016383", "MONDO:0001185", "MONDO:0013600", "MONDO:0017178", "MONDO:0005021", "MONDO:0001566", "MONDO:0005046", "MONDO:0018088", "MONDO:0005072", "MONDO:0005084"]) )

This is a one-hop query connecting two pinned nodes. But because of the subclassing, far more edges are searched than the MxN you would expect from the sizes of the two ID lists.

From the bigger set of ~100 MONDO IDs, about 50,000 subclasses are found. They're heavily skewed towards a few very high-level nodes:

I increasingly think we can save time and reduce false positives by filtering these chunky nodes inside of strider. I also suspect that using information content as a proxy would work well. The first couple in that list have an IC < 35 as reported by nodenorm.

My proposal is that, for non-pinned nodes, strider takes the IC from its nodenorm (NN) calls and filters on it, using a cutoff that can be specified at query time. We could default that cutoff to 0, or choose to start it higher, at 35 or 40.
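
To make the proposal concrete, here is a minimal sketch of a cutoff filter, not strider's actual code. The function name, the `ic_by_node` mapping, the `EX:` identifiers, and the IC values are all invented for illustration, and treating "at or above the cutoff" as passing is an assumption.

```python
from typing import Dict, Iterable, List, Set


def filter_by_information_content(
    node_ids: Iterable[str],
    ic_by_node: Dict[str, float],
    pinned: Set[str],
    cutoff: float = 0.0,
) -> List[str]:
    """Keep pinned nodes, plus non-pinned nodes whose IC is at least `cutoff`."""
    kept = []
    for node_id in node_ids:
        # Pinned nodes are never filtered; everything else must meet the cutoff.
        if node_id in pinned or ic_by_node[node_id] >= cutoff:
            kept.append(node_id)
    return kept


# Invented IC values, for illustration only (not real nodenorm output).
ic_by_node = {"EX:broad": 30.0, "EX:middle": 38.0, "EX:specific": 90.0}
candidates = list(ic_by_node)

for cutoff in (0.0, 35.0, 40.0):
    print(cutoff, filter_by_information_content(candidates, ic_by_node, set(), cutoff))
```

With a cutoff of 0 nothing is removed, so the default would leave current behavior unchanged; raising it to 35 or 40 progressively drops the broadest nodes.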

maximusunc commented 1 year ago

Here's what I'm thinking for this:

We don't filter out any pinned nodes from the query graph. The user meant to have those in.
We don't filter on the end/answer node. We want to provide all the common answers that people are expecting to be there.
The filter threshold will be an environment variable that can be changed easily, and I believe Abrar mentioned setting it to 75 as a default.
Any node that has no information content will be kept.

I think this is a good place to start, and we can tweak as needed once the filtering is in place and we see how much it helps.
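
The four rules above could be sketched roughly as follows. This is an illustration under assumptions, not the code that was merged: the environment variable name `INFORMATION_CONTENT_THRESHOLD` and the function `keep_node` are hypothetical, and only the default of 75 comes from the discussion.

```python
import os
from typing import Optional

# Hypothetical variable name; the thread only says the threshold is an
# environment variable, with 75 mentioned as a possible default.
IC_THRESHOLD = float(os.environ.get("INFORMATION_CONTENT_THRESHOLD", "75"))


def keep_node(
    information_content: Optional[float],
    is_pinned: bool,
    is_answer_node: bool,
    threshold: float = IC_THRESHOLD,
) -> bool:
    """Return True if the node should survive the information-content filter."""
    if is_pinned:
        return True  # the user put this node in the query graph on purpose
    if is_answer_node:
        return True  # keep the answers people expect to see
    if information_content is None:
        return True  # no IC reported: keep the node rather than guess
    return information_content >= threshold
```

Only intermediate, non-pinned nodes with a known IC below the threshold are dropped, so the filter errs on the side of keeping results.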

cbizon commented 1 year ago

Could we make it a parameter on the query? Maybe an attribute in the TRAPI somewhere? Then we can easily experiment without fiddling with anything.
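
One way this could look, as a sketch only: read the cutoff from the incoming query if present, and otherwise fall back to the environment variable. The key `information_content_threshold` and the variable name `INFORMATION_CONTENT_THRESHOLD` are hypothetical; where such a parameter would actually live in a TRAPI query is not settled in this thread.

```python
import os
from typing import Any, Dict, Mapping, Optional

DEFAULT_THRESHOLD = 75.0  # default mentioned earlier in the discussion


def resolve_threshold(
    query: Dict[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> float:
    """Pick the IC cutoff: per-query value first, then env var, then default."""
    env = os.environ if env is None else env
    # Hypothetical top-level key on the query body.
    value = query.get("information_content_threshold")
    if value is None:
        # Hypothetical environment variable name.
        value = env.get("INFORMATION_CONTENT_THRESHOLD", DEFAULT_THRESHOLD)
    return float(value)
```

A per-query value takes precedence, which allows experimenting with different cutoffs without redeploying or changing the service configuration.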

maximusunc commented 1 year ago

Added in #416

ranking-agent / strider

Filter on information content? #406