Add a USING JOIN? - Githubissues

Here is a graph that takes minutes in an ubergraph or robokopkg

MATCH (`h`)-[`edge_1`:`biolink:has_phenotype`]->(`on`) WHERE  ( (`edge_1`.`biolink:primary_knowledge_source` IS NOT NULL) )  MATCH (`on`)-[`on_subclass_edge`:`biolink:subclass_of`*0..1]->(`on_superclass`:`biolink:Disease` {}) USING INDEX `on_superclass`:`biolink:Disease`(id) WHERE  ( (`on_superclass`.id in ["MONDO:0001982", "MONDO:0018982"]) )  MATCH (`h`)-[`h_subclass_edge`:`biolink:subclass_of`*0..1]->(`h_superclass`:`biolink:Disease` {}) USING INDEX `h_superclass`:`biolink:Disease`(id) WHERE  ( (`h_superclass`.id in ["MONDO:0005144", "MONDO:0007256", "MONDO:0001156", "MONDO:0005550", "MONDO:0017376", "MONDO:0007263", "MONDO:0003019", "MONDO:0002025", "MONDO:0005277", "MONDO:0008433", "MONDO:0004967", "MONDO:0002050", "MONDO:0005798", "MONDO:0001609", "MONDO:0005027", "MONDO:0008947", "MONDO:0019338", "MONDO:0007972", "MONDO:0000598", "MONDO:0006730", "UMLS:C0008066", "MONDO:0005101", "MONDO:0009061", "MONDO:0004058", "MONDO:0001071", "MONDO:0003240", "MONDO:0005364", "MONDO:0002254", "MONDO:0002806", "MONDO:0004992", "MONDO:0007254", "MONDO:0004988", "MONDO:0005071", "MONDO:0015129", "MONDO:0004355", "MONDO:0004609", "MONDO:0005546", "MONDO:0009690", "MONDO:0005059", "MONDO:0005139", "MONDO:0007182", "MONDO:0044903", "MONDO:0007264", "MONDO:0001158", "MONDO:0015667", "MONDO:0007603", "MONDO:0001673", "MONDO:0000249", "MONDO:0002909", "MONDO:0005485", "MONDO:0007863", "MONDO:0005180", "MONDO:0002049", "MONDO:0002009", "MONDO:0001866", "MONDO:0006585", "MONDO:0005260", "MONDO:0007661", "MONDO:0005015", "MONDO:0020320", "MONDO:0042485", "MONDO:0004375", "MONDO:0001475", "MONDO:0005352", "MONDO:0019496", "MONDO:0004979", "MONDO:0015447", "MONDO:0005770", "MONDO:0003785", "MONDO:0005147", "MONDO:0001252", "MONDO:0004425", "MONDO:0002032", "MONDO:0005146", "MONDO:0017181", "MONDO:0009820", "MONDO:0018935", "MONDO:0018874", "MONDO:0005258", "MONDO:0100096", "MONDO:0005070", "MONDO:0004985", "MONDO:0005420", "MONDO:0005492", "MONDO:0004975", "MONDO:0006745", "MONDO:0005618", "MONDO:0021042", "MONDO:0006166", "MONDO:0019091", "MONDO:0000693", "MONDO:0000437", "MONDO:0006664", "MONDO:0005155", "MONDO:0005203", "MONDO:0021178", "MONDO:0002623", "MONDO:0001627", "MONDO:0011918", "UMLS:C2931853", "MONDO:0020128", "MONDO:0005145", "MONDO:0021063", "MONDO:0001153", "MONDO:0005363", "MONDO:0001741", "MONDO:0011122", "MONDO:0003634", "MONDO:0001484", "MONDO:0005466", "MONDO:0010096", "MONDO:0004617", "MONDO:0005371", "MONDO:0005351", "MONDO:0010138", "MONDO:0007739", "MONDO:0010383", "MONDO:0019960", "MONDO:0005570", "MONDO:0002280", "MONDO:0005148", "MONDO:0005354", "MONDO:0004976", "MONDO:0015909", "MONDO:0007452", "MONDO:0004996", "MONDO:0001442", "MONDO:0009807", "UMLS:C0001723", "MONDO:0005395", "MONDO:0005140", "MONDO:0004471", "MONDO:0005090", "MONDO:0016383", "MONDO:0001185", "MONDO:0013600", "MONDO:0017178", "MONDO:0005021", "MONDO:0001566", "MONDO:0005046", "MONDO:0018088", "MONDO:0005072", "MONDO:0005084"]) )

This is basically a 1 hop, but pinned at both ends, and with multiple possible values at each end. Of course, transpiler also cares about all the subclasses of these input identifiers. Notice also that there are a couple of id's on one end and many on the other. So the final query that gets made into cypher basically looks like:

(a few Ids)<-subclass_of-(on)-has_phenotype->(h)-subclass_of->(a lot of id's).

The profile looks like:

So it gets the subclass nodes on and h. It then has a choice: should it take the phenotype edges from the side where there are lots of ids or the side where there are just a few. Surprisingly, it chooses from the side where there are a lot, so then there are a lot of has phenotype edges that have to be pulled.

In this case, the query can be made much faster by setting where you want to join, like this:

USING JOIN on `h`

Then it forces the has_phenotype to come from the small side. If we wanted to make use of this, we'd need to be able to spot the situation: the input query is a single hop, with both ends pinned, and with more nodes on one side than the other. In that case, it MIGHT make sense to add this Join hint.

There are a couple of caveats that need exploration though before anything gets going. First, it's hard to say that it is much faster, because of the way neo4j caches results. You can look at the # of db hits, but I think that a more careful analysis turning off the cache would be called for. Second, and maybe worse? Is that what happens is that from the small side, none of the nodes have a has_phenotype edge. So the whole query gets abandoned at that point (saves a lot of time). But this might be a special case - perhaps if there were even one edge, you'd have to do enough work on the long side that the savings dissappear, or even worse perhaps neo4j's plan would actually be better in those cases for some reason.

So this will probably require some more careful testing before any implementation. Thoughts @YaphetKG ? @EvanDietzMorris ?

ranking-agent / reasoner-transpiler

Add a USING JOIN? #76