ranking-agent / reasoner-transpiler

A library for converting TRAPI queries into cypher queries, taking into account the biolink predicate hierarchy

Transpiler generates query with bad plan #70

Closed (cbizon closed 1 year ago)

cbizon commented 1 year ago

Sending this TRAPI query to ROBOKOP returns very quickly:

query = {
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n00",
                    "object": "n01",
                    "predicates": ["biolink:related_to"],
                },
                "e01": {
                    "subject": "n01",
                    "object": "n02",
                    "predicates": ["biolink:related_to"],
                },
            },
            "nodes": {
                "n00": {
                    "ids": ["NCBIGene:5465"],  # input_node_id_list
                    "categories": ["biolink:BiologicalEntity"],
                },
                "n01": {
                    # "categories": ["biolink:BiologicalEntity"]
                    "categories": ["biolink:BiologicalProcessOrActivity", "biolink:Gene", "biolink:Pathway"],
                },
                "n02": {
                    "ids": ["HP:0001395"],  # output_node_id_list
                    # "categories": ["biolink:BiologicalEntity"]
                    "categories": ["biolink:DiseaseOrPhenotypicFeature"],
                },
            },
        }
    }
}

But if the n01 categories are changed to "categories": ["biolink:BiologicalEntity"], then the query takes forever.

The difference in the Cypher is that when there is a single category, the transpiler writes n01 as (n01:`biolink:BiologicalEntity`), but when there is more than one, it instead makes n01 a NamedThing and puts the labels in a WHERE clause.
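
The two node styles described above can be sketched roughly as follows. This is only an illustration, not the transpiler's actual code: the function names are hypothetical, and the `biolink:NamedThing` base label and the OR-joined label test are assumptions about what the generated Cypher looks like.

```python
def node_pattern_inline(var, category):
    """Single category: put the label directly on the node pattern."""
    return f"({var}:`{category}`)"


def node_pattern_where(var, categories, base_label="biolink:NamedThing"):
    """Several categories: match a base label and test the labels in WHERE.

    Assumption: the base label is `biolink:NamedThing` and the label tests
    are joined with OR. The real transpiler output may differ.
    """
    pattern = f"({var}:`{base_label}`)"
    condition = " OR ".join(f"{var}:`{c}`" for c in categories)
    return pattern, f"({condition})"


# Slow case from this issue: one category, label on the node.
inline = node_pattern_inline("n01", "biolink:BiologicalEntity")

# Fast case: the same category expressed through WHERE instead.
pattern, where = node_pattern_where("n01", ["biolink:BiologicalEntity"])
print(inline)
print(pattern, "WHERE", where)
```

Passing a one-element list to `node_pattern_where` corresponds to rewriting the slow query in the WHERE style by hand.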

For some reason, this makes a big difference in performance. When the labels are in a WHERE clause, the query plan is what you would expect: expand n00-n01 and n02-n01, then intersect on n01. When the label is on the node, Neo4j instead expands n00-n01, then n01-n02, and only then intersects with n02.

If you change the slow query to use the WHERE version, then neo4j uses the better query plan and performance is fine.

Is this generally true, though? Not sure how to evaluate that.
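
One possible way to evaluate this would be to compare the plans of both variants over a set of representative queries, using Cypher's EXPLAIN and PROFILE prefixes. The helper below is a hypothetical sketch that only builds the statements. The two example queries are simplified placeholders, not the transpiler's output, and running them against a database is left to whatever client is in use.

```python
def plan_statements(variants):
    """Return EXPLAIN and PROFILE statements for each named query variant."""
    statements = {}
    for name, cypher in variants.items():
        statements[name] = {
            # EXPLAIN returns the plan without executing the query.
            "explain": f"EXPLAIN {cypher}",
            # PROFILE executes the query and reports rows and db hits per operator.
            "profile": f"PROFILE {cypher}",
        }
    return statements


# Placeholder queries (assumptions for illustration only).
variants = {
    "inline_label": "MATCH (n01:`biolink:BiologicalEntity`) RETURN count(n01)",
    "where_label": (
        "MATCH (n01:`biolink:NamedThing`) "
        "WHERE n01:`biolink:BiologicalEntity` RETURN count(n01)"
    ),
}

for name, stmts in plan_statements(variants).items():
    print(name, "->", stmts["explain"])
```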

cbizon commented 1 year ago

Furthermore, this seems to interact badly with the predicate stuff that is generated in the Cypher. The Cypher has a big block of types in the edge, and then WHERE clauses to make sure that the directionality is correct. But if you take all of that out, then the query runs quickly again and the plan is good.

cbizon commented 1 year ago

And you don't actually have to take all of that out: removing just the WHERE clause on the n00-n01 edge is enough. Removing the predicates from the [] while leaving the WHERE doesn't help.

cbizon commented 1 year ago

I can see some possible simplifications:

  1. When the input edge is 'related_to', you don't need any of this stuff, as any edge will match.
  2. When the input edge is symmetric, the WHERE can be simplified because you don't need to worry about which direction the edge points.
  3. If we can assert that only the canonical predicate is used, then we can reduce each list to only include the canonical predicates.
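
The three simplifications above could be sketched like this. It is a rough illustration under stated assumptions, not the transpiler's implementation: the function name, its parameters, and the exact shape of the WHERE clause (type() and startNode() tests) are all hypothetical, and the predicates in the usage lines are only examples.

```python
def edge_clause(edge, subj, obj, requested, forward=(), backward=(),
                symmetric=(), canonical=None):
    """Return (pattern, where) for one query edge; where is None if not needed.

    forward:   edge types that satisfy the query when stored subj -> obj
    backward:  edge types that satisfy it when stored obj -> subj
    symmetric: edge types whose direction does not matter
    canonical: optional set of canonical predicates actually used in the graph
    """
    # 1. related_to matches any edge: no type block, no WHERE.
    if "biolink:related_to" in requested:
        return f"({subj})-[{edge}]-({obj})", None

    forward, backward, symmetric = list(forward), list(backward), list(symmetric)

    # 3. If only canonical predicates are stored, drop everything else.
    if canonical is not None:
        forward = [t for t in forward if t in canonical]
        backward = [t for t in backward if t in canonical]
        symmetric = [t for t in symmetric if t in canonical]

    types = list(dict.fromkeys([*forward, *backward, *symmetric]))
    if not types:
        raise ValueError("no edge types left to match")
    type_block = "|".join(f"`{t}`" for t in types)
    pattern = f"({subj})-[{edge}:{type_block}]-({obj})"

    # 2. Only symmetric types remain: direction is irrelevant, no WHERE.
    if not forward and not backward:
        return pattern, None

    conditions = []
    if forward:
        conditions.append(f"(type({edge}) IN {forward!r} AND startNode({edge}) = {subj})")
    if backward:
        conditions.append(f"(type({edge}) IN {backward!r} AND startNode({edge}) = {obj})")
    if symmetric:
        conditions.append(f"type({edge}) IN {symmetric!r}")
    return pattern, " OR ".join(conditions)


# The edges in this issue use related_to, so nothing extra is emitted.
print(edge_clause("e00", "n00", "n01", ["biolink:related_to"]))
# A directed predicate, reduced to its canonical form.
print(edge_clause("e00", "n00", "n01", ["biolink:treats"],
                  forward=["biolink:treats"], backward=["biolink:treated_by"],
                  canonical={"biolink:treats"}))
```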

cbizon commented 1 year ago

This is all implemented in the latest release.