tetherless-world / mcs-apps

DARPA Machine Common Sense (MCS) applications for exploring knowledge graphs of common sense, benchmark questions, and question-answering processes
MIT License

Speed up source filtering in neo4j #201

Open gordom6 opened 4 years ago

gordom6 commented 4 years ago

Sources are currently (20200813) modeled as separate nodes in neo4j, with KgNodes connected to them via :SOURCE relationships. Neo4j doesn't allow you to index relationships like that: https://community.neo4j.com/t/how-can-i-use-index-in-relationship/1627

Apparently Lucene (full-text) indices are the way to go: https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-full-text-search/ Maybe relationship indexes?
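
For reference, a minimal sketch of a full-text relationship index using the procedures described on that manual page (Neo4j 3.5/4.x; later versions use CREATE FULLTEXT INDEX instead). It assumes a hypothetical `sourceId` string property on the :SOURCE relationships, which the current model does not have, a hypothetical index name `sourceRels`, and that :SOURCE relationships point from the KgNode to the Source node.

```cypher
// Hypothetical: index an assumed `sourceId` property on :SOURCE relationships.
CALL db.index.fulltext.createRelationshipIndex("sourceRels", ["SOURCE"], ["sourceId"]);

// Find relationships whose indexed property matches "CN", then count the
// distinct nodes at their start (assumed direction: KgNode -> Source).
CALL db.index.fulltext.queryRelationships("sourceRels", "CN")
YIELD relationship
RETURN count(DISTINCT startNode(relationship)) AS nodeCount;
```

Whether this would beat a plain traversal from a single Source node has not been measured here.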

I'm also open to remodeling the way we handle sources. We moved to the current model because nodes can't have multi-valued properties.

gordom6 commented 4 years ago

I noticed this is slow when the full CSKG is loaded. It takes on the order of seconds to get "all nodes in ConceptNet" (one of the example queries), which suggests it's doing a full scan of the database.

123joshuawu commented 4 years ago

@gordom6 Did some digging and I think the culprit is the getMatchingNodesCount query.

For the "all nodes in ConceptNet" query, I did a quick measurement of elapsed execution time:

getMatchingNodes elapsed <1s
getMatchingNodesFacets elapsed <1s
getMatchingNodesCount elapsed 5.8s

The Cypher query run by getMatchingNodesCount is:

MATCH (node: Node), (source0:Source { id: "CN" })
WHERE (node)-[:SOURCE]-(source0)
RETURN COUNT(node)

I also profiled the query, which confirms your suspicion that "it's doing a full scan of the database" (attached image: plan(3)).
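
A profile like this can be reproduced by prefixing the query with PROFILE, which runs it and reports rows and db hits per operator. The operator names mentioned in the comments are standard Neo4j planner operators, not a transcription of the attached plan.

```cypher
// PROFILE executes the query and returns the plan with per-operator statistics.
// Signs of a scan-heavy plan: NodeByLabelScan over every :Node, or a
// CartesianProduct between the :Node and :Source matches.
PROFILE
MATCH (node: Node), (source0:Source { id: "CN" })
WHERE (node)-[:SOURCE]-(source0)
RETURN COUNT(node)
```

EXPLAIN in place of PROFILE shows the planned operators without executing the query, which is useful when the query itself takes several seconds.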

I think the source of the slowdown is matching the nodes first and then filtering on the relationship in WHERE, as opposed to putting the relationship pattern in MATCH itself (as shown below):

MATCH (node: Node), (source0:Source { id: "CN"}), (source0)-[:SOURCE]-(node)
RETURN COUNT(node)
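
Two further variants, offered as a sketch rather than a measured fix. The first writes the same pattern as a single path anchored on the Source node. The second reads the relationship degree of the Source node and is only equivalent if every :SOURCE relationship on that node leads to exactly one distinct :Node, which is an assumption about the data. The index name `source_id` is hypothetical, and the syntax targets Neo4j 4.x (`size()` over a pattern is not available in Neo4j 5, which uses `COUNT { ... }`).

```cypher
// Optional: make the lookup of the single Source node an index seek.
// (Neo4j 3.5 syntax would be: CREATE INDEX ON :Source(id))
CREATE INDEX source_id FOR (s:Source) ON (s.id);

// Variant 1: one path pattern, expanding from the Source node over :SOURCE.
MATCH (source0:Source { id: "CN" })-[:SOURCE]-(node:Node)
RETURN COUNT(node);

// Variant 2 (assumption: one :SOURCE relationship per connected :Node):
// count the relationships on the Source node instead of visiting each node.
MATCH (source0:Source { id: "CN" })
RETURN size((source0)-[:SOURCE]-()) AS nodeCount;
```

Running each variant under PROFILE would show whether the plan starts from the Source node rather than scanning all :Node nodes.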

gordom6 commented 4 years ago

Sounds reasonable. Can you fix it or would you like me to?

123joshuawu commented 4 years ago

I will give it a shot

gordom6 commented 3 years ago

This is still taking ~15 seconds to load the example "nodes from WordNet" query.