Open nw31304 opened 7 years ago
Here's a breakdown of the branching factor, partially because it may be useful but mostly because I like running Cypher:
neo4j> MATCH (:ReferenceNode)-->({ layer: "pointLayer"})-[:RTREE_ROOT]->(root) with root match p=(root)-[rels:RTREE_CHILD*0..]->() with length(rels) as len return len,count(*) order by len ;
+----------------+
| len | count(*) |
+----------------+
| 0 | 1 |
| 1 | 26 |
| 2 | 1721 |
| 3 | 71991 |
+----------------+
4 rows available after 304 ms, consumed after another 0 ms
OK, I've debugged a but more, and the issue is with the saveCount() method in src/main/java/org/neo4j/gis/spatial/rtree/RTreeIndex.java.
With essentially every (interesting) request to a procedure advertised in the spatial plugin, a call is made to SpatialProcedures.getLayerOrThrow(). This function in turn calls getLayer() on the spatial service, and ultimately a new instance of the layer class is instantiated. As part of the initialisation, DefaultLayer.initialize() gets invoked and an instance of RTreeIndex is created for the new layer. RTreeIndex's constructor invokes initIndexMetadata(), which finds an existing metadata node in the DB. The last thing it does is call saveCount(), and that's where the hilarity begins.
The job of saveCount() is to record the number of nodes already indexed in the DB for use in RTree balancing, etc. Since this is a brand new instance (with every request!) it does an exhaustive search of the DB to determine the indexed node count, and saves it in a variable called totalGeometryCount. Unfortunately, this exhaustive search is done with each and every request, which falls into the category of "no big deal" for small indexes, but in my case I have millions of nodes already indexed.
How can I avoid this overhead if I wished to issue a large number spatial queries against a WKT layer containing millions of indexed nodes using Cypher and the spatial procedures provided by the plugin?
+1 - seeing the same performance impact here
I think this is something of a design clash between the original code which was expected to be called in a java app as embedded neo4j, and have the app maintain a reference to the layer, and the newer use of procedures, where the layer is recreated on every procedure (with associated inefficiencies, in particular the saveCount one noted above). I see two ways forward here:
1- Work around this by getting the spatial procedures to hide this history be maintaining a cache of known layers, and avoiding re-initializeing layers. This could suffer from stale cache issues though, and requires some care to get right. 2- Find a better solution to the total node count idea, not requiring maintaining this count in the graph as we do today.
This second one is preferable because it solves another issue we see here, and that is that apparently read-only queries end up doing writes, which is conceptually not really rational. So I think I'll see if I can figure out a way to take this second route. I know I already looked at this issue last year, and did not get a solution then. Hopefully a fresh look will help.
I have loaded neo4j with roughly 3.7M nodes from a CSV dump of a PostGIS DB. The nodes (labeled "JUNCTION") contain a WKT POINT feature in a "geom" key along with a few other small pieces of metadata. Here's a taste of the data:
I then created a WKT spatial layer as follows:
call spatial.addWKTLayer("wktLayer", "geom");
I next added all of the 3.7M nodes to the index by running the following Cypher in a loop until it returned 0:
Next, I attempted a few spatial queries and was surprised to find that a simple BBOX search took over 2 seconds:
All spatial queries seem to take as long. For example:
Thinking the problem was related to the WKT layer, I extracted the x,y coordinates from the POINT WKT and added long, lat float attributes and tried adding them to a simple X,Y point layer, but the performance was no better. When I added more point features to the same layer from CSV dumps of other tables, the response time increased proportionally to the number of nodes added. The point features represent POIs in New Zealand and Australia, so they do tend to be rather tightly clustered around population centers, but I really don't think point density should present an R-Tree with any heartache.
I tried neo4j 3.1.1 and neo4j 3.1.4, each with the corresponding versions of neo4j-spatial. The version made no difference.
I'm running this in Amazon AWS in a Amazon M3.xlarge instance with 4 VCPUs, SSD storage, and 16GB of RAM running Ubuntu 16.04. The machine is dedicated to Neo4J. Subjectively, all other Neo4J functionality seems blazingly fast with the exception of interacting with the spatial plugin. For example, even querying to see which spatial layers are present seems to take an excessive amount of time:
Note that it appears the the R-Tree is well balanced:
Can anyone offer any insight as to where the problem might lie? I'm happy to provide whatever diagnostics or documentation needed to resolve this.