neo4j-contrib / spatial

Neo4j Spatial is a library of utilities for Neo4j that faciliates the enabling of spatial operations on data. In particular you can add spatial indexes to already located data, and perform spatial operations on the data like searching for data within specified regions or within a specified distance of a point of interest. In addition classes are provided to expose the data to geotools and thereby to geotools enabled applications like geoserver and uDig.
http://neo4j-contrib.github.io/spatial
Other
781 stars 192 forks source link

Slow spatial queries with spatial 0.24 on Neo4J 3.1.1 and 3.1.4 #313

Open nw31304 opened 7 years ago

nw31304 commented 7 years ago

I have loaded neo4j with roughly 3.7M nodes from a CSV dump of a PostGIS DB. The nodes (labeled "JUNCTION") contain a WKT POINT feature in a "geom" key along with a few other small pieces of metadata. Here's a taste of the data:

[/var/lib/neo4j/import]# head neo_junction_nodes.csv
feat_id:ID,z_level:int,geom:string,:LABEL
00004143-5400-2800-0000-00000bec976c,0,POINT(149.321225 -35.3373071),JUNCTION
00004143-5400-2800-0000-00000bec976d,1,POINT(149.3836557 -35.3333975),JUNCTION
00004143-5400-2800-0000-00000bec976e,0,POINT(149.1478638 -35.3189651),JUNCTION

I then created a WKT spatial layer as follows:

call spatial.addWKTLayer("wktLayer", "geom");

I next added all of the 3.7M nodes to the index by running the following Cypher in a loop until it returned 0:

 MATCH (x:JUNCTION) WHERE NOT (x)<-[:RTREE_REFERENCE]-() 
 WITH x LIMIT 10000 
 WITH COLLECT(x) as nodes 
 CALL spatial.addNodes("wktLayer",nodes) yield count 
 RETURN count;

Next, I attempted a few spatial queries and was surprised to find that a simple BBOX search took over 2 seconds:

neo4j> call spatial.bbox("wktLayer", {lat: -37.704425, lon: 176.269706}, {lat: -37.701325, lon: 176.273908 }) yield node as d return count(*);
+----------+
| count(*) |
+----------+
| 13       |
+----------+

1 row available after 2091 ms, consumed after another 0 ms

All spatial queries seem to take as long. For example:

 call spatial.withinDistance("wktLayer", {lat: -37.704425, lon: 176.269706}, .3) yield node as d return count(*);
+----------+
| count(*) |
+----------+
| 13       |
+----------+

1 row available after 2016 ms, consumed after another 0 ms

Thinking the problem was related to the WKT layer, I extracted the x,y coordinates from the POINT WKT and added long, lat float attributes and tried adding them to a simple X,Y point layer, but the performance was no better. When I added more point features to the same layer from CSV dumps of other tables, the response time increased proportionally to the number of nodes added. The point features represent POIs in New Zealand and Australia, so they do tend to be rather tightly clustered around population centers, but I really don't think point density should present an R-Tree with any heartache.

I tried neo4j 3.1.1 and neo4j 3.1.4, each with the corresponding versions of neo4j-spatial. The version made no difference.

I'm running this in Amazon AWS in a Amazon M3.xlarge instance with 4 VCPUs, SSD storage, and 16GB of RAM running Ubuntu 16.04. The machine is dedicated to Neo4J. Subjectively, all other Neo4J functionality seems blazingly fast with the exception of interacting with the spatial plugin. For example, even querying to see which spatial layers are present seems to take an excessive amount of time:

neo4j> call spatial.layers;
+-------------------------------------------------------------------------------------------------------------------------+
| name         | signature                                                                                                |
+-------------------------------------------------------------------------------------------------------------------------+
| "wktLayer" | "EditableLayer(name='wktLayer', encoder=WKTGeometryEncoder(geom='geom', bbox='bbox'))" |
+-------------------------------------------------------------------------------------------------------------------------+

1 row available after 3951 ms, consumed after another 0 ms

Note that it appears the the R-Tree is well balanced:

MATCH (:ReferenceNode)-->({ layer: "wktLayer"})-[:RTREE_ROOT]->(root) 
WITH root 
MATCH p = (root) -[:RTREE_CHILD*] ->(child) -[:RTREE_REFERENCE]->(leaf) 
RETURN length(p) as depth, count (*) as freq;

depth, freq
4, 3770889
 MATCH (:ReferenceNode)-->({ layer: "wktLayer"})-[:RTREE_ROOT]->(root) 
 WITH root 
 MATCH path=(root) -[rels:RTREE_CHILD*] ->(child) where size( (child)-[:RTREE_CHILD]->()) = 0 
 WITH length(path) as l 
 RETURN l,count(*);

l, count(*)
3, 54755

Can anyone offer any insight as to where the problem might lie? I'm happy to provide whatever diagnostics or documentation needed to resolve this.

nw31304 commented 7 years ago

Here's a breakdown of the branching factor, partially because it may be useful but mostly because I like running Cypher:

neo4j> MATCH (:ReferenceNode)-->({ layer: "pointLayer"})-[:RTREE_ROOT]->(root) with root match p=(root)-[rels:RTREE_CHILD*0..]->() with length(rels) as len return len,count(*) order by len ;
+----------------+
| len | count(*) |
+----------------+
| 0   | 1        |
| 1   | 26       |
| 2   | 1721     |
| 3   | 71991    |
+----------------+

4 rows available after 304 ms, consumed after another 0 ms
nw31304 commented 7 years ago

OK, I've debugged a but more, and the issue is with the saveCount() method in src/main/java/org/neo4j/gis/spatial/rtree/RTreeIndex.java.

With essentially every (interesting) request to a procedure advertised in the spatial plugin, a call is made to SpatialProcedures.getLayerOrThrow(). This function in turn calls getLayer() on the spatial service, and ultimately a new instance of the layer class is instantiated. As part of the initialisation, DefaultLayer.initialize() gets invoked and an instance of RTreeIndex is created for the new layer. RTreeIndex's constructor invokes initIndexMetadata(), which finds an existing metadata node in the DB. The last thing it does is call saveCount(), and that's where the hilarity begins.

The job of saveCount() is to record the number of nodes already indexed in the DB for use in RTree balancing, etc. Since this is a brand new instance (with every request!) it does an exhaustive search of the DB to determine the indexed node count, and saves it in a variable called totalGeometryCount. Unfortunately, this exhaustive search is done with each and every request, which falls into the category of "no big deal" for small indexes, but in my case I have millions of nodes already indexed.

How can I avoid this overhead if I wished to issue a large number spatial queries against a WKT layer containing millions of indexed nodes using Cypher and the spatial procedures provided by the plugin?

night199uk commented 7 years ago

+1 - seeing the same performance impact here

craigtaverner commented 7 years ago

I think this is something of a design clash between the original code which was expected to be called in a java app as embedded neo4j, and have the app maintain a reference to the layer, and the newer use of procedures, where the layer is recreated on every procedure (with associated inefficiencies, in particular the saveCount one noted above). I see two ways forward here:

1- Work around this by getting the spatial procedures to hide this history be maintaining a cache of known layers, and avoiding re-initializeing layers. This could suffer from stale cache issues though, and requires some care to get right. 2- Find a better solution to the total node count idea, not requiring maintaining this count in the graph as we do today.

This second one is preferable because it solves another issue we see here, and that is that apparently read-only queries end up doing writes, which is conceptually not really rational. So I think I'll see if I can figure out a way to take this second route. I know I already looked at this issue last year, and did not get a solution then. Hopefully a fresh look will help.