Performance issue with`GEOSPreparedContains`

chrispahm commented 9 months ago

While working on a WebAssembly build of GEOS, I encountered a performance issue with GEOSPreparedContains while benchmarking various implementations of the geospatial contains function.

I compared the performance of GEOS (using the C-API) with other geospatial libraries, such as Turf.js (JS), geo (Rust), and vectorized Shapely (Python). I used the small scale Natural Earth 1:110m land dataset as the polygon, and a set of 1000 random points to test against the polygon. I measured the time to perform the point-in-polygon test for all points, excluding the time to load the dataset and create the geometries.

The results are shown in the table below:

Implementation of point-in-polygon test	Time (ms, lower is better)
GEOS-WASM (JS)	2074
Shapely (Python)	1814
GEOS (C-API)	1784
Turf.js (JS)	29
geo (Rust)	3

(Tested on a 2021 14" MacBook Pro M1, the source code for each tests can be found in this repo: https://github.com/chrispahm/contains-benchmark/tree/main/src )

As you can see, all GEOS backed implementations (Wasm, Shapely, C) performed the task significantly slower than the other libraries Turf.js and geo. This seems relatively strange, given that the test is relatively simple (127 polygons against 1000 points). I was wondering if there's a known reason for this performance gap, or maybe some trick that could be used to speed it up?

I also tried to use the GEOSPreparedContainsXY function instead of GEOSPreparedContains, as suggested in this issue, but it did not make a measurable difference in this case.

I am not trying to blame or criticize anyone with this issue, but rather to provide some constructive feedback and hopefully help to improve the library. I understand that GEOS is a complex and mature project, and that there may be trade-offs or limitations that I am not aware of. I also acknowledge that this benchmark is not comprehensive or representative of all use cases of GEOS. I just found it difficult to believe that a pure Javascript library performs this task faster than compiled C 😊

I am happy to provide more details or code samples if needed!

dbaston commented 9 months ago

The problem is that the GEOS "prepared" implementation is never invoked because the argument to GEOSPrepare is a GeometryCollection, not a Polygon or MultiPolygon. GEOS does not know that the input is strictly polygonal and therefore a point-in-polygon algorithm can be used. If your input is recast as a MultiPolygon, e.g.:

  GEOSGeoJSONReader *reader = GEOSGeoJSONReader_create();
  const GEOSGeometry *collection = GEOSGeoJSONReader_readGeometry(reader, polygon_str);
  GEOSGeometry* polygon = GEOSUnaryUnion(collection);
  GEOSGeom_destroy(collection);

then you should see quite different performance:

./out/contains.outc ./data/10000_random_points.geojson ./data/ne_10m_land.geojson
34

Maybe GEOSPrepare could inspect collection inputs and try to see if they can be represented differently? Or it could simply fail if its inputs cannot be prepared? Not sure about that.

dr-jts commented 9 months ago

Maybe GEOSPrepare could inspect collection inputs and try to see if they can be represented differently?

Note that due to the well-known quirk of the OGC contains and within semantic ("polygons do not contain their boundary"), testing for contains against polygonal inputs may require performing expensive boundary tests. In particular, in a GeometryCollection polygons may overlap, which means that checking for inclusion in the (effective) boundary is more complex. Usually a better approach is to test intersects or covers, which have simpler semantics allowing better optimization.

A test could be made for a collection containing a single Polygon or MultiPolygon. It seems to me too complex to test to see if a collection of Polygons is a valid MultiPolygon so that contains can be optimized.

Or it could simply fail if its inputs cannot be prepared?

I think the current "best-effort" semantics are simpler to use. Perhaps the documentation could provide more guidance on when specific inputs might not provide a performance boost (however, this is a bit of a moving target).

I'm working on a new implementation of the topology predicate algorithm (called RelateNG) which will support more extensive optimizations, including cases similar to this (i.e. a GeometryCollection containing a single Polygon against Point inputs).

dr-jts commented 9 months ago

It's impressive that Rust geo is so fast, since AFAICS it does not provide any optimization for repeated evaluation against the same input (i.e. a prepared equivalent using a spatial index). Or am I missing something?

dbaston commented 9 months ago

That's a good point about "contains", though calling GEOSPrepare on a GeometryCollection is still a silent no-op for less complicated predicates like "intersects". I think that's a confusing result (as this ticket shows), but I guess the behavior is too longstanding to error instead.

Better would be to have a PreparedGeometryCollection that implements intersects(). The lack of a non-owning collection types makes the implementation a little tricky.

dr-jts commented 9 months ago

Better would be to have a PreparedGeometryCollection that implements intersects().

Yes, that's the best extension of the current implementation. interescts is fairly easy. Other predicates will require evaluation of topology at node points, which is a fair bit more complex.

The lack of a non-owning collection types makes the implementation a little tricky.

How so?

dbaston commented 9 months ago

How so?

I had been thinking of just extracting the points, lines, and polygons from the GeometryCollection and implementing PreparedGeometryCollection::intersects as PreparedPoint::intersects() || PreparedLineString::intersects() || PreparedPolygon::intersects(). But with the current Geometry API you can't do that without copying the inputs into new collections.

chrispahm commented 9 months ago

Thanks @dbaston and @dr-jts for your replies and explanations!

The problem is that the GEOS "prepared" implementation is never invoked because the argument to GEOSPrepare is a GeometryCollection, not a Polygon or MultiPolygon.

Apart from code changes, I think a note in the GEOSPrepare docs stating that "prepared" methods are only invoked with Polygons and MultiPolygons would be helpful. Sorry if it's already there and I missed it, but if not I could do a PR if desired.

Or it could simply fail if its inputs cannot be prepared?

As a user, I agree with @dr-jts that the current semantics are easier to use!

calling GEOSPrepare on a GeometryCollection is still a silent no-op for less complicated predicates like "intersects". I think that's a confusing result (as this ticket shows), but I guess the behavior is too longstanding to error instead.

Just an idea, but maybe a notice message could be issued in this case? I wouldn't have known that my call to GEOSPreparedContains never actually invokes the prepared implementation with a GeometryCollection if it wasn't for this issue.

A test could be made for a collection containing a single Polygon or MultiPolygon. It seems to me too complex to test to see if a collection of Polygons is a valid MultiPolygon so that contains can be optimized.

Sorry if this is naïve, but would it possibly make sense to make this test when the GeometryCollection is created? E.g. checking if a GeometryCollection is homogenous of type Polygon/MultiPolygon/etc., and storing that info as part of the geometry. The "prepared" methods (and potentially others) could then check for the homogenous and GeomType properties and use that for subsequent optimizations.

@dbaston I tried your suggestion of first creating a GEOSUnaryUnion of the GeometryCollection, and preparing the resulting geometry (in C, Py, JS). Testing the unionized polygon against 1,000,000 points I get the following results:

That's great!

However, despite the explanation above I still haven't understood why exactly GEOSPreparedContains is much slower if I call it directly with the GeometryCollection than if I check each geometry of the GeometryCollection against each point individually though. In this case the performance is much better even without using the GEOSPreparedContains function. I just tried that in JS using GEOS-WASM, which takes ~100 ms for completing, as opposed to the ~2080 ms shown above .

const points = pointsFeatureCollection
  .features
  .map(f => geojsonToGeosGeom(f, geos));
const polygon = geojsonToGeosGeom(polygonFeatureCollection, geos);
const prepared_polygon = geos.GEOSPrepare(polygon);

let start = performance.now();
const result_fast = points.map(p => 
  [...Array(geos.GEOSGetNumGeometries(polygon)).keys()]
    .some(i => geos.GEOSContains(geos.GEOSGetGeometryN(polygon, i), p))
);
// takes ~100 ms
let end = performance.now();
console.log(~~(end - start));

start = performance.now();
const result_slow = points.map(p => geos.GEOSPreparedContains(prepared_polygon, p) ? true : false);
// takes ~2080 ms
end = performance.now();
console.log(~~(end - start));

dr-jts commented 9 months ago

Apart from code changes, I think a note in the GEOSPrepare docs stating that "prepared" methods are only invoked with Polygons and MultiPolygons would be helpful. Sorry if it's already there and I missed it, but if not I could do a PR if desired.

The doc should say that GeometryCollections may not be optimized as prepared geometries.

dr-jts commented 9 months ago

Just an idea, but maybe a notice message could be issued in this case? I wouldn't have known that my call to GEOSPreparedContains never actually invokes the prepared implementation with a GeometryCollection if it wasn't for this issue.

Personally I'm opposed to notice messages, since they also may never be noticed, in headless operation. It seems to me that documentation is enough. You need to know how the API works in order to use it effectively.

dr-jts commented 9 months ago

However, despite the explanation above I still haven't understood why exactly GEOSPreparedContains is much slower if I call it directly with the GeometryCollection than if I check each geometry of the GeometryCollection against each point individually though.

This is because the algorithm(s) for prepared geometries are not designed to optimize GeometryCollection inputs. Instead, the call is simply passed on to the regular contains operation. And this in turn is not efficient at evaluating GeometryCollection inputs. I am a bit surprised that your brute-force code is so much faster than the GEOS internal code, so maybe there is something awry with that code. But it was never designed to optimize this case.

dbaston commented 8 months ago

I am a bit surprised that your brute-force code is so much faster than the GEOS internal code, so maybe there is something awry with that code.

I think the two methods are not equivalent, for the reason you mentioned above:

In particular, in a GeometryCollection polygons may overlap, which means that checking for inclusion in the (effective) boundary is more complex.

With the looping method above you can avoid computing topology on the entire collection and return true as soon as you find a containing polygon.

dbaston commented 8 months ago

I had been thinking of just extracting the points, lines, and polygons from the GeometryCollection and implementing PreparedGeometryCollection::intersects as PreparedPoint::intersects() || PreparedLineString::intersects() || PreparedPolygon::intersects(). But with the current Geometry API you can't do that without copying the inputs into new collections.

I guess another potential implementation is to have PreparedGeometryCollection maintain arrays of PreparedPolygon, PreparedLineString and PreparedPoint. This would not require copying everything. It might not be optimal but is almost certainly better than the status quo.

libgeos / geos

Performance issue with`GEOSPreparedContains` #1024