paleolimbot / geos

Open Source Geometry Engine ('GEOS') R API
https://paleolimbot.github.io/geos/
Other
61 stars 8 forks source link

st_intersect faster than geos_intersects_matrix() #66

Open vronizor opened 2 years ago

vronizor commented 2 years ago

As per @paleolimbot 's suggestion, here is a reprex of a benchmark comparing spatial intersects with sf vs geos . st_intersects does seem faster than geos_intersects_matrix() at the moment.

library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.4.0, PROJ 8.1.1; sf_use_s2() is TRUE
library(geos)
library(data.table)
library(microbenchmark)
options(timeout=10000)

rm(list = ls())
gc()
#>           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells  901867 48.2    1397940 74.7         NA  1397940 74.7
#> Vcells 1578643 12.1    8388608 64.0      16384  2650514 20.3

url_trips = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-01.csv"

download.file(url = url_trips, destfile = "trip_records.csv", mode = "wb") # takes some time and is pretty heavy!
trips.dt = fread("trip_records.csv", nrows = 100000) # enough rows to benchmark performance

# Keep only pickup (origin) coordinate columns
cols_drop = grep("pickup_l", names(trips.dt), value = T, invert = T)
trips.dt[, (cols_drop) := NULL]
trips.dt = trips.dt[!(pickup_longitude == 0 | pickup_latitude == 0)] # obviously invalid coordinates

# Generate geom
trips.dt[, `:=`(o_geom = geos_make_point(pickup_longitude, 
                                         pickup_latitude, 
                                         crs=4326))]

# Generate sf
trips.sf = st_as_sf(trips.dt)
trips.sf = st_transform(trips.sf, 4326)

# Get polygons to intersect
url_zones = "https://s3.amazonaws.com/nyc-tlc/misc/taxi_zones.zip"
download.file(url_zones, destfile = "zones_polygons.zip", mode = "wb")
unzip("zones_polygons.zip", exdir = file.path("zones"))

# sf version
zones.sf = st_read(file.path("zones", "taxi_zones.shp"))
#> Reading layer `taxi_zones' from data source 
#>   `/private/var/folders/y3/cg53v8tx3g538m3751w1_9v80000gn/T/RtmpwdNjKW/reprex-4d044a0e522a-cilia-stag/zones/taxi_zones.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 263 features and 6 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
#> Projected CRS: NAD83 / New York Long Island (ftUS)
zones.sf = st_transform(zones.sf, 4326)

# geom version
zones.dt = as.data.table(zones.sf)[, geom := as_geos_geometry(geometry, crs = 4326)]
zones.dt[, geometry := NULL]

# Functions to test
geos.dt = function(x) {
  geos_intersects_matrix(zones.dt[, geom], trips.dt[, o_geom])
}

sf = function(x) {
  st_intersects(zones.sf, trips.sf)
}

# Benchmark
microbenchmark(geos = geos.dt(),
               sf = sf(),
               times = 10)
#> Unit: seconds
#>  expr      min       lq     mean   median       uq      max neval
#>  geos 4.217741 4.627104 5.214595 5.222753 5.842305 5.876518    10
#>    sf 3.014916 3.103549 3.493810 3.328212 3.455878 5.066740    10

Created on 2022-02-15 by the reprex package (v2.0.1)

kadyb commented 2 years ago

Shouldn't you use a planar CRS (eg 2263)? It looks like the difference is even greater then.

result = bench::mark(
  check = FALSE, iterations = 10, time_unit = "s",
  geos.dt(),
  sf_4326(),
  sf_2263()
)
result[, 1:6]

  expression   min median `itr/sec` mem_alloc `gc/sec`
1 geos.dt()  4.80   5.05      0.199    3.79MB    2.79 
2 sf_4326()  2.62   2.79      0.351    6.61MB    0.351
3 sf_2263()  0.709  0.730     1.31     5.47MB    1.31

I'm also not sure that {sf} and {geos} work exactly the same way. geos_intersects_matrix() expects STRTree as the second argument and it takes the most time to convert to this type, but @paleolimbot will explain it best.

system.time(trips.dt2 <- as_geos_strtree(trips.dt[, o_geom]))
#> user  system elapsed 
#> 4.88    0.02    4.89
system.time(geos_intersects_matrix(zones.dt[, geom], trips.dt2))
#> user  system elapsed 
#> 0.14    0.00    0.14
paleolimbot commented 2 years ago

Thanks for this!

I'm 99% certain that the difference is because of how accumulating potentially matching features (based on bbox) is implemented (in sf it's std::vector and here I over-allocate an intermediary buffer that's way too big and I should just use std::vector).