harryprince / geospark

bring sf to spark in production
https://github.com/harryprince/geospark/wiki

Availability of Spatial Indexes #17

Open rexdouglass opened 4 years ago

rexdouglass commented 4 years ago

I'm experimenting with geospark and find the spatial joins slower than expected.

I've set geospark.join.gridtype to "kdbtree" in my configuration below.

Is there something else I need to do to enable or use spatial indexes when creating, saving, or joining on parquets with a geom column?

library(tidyverse)
library(sparklyr)
library(geospark)

conf <- spark_config()

conf$`sparklyr.cores.local` <- 48
conf$spark.serializer <- "org.apache.spark.serializer.KryoSerializer"
conf$spark.kryo.registrator <- "org.datasyslab.geospark.serde.GeoSparkKryoRegistrator"
conf$spark.kryoserializer.buffer.max <- "2047MB" # must be < 2048 MB, otherwise: java.lang.IllegalArgumentException
conf$geospark.join.gridtype <- "kdbtree"

conf$spark.memory.fraction <- 0.9
conf$spark.executor.heartbeatInterval <- "6000000s"
conf$spark.network.timeout <- "6000001s"
conf$spark.local.dir <- "/media/skynet2/905884f0-7546-4273-9061-12a790830beb/spark_temp/"
conf$spark.worker.cleanup.enabled <- "true"
conf$`sparklyr.shell.driver-memory` <- "300G"
conf$spark.driver.maxResultSize <- 0 # 0 = unlimited

sc <- spark_connect(master = "local", config = conf,
                    version = "2.3.3") # Spark version required by geospark
sc <- register_gis(sc)
rexdouglass commented 4 years ago

Another theory I have is that the spatial join is keeping copies of both the A and B geometries. When the B geometry is large, it makes many redundant, unnecessary copies of it. Setting keep = FALSE doesn't alter the behavior.

E.g.

d3 <- d1 %>%
  left_join(d2 %>% rename(geom_b = geom),
            sql_on = sql("ST_Within(geom_a, geom_b)"),
            keep = FALSE) %>%
  dplyr::select(-geom_b, -geom_a)

New experiment: rewriting the SQL to drop the geometry columns in the initial SELECT, rather than after the fact, does speed things up, but by no more than 5%. Most of the overhead is still somewhere else.

sql <- "SELECT `LHS`.`osm_id` AS `osm_id`, `RHS`.`geoboundaries_adm0_shapeID_noversion` 
FROM `d1` AS `LHS`
LEFT JOIN (SELECT `geoboundaries_adm0_shapeName`, `geoboundaries_adm0_shapeID_noversion`, `geoboundaries_adm0_shapeISO`, `geom` AS `geom_b`
FROM `d2`) `RHS`
ON (ST_Within(geom_a, geom_b))"

d3 <- sql %>%
  dbplyr::sql() %>%
  tbl(sc, .)
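To check whether GeoSpark actually chose an optimized spatial join rather than a plain nested-loop join, one option is to print the physical plan from R; a minimal sketch assuming the d3 table defined above (the operator name to look for is based on my reading of the GeoSpark docs, so treat it as an assumption):

```r
# Print the generated SQL and the Spark physical plan for the lazy table.
# If GeoSpark's join optimization kicked in, the plan should show a spatial
# join operator (e.g. "RangeJoin") rather than BroadcastNestedLoopJoin.
d3 %>% dplyr::explain()
```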
rexdouglass commented 4 years ago

My last experiment improved performance into the range I was expecting. The right-hand-side polygons are very large and detailed (national boundaries).

Hitting them with mutate(geom = ST_SimplifyPreserveTopology(geom, 0.001)) sped the join up into the sublinear range, which makes me think it is using a spatial index after all, though I might still not have that configured correctly.

Setting the simplification threshold to 0.1 or 0.01 looked visually OK but threw "com.vividsolutions.jts.geom.TopologyException: side location conflict" errors when used in the join.
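A common JTS workaround for this kind of TopologyException (an assumption on my part, not something I've verified against these exact polygons) is to repair invalid geometries with a zero-width buffer before simplifying:

```r
# Hypothetical repair step: ST_Buffer(geom, 0.0) rebuilds each polygon,
# which often removes the self-intersections that trigger
# "side location conflict" during joins.
d2_clean <- d2 %>%
  mutate(geom = ST_Buffer(geom, 0.0)) %>%
  mutate(geom = ST_SimplifyPreserveTopology(geom, 0.001))
```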

I hope this is useful to others, and that there are other performance tuning steps I missed that people can recommend.

Even with the simplified polygons, my left join of 480k points within 198 possible country polygons looks like it's going to take 6 hours on a 48 thread machine.

harryprince commented 4 years ago

@rexdouglass it seems you have not turned on the spatial index.
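The relevant GeoSparkSQL settings can be added to the sparklyr config; a sketch based on the GeoSpark parameter documentation (parameter names as documented there, not verified against this exact geospark release):

```r
conf$geospark.global.index <- "true"         # build spatial indexes on the fly during joins
conf$geospark.global.indextype <- "quadtree" # index type: "quadtree" or "rtree"
conf$geospark.join.gridtype <- "kdbtree"     # spatial partitioning grid
```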

Note: the bigger table should come first, and the smaller one should follow.

This works well in our production environment, like this:

SELECT fence_id, top_poi_id, fence_df.geoshape AS fence_geoshape,
       roadid, roadname, roaduid, wkt, '1' AS is_isolation
FROM road_df, fence_df
WHERE ST_Intersects(road_df.geoshape, fence_df.geoshape)

Here is another tip: ST_Intersects is better than ST_Within or ST_Contains when you do not know in advance which of the two geometries contains the other.