Open rexdouglass opened 4 years ago
Another theory I have is the spatial join is keeping copies of both the A and B geoms. When the B geom is large, it's making many redundant unnecessary copies of it. Setting keep=F doesn't alter the behavior
E.g.
d3 <- d1 %>% left_join(d2 %>% rename(geom_b=geom), sql_on = sql("ST_Within(geom_a, geom_b)"), keep=F) %>% dplyr::select(-geom_b,-geom_a)
New experiment, rewriting the sql to remove the geoms in the first select as opposed to after the fact does speed up things but not more than 5%. Most of the overhead is still somewhere else.
sql <- "SELECT `LHS`.`osm_id` AS `osm_id`, `RHS`.`geoboundaries_adm0_shapeID_noversion`
FROM `d1` AS `LHS`
LEFT JOIN (SELECT `geoboundaries_adm0_shapeName`, `geoboundaries_adm0_shapeID_noversion`, `geoboundaries_adm0_shapeISO`, `geom` AS `geom_b`
FROM `d2`) `RHS`
ON (ST_Within(geom_a, geom_b))"
d3 <- sql %>%
dbplyr::sql() %>%
tbl(sc, .)
My last experiment improved performance into ranges I was expecting, the right hand side polygons are very large and detailed (national boundaries).
Hitting them with mutate(geom=ST_SimplifyPreserveTopology(geom,.001))
increased the speed of the join into the sublinear range in a way that makes me think that it's using a spatial index though I might still not have that configured correctly.
Setting the simplification threshold to 0.1 or 0.01 looked visually ok but threw "com.vividsolutions.jts.geom.TopologyException: side location conflict" errors when used in the join.
I hope this is useful to others, and that there are other performance tuning steps I missed that people can recommend.
Even with the simplified polygons, my left join of 480k points within 198 possible country polygons looks like it's going to take 6 hours on a 48 thread machine.
@rexdouglass it seems you do not turn on the spatial index function.
notice: the bigger table should be in the first place, and the small one follows.
this function works well on our production environment like this:
SELECT fence_id,top_poi_id,fence_df.geoshape as fence_geoshape,roadid,roadname,roaduid,wkt,'1' as is_isolation
FROM road_df,fence_df
WHERE ST_Intersects(road_df.geoshape,fence_df.geoshape)
here is another tip: ST_Intersects is better than ST_Within and ST_Contains when you do not clearly know the order between two geometries.
I'm experimenting with geospark and find the spatial joins slower than expected.
I've set geospark.join.gridtype to "kdbtree" in my configuration below.
Is there something else I need to do to enable or use spatial indexes when creating, saving, or joining on parquets with a geom column?