mstrimas / smoothr

Spatial Feature Smoothing in R
http://strimas.com/smoothr
GNU General Public License v3.0

drop_crumbs very slow for large data #10

Open MikkoVihtakari opened 3 years ago

MikkoVihtakari commented 3 years ago

Thanks for a great package!

I noticed that drop_crumbs is very slow for large datasets because you seem to use a for loop in it. How about vectorizing it instead? Something along these lines:

library(smoothr)
#> 
#> Attaching package: 'smoothr'
#> The following object is masked from 'package:stats':
#> 
#>     smooth
library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1

data(jagged_polygons, package = "smoothr")

p <- rep(jagged_polygons$geometry[7], 10000)
area_thresh <- units::set_units(200, km^2)

system.time(p_dropped <- drop_crumbs(p, threshold = area_thresh))
#>    user  system elapsed 
#>  87.139   5.219  93.293

system.time({
  p2 <- sf::st_cast(p, "POLYGON")
  p2_dropped <- sf::st_union(p2[sf::st_area(p2) >= area_thresh,])
})
#> although coordinates are longitude/latitude, st_union assumes that they are planar
#>    user  system elapsed 
#>  16.486   0.166  16.773

par(mfrow = c(1,2))
plot(p_dropped, col = "red")
plot(p2_dropped, col = "blue")

Created on 2021-05-31 by the reprex package (v2.0.0)

The example produces only one MULTIPOLYGON instead of the 10000 features returned by your code, but I am sure one can find a way to combine them back into 10000 polygons.
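For what it's worth, here is a rough, untested sketch of one way to regroup the pieces so that the output keeps one feature per input geometry (the id column is just a helper I added for the grouping):

library(smoothr)
library(sf)
library(dplyr)

data(jagged_polygons, package = "smoothr")
p <- rep(jagged_polygons$geometry[7], 10000)
area_thresh <- units::set_units(200, km^2)

# cast all features to POLYGON in one go, remembering which input feature
# each piece came from
pieces <- st_cast(st_sf(id = seq_along(p), geometry = p), "POLYGON")

# drop the small pieces with a single vectorized filter
pieces <- pieces[st_area(pieces) >= area_thresh, ]

# regroup the surviving pieces by original feature id, so the result has
# (at most) one MULTIPOLYGON per input feature; st_combine() just collects
# the pieces rather than dissolving them
p_dropped <- pieces %>%
  group_by(id) %>%
  summarise(geometry = st_combine(geometry))

One caveat: features whose pieces all fall below the threshold disappear from the result entirely here, so the row count can end up smaller than the input; that would need extra handling if empty features should be kept.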

mstrimas commented 3 years ago

Thanks for pointing this out! I'm using the looping approach because I want to maintain the exact same structure of the sf object before and after the call to smoothr functions. However, given that a vectorized approach is significantly faster, it sounds like I should look into dropping the for loop while finding a way to still maintain the structure of the features. I'll investigate and update things here if I make any progress.

MikkoVihtakari commented 3 years ago

Thanks. I am/was using the functions here: https://github.com/MikkoVihtakari/ggOceanMaps/blob/master/R/vector_bathymetry.R

With the current versions, it takes up to 30 min to process one raster (if one uses a very large raster_bathymetry and all 3 smoothr functions). Without the smoothr functions, the vectorizing takes a few minutes at most.

Perhaps @edzer would have thoughts on how to retain the polygons in a vectorized version, or you could ask on StackOverflow; they tend to be very helpful with such things.

mstrimas commented 3 years ago

Ok, good to know, I didn't realize you were using smoothr in that way. I will say this package was not written for speed or for large datasets. I think I can make some tweaks that will help a bit, e.g. your suggestion to vectorize drop_crumbs(), but the smoothing is bound to be slow since it requires processing individual polygons, which I don't think can easily be sped up significantly unless it gets implemented in C or something like that. I'll keep looking into this, but if speed is critical you may want to put a warning in your package not to use smoothing for large datasets, or something to that effect.
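In the meantime, one workaround on the user side might be to spread the per-polygon work across several cores. A rough, untested sketch (mclapply() forks, so this assumes you're not on Windows; the number of chunks is arbitrary):

library(smoothr)
library(sf)
library(parallel)

data(jagged_polygons, package = "smoothr")
p <- rep(jagged_polygons$geometry, 100)

# split the features into one chunk per core, smooth each chunk in a
# separate process, then recombine; output order matches the input
n_cores <- 4
chunks <- split(seq_along(p), cut(seq_along(p), n_cores, labels = FALSE))
smoothed <- mclapply(chunks, function(i) smooth(p[i], method = "ksmooth"),
                     mc.cores = n_cores)
p_smooth <- do.call(c, smoothed)

This doesn't reduce the per-polygon cost, it just spreads it out, so it only helps when several cores are available.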

MikkoVihtakari commented 3 years ago

An lapply loop may help. I can look at it when I find time.
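Roughly what I have in mind, as an untested sketch (drop_one() below is just a stand-in for whatever drop_crumbs() does to a single feature). lapply() on its own is about as fast as the for loop, but it makes swapping in parallel::mclapply() or future.apply::future_lapply() a one-line change later:

library(smoothr)
library(sf)

data(jagged_polygons, package = "smoothr")
p <- rep(jagged_polygons$geometry[7], 100)
area_thresh <- units::set_units(200, km^2)

# hypothetical per-feature worker: keep only the pieces of one MULTIPOLYGON
# that meet the area threshold and recombine them
drop_one <- function(g, threshold, crs) {
  parts <- st_cast(st_sfc(g, crs = crs), "POLYGON")
  st_combine(parts[st_area(parts) >= threshold])
}

# lapply in place of the internal for loop, one result per input feature
kept <- lapply(p, drop_one, threshold = area_thresh, crs = st_crs(p))
p_dropped <- do.call(c, kept)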