r-spatial / sf

Simple Features for R
https://r-spatial.github.io/sf/
Other
1.34k stars 298 forks source link

st_intersection 30 times slower in 0.9_6 compared to 0.7_4 #1510

Closed geogale closed 3 years ago

geogale commented 4 years ago

Hi. I am running the following code:

library(sf)

download.file("https://www.dropbox.com/s/izol20rxf90apye/Areas.gpkg?dl=1",
              destfile = "Areas.gpkg" , mode='wb')
Boundary1 <- st_transform(st_read("Areas.gpkg"),crs=27700)

download.file("https://www.dropbox.com/s/z8pham6w9ldesfy/Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip?dl=1",
              destfile = "Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip" , mode='wb')
unzip("Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip", exdir = ".")
Boundary2 <- st_transform(st_read("Local_Authority_Districts__May_2020__Boundaries_UK_BFE.shp"),crs=27700)

Boundary_Intersection <- st_intersection(Boundary1,Boundary2)
Boundary_Intersection$Area <- as.numeric(st_area(Boundary_Intersection))

The intersection takes about 15 minutes to complete when using R 4.0.2 and sf 0.9_6 but only around 30 seconds when using R 3.4.0 and sf 0.7_4. Is there a reason why it is now a lot slower?

rsbivand commented 4 years ago

Please report GEOS versions for both cases - and your OS, with your installation choices (MacOS or Windows CRAN binaries, source install, etc.). Windows binaries are now using a more recent GEOS, as I think are MacOS. In both cases, sessionInfo() and sf::sf_extSoftVersion(), with a description of how you installed sf and if you built from source, how you installed GEOS.

geogale commented 4 years ago

sf 0.9_6 is using GDAL 3.8.0 and sf 0.7_4 is using GDAL 3.6.1. I am running this on Windows (although, I have 0.7_4/3.6.1 on a Mac and it's just as quick). I am doing this on a work machine that uses JFrog Artifactory to mirror Windows binaries - installing a package looks like this: install.packages("sf", depends=TRUE, type = "win.binary")

rsbivand commented 4 years ago

OK, you mean GEOS, not GDAL. So either something changed in sf, GEOS, or both. Let's see if @edzer can comment before proceeding. I'm assuming that the platforms are otherwise comparable.

geogale commented 4 years ago

Oops! I should have said GDAL 2.2.3 vs 3.0.4

rsbivand commented 4 years ago

I asked about GEOS, not GDAL. Even though sf may do topology predicates and operations through GDAL, it is the version of GEOS that matters. The older variant of CRAN Windows binaries uses https://github.com/rwinlib/gdal2 GEOS 3.6.1, the newer https://github.com/rwinlib/gdal3 GEOS 3.8.0.

edzer commented 4 years ago

sf uses GEOS directly, not through GDAL (with the exception of specifying ops in the query argument to st_read).

rsbivand commented 4 years ago

Sorry, didn't check. Do you know of obvious changes in GEOS or the sf interface to GEOS (in Rcpp?) since early 2018?

edzer commented 4 years ago

I don't see, and cannot recall, any changes made to st_intersection, neither the R nor the C++ code, since april 2019 (release of 0.7-4).

rsbivand commented 4 years ago

Shall I try to repllicate on Win10 - CRAN sf 0.9-6 and source install from GDAL2 of 0.7-4? Or do you have access?

edzer commented 4 years ago

That would be great - I don't have (simple) access to windows, and can't install 0.7-4 on my working machine (requires proj_api.h, no longer avail in PROJ 7).

rsbivand commented 4 years ago

On sf 0.9-7 with development GEOS (with NG overlay I think) - so it could be a validity handling issue - @geogale don't worry about this, we test as-yet unreleased GEOS etc. versions. This is a message, and the inputs and the output are all valid, F32 time 617 seconds:

CBR: result (after common-bits addition) is INVALID: Self-intersection at or near point 368576.69999999995 864696 (368576.69999999995343 864696)
rsbivand commented 4 years ago

On Win10, R 4.0.2 and sf 0.9-6 and GDAL3 GEOS 3.8.0 16.95 min (old T420s laptop) On Win10 old T420s laptop R 4.0.2 and sf 0.7-4 built locally with current Rcpp and Rtools4, GEOS 3.6.1 38 sec. So the regression is very real for this data set. It is not coming from different versions of Rcpp or R itself. I have not succeeded in installing sf 0.9-6 against rwinlib/GDAL2, I think I would have to install a pre-Rtools4 R and Rtools. I'll try to choose GEOS/PROJ/GDAL versions on F32 matching the un-degraded case, so GEOS 3.6.1, GDAL 2.2.3 and PROJ 4.9.3 or similar. Maybe trying on Debian/Ubuntu with GEOS 3.8.0 or 3.8.1 could cross-check to see if my 617 seconds (10.3 min on a much faster laptop) is caused by the NG overlay or was there before.

rsbivand commented 4 years ago

With GEOS 3.9.0dev, sf was 617 seconds for intersection, but rgeos with the same GEOS is 154 seconds for the same intersection (F32).

edzer commented 4 years ago
> system.time(Boundary_Intersection <- st_intersection(Boundary1,Boundary2))
   user  system elapsed 
 27.705   0.007  27.716 
Warning message:
attribute variables are assumed to be spatially constant throughout all geometries 
> sf_extSoftVersion()
          GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H 
       "3.6.0"        "2.2.4"        "4.8.0"        "false"        "false" 
> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux bullseye/sid

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sf_0.9-7

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3         class_7.3-15       crayon_1.3.4       dplyr_0.8.3       
 [5] assertthat_0.2.1   grid_3.6.2         R6_2.4.1           DBI_1.1.0         
 [9] magrittr_1.5       e1071_1.7-3        units_0.6-5        pillar_1.4.3      
[13] KernSmooth_2.23-16 rlang_0.4.4        tools_3.6.2        glue_1.3.1        
[17] purrr_0.3.3        compiler_3.6.2     pkgconfig_2.0.3    classInt_0.4-2    
[21] tidyselect_1.0.0   tibble_2.1.3      

indicates to me that the regression is not in sf...

rsbivand commented 4 years ago

And it is a big regression, isn't it? What are the timings for GEOS 3.8.1 (I'm stuck with 3.9.0dev at the moment - do you have 3.8.0 or 3.8.1 running)? It isn't rwinlib either.

edzer commented 4 years ago

With

GEOS 3.8.1, GDAL 3.1.3, PROJ 7.1.1

I see

> system.time(Boundary_Intersection <- st_intersection(Boundary1,Boundary2))
   user  system elapsed 
562.452   0.067 562.728 
Warning message:
attribute variables are assumed to be spatially constant throughout all geometries 

so, yeah, a factor 20.

olcaysah commented 4 years ago

FYI. Below are my values. I actually realized this issue around a month ago. Therefore, I was using st_intersects() instead.

sf_extSoftVersion() GEOS GDAL proj.4 GDAL_with_GEOS USE_PROJ_H "3.8.0" "3.0.4" "6.3.1" "true" "true" sessionInfo() R version 3.6.3 (2020-02-29) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 18363)

system.time(Boundary_Intersection <- st_intersection(Boundary1,Boundary2)) user system elapsed 1026.41 3.73 1027.08

Best, Olcay

On Tue, Oct 13, 2020 at 9:29 AM Edzer Pebesma notifications@github.com wrote:

With

GEOS 3.8.1, GDAL 3.1.3, PROJ 7.1.1

I see

system.time(Boundary_Intersection <- st_intersection(Boundary1,Boundary2)) user system elapsed 562.452 0.067 562.728 Warning message: attribute variables are assumed to be spatially constant throughout all geometries

so, yeah, a factor 20.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/r-spatial/sf/issues/1510#issuecomment-707777539, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARWS4CL7RIMCJ4NZWUW6YDSKRP3VANCNFSM4SOK4KTA .

-- OLCAY SAHIN

rsbivand commented 4 years ago

Oops: current rgeos with GEOS runtime version: 3.6.3-CAPI-1.10.3 runs 190 seconds, as against 154 seconds with rgeos 3.9.0dev - with overlay-NG I think (all F32, R 4.0.3). With GEOS 3.8.1 (also before overlay-NG), it takes 150 seconds. I'm hesitant to rebuild sf because GDAL is built with GEOS, so I'd have to rebuild GDAL too. I'm not sure that this isn't sf or perhaps Rcpp, which is needed for the interface? I recall that at some point (3.7.?)? it became more necessary to check for validity?

edzer commented 4 years ago

Or it's in parts that sf uses, but rgeos doesn't, like the STRtree?

rsbivand commented 4 years ago

Unfortunately, for R profiling, .Call gets 99.98% of the run time. How could we profile Rcpp::wrap(CPL_geos_op2()) or CPL_geos_op2(), or further back the internals of that function?

edzer commented 4 years ago

I tried compiling with -pg, but that doesn't work with a package; probably need to recompile entire R with -pg (if supported at all).

rsbivand commented 4 years ago

Maybe I can check STRtree separately in rgeos::gUnarySTRtreeQuery() to see? Doesn't look like a smoking gun:

> library(rgeos)
Loading required package: sp
rgeos version: 0.5-5, (SVN revision 640)
 GEOS runtime version: 3.9.0dev-CAPI-1.14.0 
 Linking to sp version: 1.4-4 
 Polygon checking: TRUE 

> load("GEOS_regression_objects.rda")
> system.time(res <- gUnarySTRtreeQuery(a_sp))
   user  system elapsed 
  0.113   0.002   0.116 
> system.time(res <- gUnarySTRtreeQuery(b_sp))
   user  system elapsed 
  0.777   0.016   0.796 
> system.time(res <- gBinarySTRtreeQuery(a_sp, b_sp))
   user  system elapsed 
  0.874   0.007   0.889 
> system.time(res <- gBinarySTRtreeQuery(b_sp, a_sp))
   user  system elapsed 
  0.882   0.013   0.901 
rsbivand commented 4 years ago

This https://stackoverflow.com/questions/23685206/is-my-using-gperftools-to-profile-a-r-script-with-rcpp-correct didn't look helpful. Ask Dirk?

edzer commented 4 years ago

sf 0.7-5 is the earliest version that works without proj_api.h; compiled against current GEOS/GDAL/PROJ it also has the regression:

library(sf)
# Linking to GEOS 3.8.1, GDAL 3.1.3, PROJ 7.1.1
utils::packageVersion("sf")
# [1] ‘0.7.5’
#download.file("https://www.dropbox.com/s/izol20rxf90apye/Areas.gpkg?dl=1",
#              destfile = "Areas.gpkg" , mode='wb')
Boundary1 <- st_transform(st_read("Areas.gpkg"),crs=27700)
# Reading layer `Areas' from data source `/tmp/Areas.gpkg' using driver `GPKG'
# Simple feature collection with 1000 features and 1 field
# geometry type:  MULTIPOLYGON
# dimension:      XY
# bbox:           xmin: -116.1928 ymin: 68218.7 xmax: 640329.7 ymax: 868949.7
# epsg (SRID):    27700
# proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs

#download.file("https://www.dropbox.com/s/z8pham6w9ldesfy/Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip?dl=1",
#              destfile = "Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip" , mode='wb')
#unzip("Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip", exdir = ".")
Boundary2 <- st_transform(st_read("Local_Authority_Districts__May_2020__Boundaries_UK_BFE.shp"),crs=27700)
# Reading layer `Local_Authority_Districts__May_2020__Boundaries_UK_BFE' from data source `/tmp/Local_Authority_Districts__May_2020__Boundaries_UK_BFE.shp' using driver `ESRI Shapefile'
# Simple feature collection with 379 features and 10 fields
# geometry type:  MULTIPOLYGON
# dimension:      XY
# bbox:           xmin: -116.1928 ymin: 5333.81 xmax: 655989 ymax: 1220302
# epsg (SRID):    27700
# proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs
system.time(Boundary_Intersection <- st_intersection(Boundary1, Boundary2))
#    user  system elapsed 
# 581.509   0.383 582.846 
# Warning message:
# attribute variables are assumed to be spatially constant throughout all geometries 

Is one of you able to time the sf 0.7-5 binary on Windows?

rsbivand commented 4 years ago

I'll try that in a moment and report back. The st_transform() steps are not needed as the input is in OSGB National Grid anyway.

However, after asking on geos-devel how to build with overlay-ng (I thought it was enabled in master by default but wasn't) https://lists.osgeo.org/pipermail/geos-devel/2020-October/009750.html, I have:

> library(rgeos)
Loading required package: sp
rgeos version: 0.5-5, (SVN revision 640)
 GEOS runtime version: 3.9.0dev-CAPI-1.14.0 
 Linking to sp version: 1.4-4 
 Polygon checking: TRUE 

> load("GEOS_regression_objects.rda")
> system.time(bI <- gIntersection(a_sp, b_sp, byid=TRUE))
   user  system elapsed 
 12.970   0.157  13.264 
> system.time(bI <- gIntersection(a_sp, b_sp, byid=TRUE))
   user  system elapsed 
 12.591   0.018  12.654 
> library(sf)
Linking to GEOS 3.9.0dev, GDAL 3.2.0dev-b51dc28ab7, PROJ 7.2.0
> a_sf <- st_as_sf(a_sp)
> b_sf <- st_as_sf(b_sp)
> system.time(bIa <- st_intersection(a_sf, b_sf))
   user  system elapsed 
 16.414   0.098  16.558 
Warning message:
attribute variables are assumed to be spatially constant throughout all geometries 
> packageVersion("sf")
[1] ‘0.9.7’

So maybe GEOS 3.7 introduced the regression but overlay-ng will not only remove it, but will also be faster. I still need to run checks and revdep checks.

Maybe 3.9 will build on Solaris (3.7 and 3.8 are broken)??

rsbivand commented 4 years ago

I can only install sf 0.7-5 on Windows with rwinlib/GDAL2, so PROJ 4.9.3. I tried to mix GDAL2 and GDAL3 yesterday without success, as GDAL2 is Rtools < 4, and GDAL3 >= 4 IIUC. I don't think the sf change is related to proj.h. I'll try to trap the regression by GEOS releases, I suspect 3.6. are fast, and will check whether the regression was 3.7. or 3.8. . Since 3.9.0dev is faster anyway (than 3.6.), maybe we'll be lucky?

rsbivand commented 4 years ago

I cannot detect the regression through rgeos, unfortunately:

> library(rgeos)
Loading required package: sp
rgeos version: 0.5-5, (SVN revision 640)
 GEOS runtime version: 3.6.0-CAPI-1.10.0
 Linking to sp version: 1.4-4
 Polygon checking: TRUE

> load("GEOS_regression_objects.rda")
> system.time(bI <- gIntersection(a_sp, b_sp, byid=TRUE))
   user  system elapsed
185.260   0.303 195.412

> library(rgeos)
Loading required package: sp
rgeos version: 0.5-5, (SVN revision 640)
 GEOS runtime version: 3.6.3-CAPI-1.10.3
 Linking to sp version: 1.4-4 
 Polygon checking: TRUE 

> load("GEOS_regression_objects.rda")
> system.time(bI <- gIntersection(a_sp, b_sp, byid=TRUE))

   user  system elapsed 
187.596   0.320 195.193

> library(rgeos)
Loading required package: sp
rgeos version: 0.5-5, (SVN revision 640)
 GEOS runtime version: 3.7.0-CAPI-1.11.0
 Linking to sp version: 1.4-4
 Polygon checking: TRUE

> load("GEOS_regression_objects.rda")
> system.time(bI <- gIntersection(a_sp, b_sp, byid=TRUE))
   user  system elapsed
183.152   0.261 191.275

To try with sf, I'd need to re-build GDAL for each GEOS version too. Looks more like container instances with a succession of version sets.

edzer commented 4 years ago

I built sf 0.7-5 in a docker against

> library(sf)
Linking to GEOS 3.6.0, GDAL 2.2.4, PROJ 4.8.0
> utils::packageVersion("sf")
[1] ‘0.7.5’

and get

> system.time(Boundary_Intersection <- st_intersection(Boundary1,Boundary2))
   user  system elapsed 
 28.033   0.000  28.035 
Warning message:
attribute variables are assumed to be spatially constant throughout all geometries 

This is the identical sf package code (downloaded from CRAN) that gave me 600 secs here. I'm a bit at a loss what to try next.

rsbivand commented 4 years ago

Could you try Overlay-NG - a recent nightly (I had 9 October on my desktop, master from git on gitea on laptop), ../configure --enable-overlayng, and see if you also get the speed-up I'm seeing. As rgeos seems not to see a regression for gIntersection(), maybe it isn't GEOS but an interaction between sf, Rcpp and GEOS (and here I think Rcpp is constant).

edzer commented 4 years ago

With overlayng-enabled GEOS compiled from github, and

> library(sf)
Linking to GEOS 3.9.0dev, GDAL 3.1.1, PROJ 7.1.0

I see

> system.time(Boundary_Intersection <- st_intersection(Boundary1,Boundary2))
   user  system elapsed 
 16.668   0.040  16.711 
Warning message:
attribute variables are assumed to be spatially constant throughout all geometries 
rsbivand commented 4 years ago

Very good! This is the same improvement that I see. My revdeps are running, so I'll see whether there are any breakages attributable to Overlay-NG.

edzer commented 4 years ago

And the other issue seems to be related to precision, default set in rgeos to 1e8, set to zero (not set) in sf. When setting to 1e8:

library(sf)
# Linking to GEOS 3.8.1, GDAL 3.1.3, PROJ 7.1.1
utils::packageVersion("sf")
# [1] ‘0.9.7’
#download.file("https://www.dropbox.com/s/izol20rxf90apye/Areas.gpkg?dl=1",
#              destfile = "Areas.gpkg" , mode='wb')
Boundary1 <- st_set_precision(st_transform(st_read("Areas.gpkg"),crs=27700),
    1e8)
# Reading layer `Areas' from data source `/tmp/Areas.gpkg' using driver `GPKG'
# Simple feature collection with 1000 features and 1 field
# geometry type:  MULTIPOLYGON
# dimension:      XY
# bbox:           xmin: -116.1928 ymin: 68218.7 xmax: 640329.7 ymax: 868949.7
# projected CRS:  OSGB 1936 / British National Grid

#download.file("https://www.dropbox.com/s/z8pham6w9ldesfy/Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip?dl=1",
#              destfile = "Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip" , mode='wb')
#unzip("Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip", exdir = ".")
Boundary2 <- st_set_precision(st_transform(st_read("Local_Authority_Districts__May_2020__Boundaries_UK_BFE.shp"),crs=27700),
    1e8)
# Reading layer `Local_Authority_Districts__May_2020__Boundaries_UK_BFE' from data source `/tmp/Local_Authority_Districts__May_2020__Boundaries_UK_BFE.shp' using driver `ESRI Shapefile'
# Simple feature collection with 379 features and 10 fields
# geometry type:  MULTIPOLYGON
# dimension:      XY
# bbox:           xmin: -116.1928 ymin: 5333.81 xmax: 655989 ymax: 1220302
# projected CRS:  OSGB 1936 / British National Grid
system.time(Boundary_Intersection <- st_intersection(Boundary1, Boundary2))
#    user  system elapsed 
#  22.916   0.038  22.962 
# Warning message:
# attribute variables are assumed to be spatially constant throughout all geometries 
edzer commented 4 years ago

For me, that closes the issue.

edzer commented 4 years ago

... but not before mentioning #1482 and @dbaston - this has taken both Roger and me several hours, not for the first time, and would not have been needed if sf had used the same default precision as rgeos.

geogale commented 4 years ago

For me, that closes the issue.

@edzer just to confirm my layperson understanding. Are you suggesting setting the precision to 1e8 for all st_intersectionoperations? Obviously with my example it's a known problem, but with other examples with larger volumes of data it might not be feasible to try with and without setting the precision to see if it makes a time difference.

dbaston commented 4 years ago

Yes, perturbing the inputs can have a large effect on algorithm runtime. Most of the time it will probably improve. In other cases it will cause the overlay to fail entirely or return an incorrect result.

rsbivand commented 4 years ago

With 1e8, 3.9.0dev runs 10 seconds, as against 17 seconds with default precision. So Overlay-NG seems to help anyway, but for < 3.9.0, users (of GEOS 3.7 and 3.8 ??) need to be advised to try setting precision to 1e8 if run times are excessive for binary (?) topological operations.

edzer commented 4 years ago

@geogale setting precision to 1e8 means (with coordinates having units m) that all coordinates are rounded to the nearest multiple of 10 nm before computing the intersection. As @dbaston mentions, it might be a good thing to always do so, but it also might be another cause for problems.

Robinlovelace commented 4 years ago

Following this thread with interest, don't understand every part of it but very grateful that this has been discovered and documented for the benefit of everyone working with medium-large datasets. From @edzer's comment above, I guess that this would mean running a command such as

st_precision(Boundary1) = 0.00000001

and the same for Boundary2 would speed-up the operation on machines that show the potential regression, as per https://r-spatial.github.io/sf/reference/st_precision.html

If a default setting like that has the potential for factor 30 speed-ups, and given most users will not know about let alone read this issue, my first impression is: in favour.

edzer commented 4 years ago

Robin, I set it to 1e8, not 1e-8. You can indeed set it to units::set_units(10, nm), which is equivalent to setting it to 1e8.

Robinlovelace commented 4 years ago

Aha, so:

st_precision(Boundary1) = 1e8

It's not clear to me from the example in https://r-spatial.github.io/sf/reference/st_precision.html what appropriate values are. Is it worth updating that help page with an example that is closer to a real-world scenario?

Robinlovelace commented 4 years ago

P.s. also just spotted that there is the equivalent code above.

Boundary2 <- st_set_precision(st_transform(st_read("Local_Authority_Districts__May_2020__Boundaries_UK_BFE.shp"),crs=27700),
    1e8)
edzer commented 4 years ago

I agree that for instance

"Local_Authority_Districts__May_2020__Boundaries_UK_BFE.shp" %>%
  st_read() %>%
  st_transform(crs=27700) %>%
  st_set_precision(units::set_units(10, nm)) ->
  Boundary2

is a more readable form.

The documentation of precision refers here, did you make it there?

Robinlovelace commented 4 years ago

Both are readable and it's partly a matter of style, I think style is a continuum and that both ways of writing it represents different ends of the continuum. I would probably have gone for an intermediary approach, but definitely wasn't aiming to comment on the style, just the substance.

In answer to your question, no I didn't make it there and probably should have checked. From there I can see the key quote:

to specify 3 decimal places of precision, use a scale factor of 1000...

Worth saying something like that in the docs for st_precision() also? Happy to give it a bash if that would help.

Robinlovelace commented 4 years ago

For what it's worth I had a think about how to type that 'tidy pipeline' and came up with this, a balance that is readable and concise?

f = "Local_Authority_Districts__May_2020__Boundaries_UK_BFE.shp"
Boundary2_original = st_read(f)
Boundary2 = Boundary2_original
  st_transform(crs=27700) %>%
  st_set_precision(units::set_units(10, nm)) 

I didn't know about the st_set_precision(units::set_units(10, nm)) notation. That is really neat!

Robinlovelace commented 4 years ago

Also as a heads-up @geogale, I wrote a function that you can point to one of the datasets provided by data.gov.uk/ESRI and it auto downloads/unzips the shapefile + tries to read it in a single command - take a look if of use/interest, I've been using it a lot of late: https://github.com/Robinlovelace/ukboundaries/blob/master/R/ukborders-package.R#L6

Reprex:

library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
u = "https://opendata.arcgis.com/datasets/910f48f3c4b3400aa9eb0af9f8989bbe_0.zip?outSR=%7B%22latestWkid%22%3A27700%2C%22wkid%22%3A27700%7D"
Boundary2_original = ukboundaries::duraz(u)
#> Using default data cache directory ~/.ukboundaries/cache 
#> Use cache_dir() to change it.
#> Reading layer `Local_Authority_Districts__May_2020__UK_BUC' from data source `/tmp/RtmpEfD8o1/reprex174185e970dd2/Local_Authority_Districts__May_2020__UK_BUC.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 379 features and 10 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -116.1928 ymin: 7054.1 xmax: 655653.8 ymax: 1218618
#> projected CRS:  OSGB 1936 / British National Grid
#> Removing the following files: Local_Authority_Districts__May_2020__UK_BUC.dbf Local_Authority_Districts__May_2020__UK_BUC.prj Local_Authority_Districts__May_2020__UK_BUC.shp Local_Authority_Districts__May_2020__UK_BUC.shx
Boundary2 = Boundary2_original %>% 
  st_transform(crs=27700) %>%
  st_set_precision(units::set_units(10, nm)) 
plot(Boundary2)

Created on 2020-10-14 by the reprex package (v0.3.0)

geogale commented 4 years ago

Thanks @Robinlovelace - I have no doubt that's been well used lately! I may be reading the data from the place where it lives before ending up in an Esri interface - but that would be telling!

rsbivand commented 4 years ago

Fun (?) fact: GEOS 3.9.0 with Overlay-NG generates the same slivers, but reports them in different order, see https://github.com/r-spatial/asdar-book.org/commit/144f3d573ee60db09347b7190a099e8b79e4c7ac. Same changed order of output breaks geozoning, https://github.com/uribo/kuniezu/issues/6, https://github.com/ropensci/plotly/issues/1862, sf itself https://github.com/r-spatial/sf/issues/1512, https://github.com/hypertidy/sfdct/issues/13.

dr-jts commented 3 years ago

Yes, perturbing the inputs (by using a precision value) can have a large effect on algorithm runtime. Most of the time it will probably improve. In other cases it will cause the overlay to fail entirely or return an incorrect result.

A note that this behaviour changes in GEOS 3.9 with OverlayNG. OverlayNG should generally run faster if floating (= no) precision is specified, since it can use faster noding algorithms. Conversely, specifying a precision invokes snap-rounding noding, which is generally slower (anywhere from 1.5x to 3x). So the recommendation to use a precision model should probably be revisited (once confirmed by experience by the sf community).