r-spatial / s2

Spherical Geometry Operators Using the S2 Geometry Library
https://r-spatial.github.io/s2/

expose ways of iterating over vectors of s2 geometries? #125

Closed - rsbivand closed this issue 2 years ago

rsbivand commented 3 years ago

I've started looking at supplementing the code in spdep for finding point neighbours within distance thresholds on the sphere through sf::st_is_within_distance(), which works OK. However, to return distances for the chosen neighbours on the sphere, at line 64 of https://github.com/r-spatial/spdep/blob/master/R/nbdists.R I iterate over subsets of points with lapply() (see also the examples at the foot of https://r-spatial.github.io/spdep/reference/dnearneigh.html). It would be better to iterate in compiled code rather than use lapply() - is LinkingTo feasible?
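
To make the pattern concrete, here is a minimal sketch of the per-point lapply() iteration being described (illustrative only, not the actual spdep code; pts is assumed to be an "s2_geography" vector of points and nb a list of integer neighbour ids):

library(s2)
# for each point i, compute distances (in metres) to its listed neighbours
dists <- lapply(seq_along(nb), function(i) {
  s2_distance(pts[i], pts[nb[[i]]])
})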

Finding k-nearest neighbours on the sphere looks even more complex for k > 1. Is there any such query in the s2 code itself? I looked at the documentation but couldn't see obvious opportunities, though I think they may be there somewhere, given that indexing is being used.

Current implementations use legacy C code for spdep::dnearneigh(), spdep::nbdists() and spdep::knearneigh() with Great Circle WGS84 ellipsoid distances and no indexing.

Any suggestions would be very welcome!

edzer commented 3 years ago

http://s2geometry.io/devguide/cpp/quickstart suggests that S2ClosestPointQuery can do both knn and distance-based queries, indexed.

paleolimbot commented 3 years ago

First, I think you might find s2::s2_closest_edges() useful - despite the name, it can do k-nearest neighbours (see https://github.com/r-spatial/s2/issues/111#issuecomment-835606024 ). It could be modified to be more useful if it doesn't do exactly what you're trying to do.

Second, while it will probably be a long time before it's stable enough to be public, I've started a C API for internal use ( https://github.com/r-spatial/s2/blob/master/src/s2-c-api.cpp ). If S2 is going to gain traction it needs drop-in replacements for GEOS functions. Right now I'm just poking away at that as I run into things I need to implement in C... if you have ideas for things you'd like to see there, I'd love to hear them.

edzer commented 3 years ago

Looks like it! Here is an example use:

library(sf)
# Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(s2)
up = function(n) {
  x = runif(n)
  y = runif(n)
  st_as_sf(data.frame(x = x, y = y), coords = c("x", "y")) 
}
u1 = up(1000)
s2 = st_as_s2(u1)
s2_closest_edges(s2[1:10], s2, k = 3)
# [[1]]
# [1] 28 85  1

# [[2]]
# [1] 688 406   2

# [[3]]
# [1]  57 958   3

# [[4]]
# [1] 760 975   4

# [[5]]
# [1] 586 224   5

# [[6]]
# [1] 429 661   6

# [[7]]
# [1] 795 664   7

# [[8]]
# [1] 205  75   8

# [[9]]
# [1] 274 502   9

# [[10]]
# [1] 370 487  10

rsbivand commented 3 years ago

Thanks - implemented and pushed to GitHub. With prototypes in place, I'll check timings (which look OK for distance and k-nearest neighbours, but not OK for finding the distances for known neighbours - I'll look at multicore as a way round) and differences from the legacy code.

paleolimbot commented 3 years ago

If you give me a reprex of what is slow, I can look into how to make it faster. It seems like a common and reasonable use case that s2 should be able to handle!

rsbivand commented 3 years ago

Thanks for being willing to take a look. This is an implicit reprex via remotes::install_github("r-spatial/spdep"):

library(spdep)
library(sf)
data(state)
us48.fipsno <- read.geoda(system.file("etc/weights/us48.txt",
 package="spdep")[1])
if (as.numeric(paste(version$major, version$minor, sep="")) < 19) {
 m50.48 <- match(us48.fipsno$"State.name", state.name)
} else {
 m50.48 <- match(us48.fipsno$"State_name", state.name)
}
xy <- as.matrix(as.data.frame(state.center))[m50.48,]
xy1 <- st_as_sf((as.data.frame(state.center))[m50.48,], coords=1:2,
  crs=st_crs("+proj=longlat +ellps=GRS80"))
old_use_s2 <- sf_use_s2()
sf_use_s2(TRUE)
gck1b <- knn2nb(knearneigh(xy1, k=1))
system.time(o <- nbdists(gck1b, xy1))
sf_use_s2(FALSE)
gck1c <- knn2nb(knearneigh(xy1, k=1))
system.time(o <- nbdists(gck1c, xy1))
sf_use_s2(old_use_s2)

and maybe debug(nbdists) to see:

https://github.com/r-spatial/spdep/blob/531b0d5e830653bb04197780f1cf4ffd1937aec7/R/nbdists.R#L61-L66

with a list of indices passed to nbdists() as "nb" (gck1* above) used to iterate over the distances to report (for point i, return the distances, finally in km, to the neighbour ids in nb[[i]]). It did get a good deal faster after doing the coercion from "sfc" to "s2" once, not n times, but the n subsets and runs of s2_distance() in R seem to be costly. Please ask if this isn't clear. I think the same look-ups are what gstat would need too, for non-default nmax=, nmin=, etc. in local kriging, after finding the k-nearest neighbours and where the metric distance is needed. In nbdists(), the "km" metric can be assigned in a separate lapply(), I think with little cost.

paleolimbot commented 3 years ago

I think you can get away with a single s2_distance() and a single units::set_units() call by flattening the neighbours object temporarily:

library(s2)
n <- 1e3
k <- 4
s2x <- as_s2_geography(s2_point(runif(n, -1, 1), runif(n, -1, 1), runif(n, -1, 1)))
nb <- s2_closest_edges(s2x, s2x, k + 1, min_distance = 0)

d1 <- function() {
  lapply(seq_along(nb), function(i) {
    s2::s2_distance(s2x[i], s2x[nb[[i]]])
  })
}

d2 <- function() {
  # you can abbreviate this if you know all the lengths are the same
  nb_flat_len <- vapply(nb, length, integer(1))
  dist_unnest <- s2::s2_distance(
    vctrs::vec_rep_each(s2x, nb_flat_len),
    s2x[unlist(nb)]
  )

  # maybe a faster way to do this nesting
  di_end <- cumsum(nb_flat_len)
  di_start <- c(1L, di_end[-length(di_end)] + 1L)
  d2 <- Map("[", list(dist_unnest), Map(":", di_start, di_end))
}

identical(d1(), d2())
#> [1] TRUE

bench::mark(d1(), d2())
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 d1()        93.24ms 101.96ms      9.94    6.38MB     17.9
#> 2 d2()         5.96ms   6.47ms    143.    529.28KB     11.9

Created on 2021-06-13 by the reprex package (v0.3.0)

rsbivand commented 3 years ago

Thanks! This is my 1970s Algol/Fortran playbook - a longer vector with start/end points - it should work fine. I'll need to work around length-1 neighbour vectors with 0L signalling no neighbours in the "nb" object, but that is commonplace. Teaching today; I'll report back later on progress.
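
A minimal sketch of the flattened version with that workaround, assuming the spdep convention that nb[[i]] is the single value 0L when point i has no neighbours (names are illustrative):

library(s2)
has_nb <- vapply(nb, function(x) length(x) > 1L || x[1] != 0L, logical(1))
lens <- lengths(nb[has_nb])
from <- rep(which(has_nb), lens)               # point index repeated once per neighbour
to <- unlist(nb[has_nb])                       # flattened neighbour ids
dist_flat <- s2_distance(s2x[from], s2x[to])   # one vectorized call, metres
dists <- vector("list", length(nb))
dists[has_nb] <- split(dist_flat, from)        # re-nest by point
dists[!has_nb] <- list(numeric(0))             # empty distances for no-neighbour points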

rsbivand commented 3 years ago

I have something of a puzzle:

> library(sf)
Linking to GEOS 3.10.0dev, GDAL 3.3.0, PROJ 8.0.1
> usc71k <- st_read("df_tracts.gpkg")
Reading layer `df_tracts' from data source 
  `/home/rsb/und/uam21/units_2/df_tracts.gpkg' using driver `GPKG'
Simple feature collection with 71353 features and 28 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
Geodetic CRS:  NAD83
> usc71k_pts <- st_centroid(st_geometry(usc71k), of_largest_polygon=TRUE)
> library(spdep)
Loading required package: sp
Loading required package: spData
> system.time(k1 <- knn2nb(knearneigh(usc71k_pts, k=1)))
   user  system elapsed 
  6.006   0.026   6.405 
> system.time(k6 <- knn2nb(knearneigh(usc71k_pts, k=6)))
   user  system elapsed 
  8.192   0.022   8.533 
> system.time(o <- nbdists(k1, usc71k_pts))
   user  system elapsed 
  0.583   0.001   0.594 
> (all.linked <- max(unlist(o)))
[1] 95.09906
> system.time(gc_nb <- dnearneigh(usc71k_pts, 0, all.linked+(.Machine$double.eps^(1/2))))
    user   system  elapsed 
3401.182    2.298 3431.844 
> st_write(usc71k_pts, "usc71k_pts.gpkg")
Writing layer `usc71k_pts' to data source `usc71k_pts.gpkg' using driver `GPKG'
Writing 71353 features with 0 fields and geometry type Point.

s2-based nbdists() and knearneigh() look pretty good (improved after dropping units, which turned out to be costly and not needed). But dnearneigh() is slower than the old GC by almost an order of magnitude, even after changing from st_is_within_distance() to s2_dwithin_matrix() (recent commit/push). Is s2_dwithin_matrix() using indexing? Points file (centroids of 71k US census tracts): usc71k_pts.zip

paleolimbot commented 3 years ago

I've also found s2_dwithin() (regular or matrix) to be slow! It should be using indexing but might not be in the way I would expect.

As a workaround, I've used s2_buffer_cells() + s2_intersects() to preselect prior to running s2_dwithin() (see https://argocanada.github.io/blog/posts/2021-06-04-finding-coastal-argo-trajectories/ , towards the top).
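
A minimal sketch of that preselect-then-refilter pattern, assuming s2x is an "s2_geography" vector of points (the distance and max_cells values are illustrative; the timings further down use the same approach):

library(s2)
buf <- s2_buffer_cells(s2x, distance = 96000, max_cells = 250)  # exterior covering of each buffer
cand <- s2_intersects_matrix(buf, s2x)                          # indexed candidate search
nb <- lapply(seq_along(cand), function(i) {
  cand[[i]][s2_dwithin(s2x[i], s2x[cand[[i]]], distance = 96000)]  # exact refilter
})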

rsbivand commented 3 years ago

Thanks. The case here is, however, finding neighbours within dist of each other. Maybe I'll instead try k-nearest neighbours and then subset by distance, because s2::s2_closest_edges() obviously does use the indexing. BruteForceMatrixPredicateOperator looks a bit brutal in cpp_s2_dwithin_matrix(), doesn't it?
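
A rough sketch of that k-nearest-then-filter alternative; it only works if k is chosen larger than the largest true neighbour count (the k and distance values are illustrative):

library(s2)
k_guess <- 50
cand <- s2_closest_edges(s2x, s2x, k = k_guess)   # indexed k-nearest search
nb <- lapply(seq_along(cand), function(i) {
  ci <- setdiff(cand[[i]], i)                     # drop the point itself
  ci[s2_distance(s2x[i], s2x[ci]) <= 96000]       # keep neighbours within the threshold
})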

paleolimbot commented 3 years ago

I think there's an easy option for that!

https://github.com/r-spatial/s2/blob/master/src/s2-matrix.cpp#L175

Can be updated to include

https://github.com/r-spatial/s2/blob/master/inst/include/s2/s2closest_edge_query.h#L135

paleolimbot commented 3 years ago

It might be too late to put this in version 1.0.6 (I'm running revdep checks now), but it will certainly make the next release.

paleolimbot commented 3 years ago

It's probably also worth checking if there's a faster way of doing the current dwithin, which should be much faster than it is.

https://github.com/r-spatial/s2/blob/1162a252d7b2953ccff1ab3df5541165af063c9c/src/s2-predicates.cpp#L134-L136

rsbivand commented 3 years ago

OK - I'll experiment with k-nearest neighbours, but that could be messy with this data set: there are very many distance-based neighbours in the urban areas when the threshold is the longest first-nearest-neighbour distance. I'll also look at buffering and intersecting, which seemed wasteful on a small data set but may use indexing.

rsbivand commented 3 years ago

With s2_buffer_cells() and s2_intersects_matrix(), the timing is still 2x the legacy GC distances, but that is effectively parity, because the legacy code uses symmetry to halve the number of measured distances; s2_buffer_cells() needs a good deal of memory for 71k buffers:

> library(sf)
Linking to GEOS 3.10.0dev, GDAL 3.3.0, PROJ 8.0.1
> usc71k_pts <- st_read("usc71k_pts.gpkg")
Reading layer `usc71k_pts' from data source 
  `/home/rsb/und/uam21/units_2/usc71k_pts.gpkg' using driver `GPKG'
Simple feature collection with 71353 features and 0 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -124.6272 ymin: 24.54924 xmax: -67.01773 ymax: 48.9887
Geodetic CRS:  NAD83
> library(spdep)
Loading required package: sp
Loading required package: spData
> system.time(k1 <- knn2nb(knearneigh(usc71k_pts, k=1)))
   user  system elapsed 
  5.483   0.043   5.782 
> system.time(k6 <- knn2nb(knearneigh(usc71k_pts, k=6)))
   user  system elapsed 
  7.136   0.005   7.237 
> system.time(o <- nbdists(k1, usc71k_pts))
   user  system elapsed 
  0.570   0.009   0.582 
> (all.linked <- max(unlist(o)))
[1] 95.09906
> system.time(gc_nb <- dnearneigh(usc71k_pts, 0, all.linked+(.Machine$double.eps^(1/2))))
    user   system  elapsed 
1183.389    1.912 1193.172 
> sf_use_s2()
[1] TRUE
> sf_use_s2(FALSE)
Spherical geometry (s2) switched off
> system.time(gc_nb1 <- dnearneigh(usc71k_pts, 0, all.linked+(.Machine$double.eps^(1/2))))
   user  system elapsed 
613.710   0.003 616.100 
> table(table(card(gc_nb)))

  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
761 470 328 243 184 155 140 120  97 114  75  94  69  83  78  51  56  44  55  48 
 21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
 51  40  38  29  43  37  33  28  23  18  25  23  12  20  24  19  18  13  18  26 
 41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
 18  23  20  19  16  13  25  17  15  27  31  17  15  28  14  22  20  19  13  13 
 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
 12  16  13  14  11  10  14  16   6   5   9   5   7   5   5   6   7   3   1   6 
 81  82  83  84  85  86  87  88  89  91  92  93  95  96  97  98  99 100 101 102 
  2   2   1   4   2   5   1   3   3   2   3   1   1   3   1   1   1   1   1   1 
103 104 105 107 111 112 116 121 133 141 177 202 247 
  2   3   2   1   1   1   1   1   1   1   1   1   1 
> table(table(card(gc_nb1)))

  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
785 481 312 225 165 168 139 112 122 101  89  86  75  75  55  69  54  46  37  49 
 21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
 57  31  41  37  45  32  27  28  43  23  27  11  21  21  20  16  14  11  15  21 
 41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
 23  24  22  14  18  25  22  13  27  21  25  17  24  16  23  13  19  13  14  21 
 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
 14  10  10  16   7  14  16   6   7   8  12   6   3   8   7   5   1   7   4   4 
 81  82  83  84  85  86  87  88  89  90  91  92  95  96  97  98  99 100 101 102 
  4   4   2   4   4   4   1   2   3   3   2   4   3   1   2   1   2   1   1   1 
103 105 109 112 113 114 121 124 131 145 215 221 237 
  1   1   1   1   1   1   1   1   1   1   1   1   1 

rsbivand commented 3 years ago

The buffering can be sped up by reducing max_cells=, but that doesn't really make sense as it affects accuracy. Finding out how to benefit from indexing to subset the by-geometry distance calculations, and how to use distance symmetry, would be interesting at some point.

edzer commented 3 years ago

but it doesn't really make sense as it affects accuracy.

Not sure: I believe that the cells are guaranteed to include the actual (smooth) buffer, so fewer cells = larger area.

paleolimbot commented 3 years ago

Yes, the buffered area is the exterior covering as currently implemented. I'm planning on implementing an interior covering as well when there's a vector class for an S2CellUnion to match the vector class for S2Cell. This should reduce memory usage for this technique considerably!

rsbivand commented 3 years ago

Yes, a larger buffer area seems to mean that points more than dist from the point of interest are included by intersection with the buffer, doesn't it?

paleolimbot commented 3 years ago

Yes...you'll have to refilter the points using s2_dwithin() afterward (hopefully much faster).

rsbivand commented 3 years ago

Yes, this looks possible:

> library(sf)
Linking to GEOS 3.10.0dev, GDAL 3.3.0, PROJ 8.0.1
> usc71k_pts <- st_read("usc71k_pts.gpkg")
Reading layer `usc71k_pts' from data source 
  `/home/rsb/und/uam21/units_2/usc71k_pts.gpkg' using driver `GPKG'
Simple feature collection with 71353 features and 0 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -124.6272 ymin: 24.54924 xmax: -67.01773 ymax: 48.9887
Geodetic CRS:  NAD83
> s2x <- st_as_s2(usc71k_pts)
> system.time(buf250 <- s2::s2_buffer_cells(s2x, dist=96000, max_cells=250))
   user  system elapsed 
217.071   0.008 217.662 
> system.time(i250 <- s2::s2_intersects_matrix(buf250, s2x))
   user  system elapsed 
151.617   0.000 151.948 
> system.time(i250_clip <- lapply(seq_along(i250), function(i) {
+ i250[[i]][s2::s2_dwithin(s2x[i], s2x[i250[[i]]], dist=96000)]}))
   user  system elapsed 
 61.187   0.000  61.341 

For 71k points, buffering with max_cells=500 took 450s and max_cells=250 took 220s, with the intersects_matrix and dwithin-inside-lapply times not varying much. Can intersects_matrix or buffer_cells be parallelized? I think so. With max_cells=150 the buffer takes 115s, intersects increases to 165s, and lapply/dwithin is 63s, so multicore looks feasible with maybe max_cells=200.
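
The per-point refilter is embarrassingly parallel and returns only plain integer vectors (no s2 pointers cross the fork boundary), so a minimal mclapply() variant of the lapply() above would be (core count illustrative):

library(parallel)
i250_clip <- mclapply(seq_along(i250), function(i) {
  i250[[i]][s2::s2_dwithin(s2x[i], s2x[i250[[i]]], dist=96000)]
}, mc.cores = 4L)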

paleolimbot commented 3 years ago

Perhaps wait a tiny bit for me to catch up on the additions I mentioned above... the cell union in particular I think will make most of this a moot point! At the very least it is worth trying before delving into the complexities of multicore.

rsbivand commented 3 years ago

In the embarrassingly parallel cases, mclapply() did help, but you can see the power of indexing in the k-nearest neighbour case at the foot:

> library(sf)
Linking to GEOS 3.10.0dev, GDAL 3.3.0, PROJ 8.0.1
> usc71k_pts <- st_read("usc71k_pts.gpkg")
Reading layer `usc71k_pts' from data source 
  `/home/rsb/und/uam21/units_2/usc71k_pts.gpkg' using driver `GPKG'
Simple feature collection with 71353 features and 0 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -124.6272 ymin: 24.54924 xmax: -67.01773 ymax: 48.9887
Geodetic CRS:  NAD83
> library(spdep)
Loading required package: sp
Loading required package: spData
> set.coresOption(4L)
NULL
# 4 cores for buffering (need to output to a list of 4 `"sfc"` objects as the s2 pointers get lost on return),
# intersects_matrix and dwithin on the points intersecting with the buffers:
> system.time(nb_buf4 <- dnearneigh(usc71k_pts, 0, 96, avoid_s2=FALSE, dwithin=FALSE))
   user  system elapsed 
427.038   7.204 141.189 
# single core legacy 
> system.time(nb_leg <- dnearneigh(usc71k_pts, 0, 96, avoid_s2=TRUE))
   user  system elapsed 
548.467   0.017 549.419 
# 4 core dwithin_matrix, even slower than single core legacy
> system.time(nb_dw4 <- dnearneigh(usc71k_pts, 0, 96, avoid_s2=FALSE, dwithin=TRUE))
    user   system  elapsed 
3395.818    5.389  871.282 
# buffer/intersects_matrix/dwithin gives same neighbours as dwithin_matrix
> all.equal(nb_buf4, nb_dw4, check.attributes=FALSE)
[1] TRUE
> nb_dw4
Neighbour list object:
Number of regions: 71353 
Number of nonzero links: 83035752 
Percentage nonzero weights: 1.63095 
Average number of links: 1163.732 
# single core knn with silly k
> system.time(nb_k1163 <- knearneigh(usc71k_pts, k=1163))
   user  system elapsed 
 59.290   0.228  60.006 
# single core knn with realistic k
> system.time(nb_k1163 <- knearneigh(usc71k_pts, k=6))
   user  system elapsed 
  5.121   0.000   5.139 

Both knn runs are far faster than legacy; distances between neighbours are also much faster. So now there is a clear basis for testing changes in s2 - these were run on the repo state as of this morning CEST. Using the input 71k tract boundaries on a single core, and guiding legacy snapped contiguity detection by sf intersects between bounding boxes to establish candidate lists, takes 25s with an NA crs. Maybe I should revisit this and assign back the input GEOGCRS before doing the intersects - done: no particular increase in time using s2_intersects_matrix(), maybe 5s more; it does find fewer neighbours, though.

rsbivand commented 3 years ago

Following up and reading https://github.com/r-spatial/discuss/issues/43, and passing s2_model="closed" to sf::st_intersects.sfc(), I see that the differences occur where, for non-s2 settings, bounding box intersections are found and candidate neighbours listed, but they are missed if the edges of the bounding boxes only touch, as with the small south-eastern tract in the attached screenshots: it is a candidate neighbour with an NA or planar crs, but not a candidate neighbour when s2_intersects_matrix() is used in st_intersects().

Reprex: bbs.zip

library(sf)
bbs <- st_read("bbs.gpkg")
(s2_case <- sf::st_intersects(bbs))
s2::s2_intersects_matrix(bbs, bbs)
st_crs(bbs) <- as.character(NA)
(NA_case <- sf::st_intersects(bbs))

The 6th bounding box does not intersect the 1st bounding box with a longlat crs, but does with an NA crs.

> (s2_case <- sf::st_intersects(bbs))
Sparse geometry binary predicate list of length 6, where the predicate
was `intersects'
 1: 1, 2, 3, 4, 5
 2: 1, 2, 3, 5, 6
 3: 1, 2, 3, 4
 4: 1, 3, 4, 5
 5: 1, 2, 4, 5, 6
 6: 2, 5, 6
> st_crs(bbs) <- as.character(NA)
> (NA_case <- sf::st_intersects(bbs))
Sparse geometry binary predicate list of length 6, where the predicate
was `intersects'
 1: 1, 2, 3, 4, 5, 6
 2: 1, 2, 3, 5, 6
 3: 1, 2, 3, 4
 4: 1, 3, 4, 5
 5: 1, 2, 4, 5, 6
 6: 1, 2, 5, 6

Is this just the s2_model setting? I've tried most of the values in ?s2::s2_snap_level - or am I missing other possibilities?

paleolimbot commented 3 years ago

I'll look properly in a bit, but could it be that the edges of the bounding box are being interpreted as geodesic edges (as is the case in s2)? You could try a segmentize with NA crs before the intersection to test.
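
A sketch of that test, under my reading of the suggestion (the dfMaxLength value is illustrative): densify the box edges as straight lines in lon/lat with the crs removed, then restore the crs so the intersection runs through s2 on the densified edges.

library(sf)
bbs <- st_read("bbs.gpkg")
crs_orig <- st_crs(bbs)
st_crs(bbs) <- NA                                     # planar edges for densification
bbs_seg <- st_segmentize(bbs, dfMaxLength = 0.001)    # dfMaxLength in degrees here
st_crs(bbs_seg) <- crs_orig                           # back to geodetic for the s2 intersection
st_intersects(bbs_seg)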

rsbivand commented 3 years ago

Thanks. Further, these are the bounding box corner points:

(screenshot of the bounding box corner points attached)

I tried st_segmentize() with decreasing values of dfMaxLength=, but the gains in intersection accuracy are small, and memory use increases rapidly. On balance, I think that since many US census tracts have grid-aligned edges, they are a very hard case, as the bounding boxes do not overlap as much as they do for more irregular geometries. Interesting to try, though.

rsbivand commented 3 years ago

I looked more carefully, and if no bounding box intersection is used - just st_intersects() on the input polygons - the problem is resolved in both the planar and spherical cases. So the default will be to use "sfc" polygons (or coerce to them), run st_intersects(), remove the repeated contiguities in the lower triangle, and check for 1 (Queen) or 2 (Rook) shared boundary points within snap=.
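
A minimal sketch of the candidate-list step just described (the snap-based boundary point check is omitted; object names are illustrative):

library(sf)
pols <- st_geometry(usc71k)     # "sfc" polygons, or coerce to them
cand <- st_intersects(pols)     # indexed candidate lists (s2 or GEOS depending on sf_use_s2())
# keep only the upper triangle; contiguity is symmetric
cand <- lapply(seq_along(cand), function(i) cand[[i]][cand[[i]] > i])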

paleolimbot commented 3 years ago

That's good to know - it makes sense that cartesian bounding boxes are unlikely to play nicely with s2. An s2 cell union (e.g., s2_buffer_cells()) is a more spherical-friendly way to simplify an arbitrary polygon if you need to do so.

paleolimbot commented 2 years ago

I've opened up some issues for the action items in here...open up other issues if I missed any!