http://s2geometry.io/devguide/cpp/quickstart suggests that S2ClosestPointQuery can do both knn and distance-based queries, indexed.
First, I think you might find s2::s2_closest_edges() useful, which, despite the name, can do k-nearest neighbours (see https://github.com/r-spatial/s2/issues/111#issuecomment-835606024). It could be modified to be more useful if it doesn't do exactly what you're trying to do.
Second, while it will probably be a long time before it's stable enough to be public, I've started a C API for internal use ( https://github.com/r-spatial/s2/blob/master/src/s2-c-api.cpp ). If S2 is going to gain traction it needs drop-in GEOS functions. Right now I'm just poking away at that as I run into things I need to implement in C... if there are functions you'd like to see there, I'd love to hear about them.
Looks like it! Here is an example use:
library(sf)
# Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(s2)
up = function(n) {
  x = runif(n)
  y = runif(n)
  st_as_sf(data.frame(x = x, y = y), coords = c("x", "y"))
}
u1 = up(1000)
s2 = st_as_s2(u1)
s2_closest_edges(s2[1:10], s2, k = 3)
# [[1]]
# [1] 28 85 1
# [[2]]
# [1] 688 406 2
# [[3]]
# [1] 57 958 3
# [[4]]
# [1] 760 975 4
# [[5]]
# [1] 586 224 5
# [[6]]
# [1] 429 661 6
# [[7]]
# [1] 795 664 7
# [[8]]
# [1] 205 75 8
# [[9]]
# [1] 274 502 9
# [[10]]
# [1] 370 487 10
Thanks, implemented and pushed to GitHub. With prototypes in place, I'll check timings (which look OK for distance and k-nearest neighbours, but not OK for finding the distances for known neighbours - I'll look at multicore as a way round that) and differences from the legacy code.
If you give me a reprex of what is slow I can look into how to make it faster. Seems like a common/reasonable use case to make sure s2 can handle!
Thanks for being willing to take a look. This is an implicit reprex via remotes::install_github("r-spatial/spdep"):
library(spdep)
library(sf)  # needed below for st_as_sf(), st_crs() and sf_use_s2()
data(state)
us48.fipsno <- read.geoda(system.file("etc/weights/us48.txt",
package="spdep")[1])
if (as.numeric(paste(version$major, version$minor, sep="")) < 19) {
m50.48 <- match(us48.fipsno$"State.name", state.name)
} else {
m50.48 <- match(us48.fipsno$"State_name", state.name)
}
xy <- as.matrix(as.data.frame(state.center))[m50.48,]
xy1 <- st_as_sf((as.data.frame(state.center))[m50.48,], coords=1:2,
crs=st_crs("+proj=longlat +ellps=GRS80"))
old_use_s2 <- sf_use_s2()
sf_use_s2(TRUE)
gck1b <- knn2nb(knearneigh(xy1, k=1))
system.time(o <- nbdists(gck1b, xy1))
sf_use_s2(FALSE)
gck1c <- knn2nb(knearneigh(xy1, k=1))
system.time(o <- nbdists(gck1c, xy1))
sf_use_s2(old_use_s2)
and maybe debug(nbdists) to see:
https://github.com/r-spatial/spdep/blob/531b0d5e830653bb04197780f1cf4ffd1937aec7/R/nbdists.R#L61-L66
with a list of indices passed to nbdists() as "nb" (gck1* above) used to iterate over the distances to report: for point i, return the distances (finally in "km") to the neighbour ids in nb[[i]]. It did get a good deal faster after doing the coercion from "sfc" to "s2" once rather than n times, but the n subsets and runs of s2_distance() in R seem to be costly. Please ask if this isn't clear. I think the same look-ups are what gstat would need too, for non-default nmax=, min= etc. in local kriging, after finding the k-nearest neighbours and where the metric distance is needed. In nbdists(), the "km" metric can be assigned in a separate lapply(), I think with little cost.
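A minimal sketch of what I mean by that separate step, with hypothetical points and a hand-made nb list of neighbour ids:
library(s2)
pts <- as_s2_geography(c("POINT (0 0)", "POINT (1 0)", "POINT (0 1)"))
nb <- list(c(2L, 3L), c(1L, 3L), c(1L, 2L))
# per-point distances in metres, as nbdists() does now
dists_m <- lapply(seq_along(nb), function(i) s2_distance(pts[i], pts[nb[[i]]]))
# assigning the "km" metric afterwards is a single cheap extra pass
dists_km <- lapply(dists_m, function(d) d / 1000)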
I think you can get away with a single s2_distance() and a single units::set_units() call by flattening the neighbours object temporarily:
library(s2)
n <- 1e3
k <- 4
s2x <- as_s2_geography(s2_point(runif(n, -1, 1), runif(n, -1, 1), runif(n, -1, 1)))
nb <- s2_closest_edges(s2x, s2x, k + 1, min_distance = 0)
d1 <- function() {
  lapply(seq_along(nb), function(i) {
    s2::s2_distance(s2x[i], s2x[nb[[i]]])
  })
}
d2 <- function() {
  # this can be abbreviated if you know all the lengths are the same
  nb_flat_len <- vapply(nb, length, integer(1))
  dist_unnest <- s2::s2_distance(
    vctrs::vec_rep_each(s2x, nb_flat_len),
    s2x[unlist(nb)]
  )
  # there may be a faster way to do this re-nesting
  di_end <- cumsum(nb_flat_len)
  di_start <- c(1L, di_end[-length(di_end)] + 1L)
  Map("[", list(dist_unnest), Map(":", di_start, di_end))
}
identical(d1(), d2())
#> [1] TRUE
bench::mark(d1(), d2())
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 d1() 93.24ms 101.96ms 9.94 6.38MB 17.9
#> 2 d2() 5.96ms 6.47ms 143. 529.28KB 11.9
Created on 2021-06-13 by the reprex package (v0.3.0)
Thanks! This is my 1970s Algol/Fortran playbook: a longer vector with start/end points - it should work fine. I'll need to work around length-1 neighbour vectors, with 0L signalling no-neighbours on the "nb" object, but that is commonplace. Teaching today, will report back later on progress.
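A minimal sketch of the workaround I have in mind, with a hypothetical tiny "nb" list containing a 0L no-neighbour entry (only the entries with real neighbours get flattened):
library(s2)
s2x <- as_s2_geography(c("POINT (0 0)", "POINT (1 0)", "POINT (5 5)"))
nb <- list(2L, 1L, 0L)  # 0L signals no neighbours for point 3
has_nb <- vapply(nb, function(x) x[1L] != 0L, logical(1))
nb_len <- lengths(nb[has_nb])
# flatten only the entries that have real neighbours
d_flat <- s2_distance(
  vctrs::vec_rep_each(s2x[has_nb], nb_len),
  s2x[unlist(nb[has_nb])]
)
# re-nest, putting empty results back for the 0L entries
d <- vector("list", length(nb))
d[has_nb] <- split(d_flat, rep(seq_along(nb_len), nb_len))
d[!has_nb] <- list(numeric(0))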
I have something of a puzzle:
> library(sf)
Linking to GEOS 3.10.0dev, GDAL 3.3.0, PROJ 8.0.1
> usc71k <- st_read("df_tracts.gpkg")
Reading layer `df_tracts' from data source
`/home/rsb/und/uam21/units_2/df_tracts.gpkg' using driver `GPKG'
Simple feature collection with 71353 features and 28 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
Geodetic CRS: NAD83
> usc71k_pts <- st_centroid(st_geometry(usc71k), of_largest_polygon=TRUE)
> library(spdep)
Loading required package: sp
Loading required package: spData
> system.time(k1 <- knn2nb(knearneigh(usc71k_pts, k=1)))
user system elapsed
6.006 0.026 6.405
> system.time(k6 <- knn2nb(knearneigh(usc71k_pts, k=6)))
user system elapsed
8.192 0.022 8.533
> system.time(o <- nbdists(k1, usc71k_pts))
user system elapsed
0.583 0.001 0.594
> (all.linked <- max(unlist(o)))
[1] 95.09906
> system.time(gc_nb <- dnearneigh(usc71k_pts, 0, all.linked+(.Machine$double.eps^(1/2))))
user system elapsed
3401.182 2.298 3431.844
> st_write(usc71k_pts, "usc71k_pts.gpkg")
Writing layer `usc71k_pts' to data source `usc71k_pts.gpkg' using driver `GPKG'
Writing 71353 features with 0 fields and geometry type Point.
s2-based nbdists() and knearneigh() look pretty good (improved after dropping units, which turned out to be costly and not needed). But dnearneigh() is slower than the old GC code by almost an order of magnitude, even after changing from st_is_within_distance() to s2_dwithin_matrix() (recent commit/push). Is s2_dwithin_matrix() using indexing? Points file (centroids of 71k US census tracts):
usc71k_pts.zip
I've also found s2_dwithin() (regular or matrix) to be slow! It should be using indexing, but perhaps not in the way I would expect.
As a workaround, I've used s2_buffer_cells() + s2_intersects() to preselect prior to running s2_dwithin() (see https://argocanada.github.io/blog/posts/2021-06-04-finding-coastal-argo-trajectories/ , towards the top).
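In outline the pattern is roughly this (a sketch with made-up points and a made-up distance; max_cells is deliberately coarse because the exact check happens afterward):
library(s2)
set.seed(1)
pts <- as_s2_geography(s2_point(runif(50, -1, 1), runif(50, -1, 1), runif(50, -1, 1)))
dist_m <- 2e5
# preselect candidates with a coarse buffer + indexed intersects...
buf <- s2_buffer_cells(pts, distance = dist_m, max_cells = 100)
cand <- s2_intersects_matrix(buf, pts)
# ...then refilter the candidates with the exact predicate
nb <- lapply(seq_along(cand), function(i) {
  ci <- cand[[i]]
  ci[s2_dwithin(pts[i], pts[ci], distance = dist_m)]
})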
Thanks. The case here is, however, finding neighbours within dist of each other. Maybe I'll try k-nearest neighbours and then subset by distance instead, because s2::s2_closest_edges() obviously does use the indices. BruteForceMatrixPredicateOperator looks a bit brutal in cpp_s2_dwithin_matrix(), doesn't it?
I think there's an easy option for that!
https://github.com/r-spatial/s2/blob/master/src/s2-matrix.cpp#L175
can be updated to include
https://github.com/r-spatial/s2/blob/master/inst/include/s2/s2closest_edge_query.h#L135
It might be too late to put this in version 1.0.6 (I'm running revdep checks now), but it will certainly make the next release.
It's probably also worth checking if there's a faster way of doing the current dwithin, which should be much faster than it is.
OK - I'll experiment with k-nearest neighbours, but that could get messy with this data set: there are very many distance neighbours in the urban areas compared with the longest distance between first nearest neighbours. I'll also look at buffering and intersecting, which seemed wasteful on a small data set but may use indexing.
With s2_buffer_cells() and s2_intersects_matrix(), timings are still 2x the legacy GC distances, but that is effectively parity, because the legacy code uses symmetry to halve the number of measured distances; s2_buffer_cells() needs a good deal of memory for 71k buffers:
> library(sf)
Linking to GEOS 3.10.0dev, GDAL 3.3.0, PROJ 8.0.1
> usc71k_pts <- st_read("usc71k_pts.gpkg")
Reading layer `usc71k_pts' from data source
`/home/rsb/und/uam21/units_2/usc71k_pts.gpkg' using driver `GPKG'
Simple feature collection with 71353 features and 0 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -124.6272 ymin: 24.54924 xmax: -67.01773 ymax: 48.9887
Geodetic CRS: NAD83
> library(spdep)
Loading required package: sp
Loading required package: spData
> system.time(k1 <- knn2nb(knearneigh(usc71k_pts, k=1)))
user system elapsed
5.483 0.043 5.782
> system.time(k6 <- knn2nb(knearneigh(usc71k_pts, k=6)))
user system elapsed
7.136 0.005 7.237
> system.time(o <- nbdists(k1, usc71k_pts))
user system elapsed
0.570 0.009 0.582
> (all.linked <- max(unlist(o)))
[1] 95.09906
> system.time(gc_nb <- dnearneigh(usc71k_pts, 0, all.linked+(.Machine$double.eps^(1/2))))
user system elapsed
1183.389 1.912 1193.172
> sf_use_s2()
[1] TRUE
> sf_use_s2(FALSE)
Spherical geometry (s2) switched off
> system.time(gc_nb1 <- dnearneigh(usc71k_pts, 0, all.linked+(.Machine$double.eps^(1/2))))
user system elapsed
613.710 0.003 616.100
> table(table(card(gc_nb)))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
761 470 328 243 184 155 140 120 97 114 75 94 69 83 78 51 56 44 55 48
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
51 40 38 29 43 37 33 28 23 18 25 23 12 20 24 19 18 13 18 26
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
18 23 20 19 16 13 25 17 15 27 31 17 15 28 14 22 20 19 13 13
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
12 16 13 14 11 10 14 16 6 5 9 5 7 5 5 6 7 3 1 6
81 82 83 84 85 86 87 88 89 91 92 93 95 96 97 98 99 100 101 102
2 2 1 4 2 5 1 3 3 2 3 1 1 3 1 1 1 1 1 1
103 104 105 107 111 112 116 121 133 141 177 202 247
2 3 2 1 1 1 1 1 1 1 1 1 1
> table(table(card(gc_nb1)))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
785 481 312 225 165 168 139 112 122 101 89 86 75 75 55 69 54 46 37 49
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
57 31 41 37 45 32 27 28 43 23 27 11 21 21 20 16 14 11 15 21
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
23 24 22 14 18 25 22 13 27 21 25 17 24 16 23 13 19 13 14 21
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
14 10 10 16 7 14 16 6 7 8 12 6 3 8 7 5 1 7 4 4
81 82 83 84 85 86 87 88 89 90 91 92 95 96 97 98 99 100 101 102
4 4 2 4 4 4 1 2 3 3 2 4 3 1 2 1 2 1 1 1
103 105 109 112 113 114 121 124 131 145 215 221 237
1 1 1 1 1 1 1 1 1 1 1 1 1
The buffering can be sped up by reducing max_cells=, but that doesn't really make sense as it affects accuracy. Finding out how to benefit from indexing to subset the by-geometry distance calculations, and how to use distance symmetry, would be interesting at some point.
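The symmetry idea in outline, as a rough sketch with hypothetical points and a hand-made symmetric nb list (each pair is measured once and mirrored):
library(s2)
pts <- as_s2_geography(c("POINT (0 0)", "POINT (1 0)", "POINT (0 1)", "POINT (1 1)"))
nb <- list(c(2L, 3L), c(1L, 4L), c(1L, 4L), c(2L, 3L))  # symmetric neighbour list
# keep only the upper triangle (j > i) and measure each pair once
ij <- do.call(rbind, lapply(seq_along(nb), function(i) {
  j <- nb[[i]][nb[[i]] > i]
  if (length(j)) cbind(i, j) else NULL
}))
d_half <- s2_distance(pts[ij[, 1]], pts[ij[, 2]])
# mirror so that both (i, j) and (j, i) get the same distance
ii <- c(ij[, 1], ij[, 2]); jj <- c(ij[, 2], ij[, 1]); dd <- c(d_half, d_half)
dlist <- lapply(seq_along(nb), function(i) dd[ii == i][match(nb[[i]], jj[ii == i])])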
but it doesn't really make sense as it affects accuracy.
Not sure: I believe that the cells are guaranteed to include the actual (smooth) buffer, so fewer cells = larger area.
Yes, the buffered area is the exterior covering as currently implemented. I'm planning on implementing an interior covering as well when there's a vector class for an S2CellUnion to match the vector class for S2Cell. This should reduce memory usage for this technique considerably!
Yes, a larger buffer area seems to mean that points more than dist from the point of interest get included by the intersection with the buffer, doesn't it?
Yes... you'll have to refilter the points using s2_dwithin() afterward (hopefully much faster).
Yes, this looks possible:
> library(sf)
Linking to GEOS 3.10.0dev, GDAL 3.3.0, PROJ 8.0.1
> usc71k_pts <- st_read("usc71k_pts.gpkg")
Reading layer `usc71k_pts' from data source
`/home/rsb/und/uam21/units_2/usc71k_pts.gpkg' using driver `GPKG'
Simple feature collection with 71353 features and 0 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -124.6272 ymin: 24.54924 xmax: -67.01773 ymax: 48.9887
Geodetic CRS: NAD83
> s2x <- st_as_s2(usc71k_pts)
> system.time(buf250 <- s2::s2_buffer_cells(s2x, dist=96000, max_cells=250))
user system elapsed
217.071 0.008 217.662
> system.time(i250 <- s2::s2_intersects_matrix(buf250, s2x))
user system elapsed
151.617 0.000 151.948
> system.time(i250_clip <- lapply(seq_along(i250), function(i) {
+ i250[[i]][s2::s2_dwithin(s2x[i], s2x[i250[[i]]], dist=96000)]}))
user system elapsed
61.187 0.000 61.341
For 71k points, max_cells=500 took 450s and max_cells=250 took 220s for the buffering, with the intersects_matrix and the dwithin inside lapply() not varying. Can intersects_matrix or buffer_cells be parallelized? I think so. max_cells=150 is 115s for the buffer, with intersects increasing to 165s and lapply/dwithin at 63s, so multicore looks feasible with maybe max_cells=200.
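For the multicore idea, this is roughly what I would try with parallel::mclapply() (a sketch with made-up points; the workers return plain integer index lists, because the s2 external pointers do not survive being returned from a fork):
library(s2)
library(parallel)  # mclapply() forks, so this is Unix-only
set.seed(1)
pts <- as_s2_geography(s2_point(runif(1000, -1, 1), runif(1000, -1, 1), runif(1000, -1, 1)))
dist_m <- 2e5
buf <- s2_buffer_cells(pts, distance = dist_m, max_cells = 200)
# split the query side into chunks; each worker returns only integer indices
chunks <- split(seq_along(buf), cut(seq_along(buf), 4, labels = FALSE))
cand <- do.call(c, mclapply(chunks, function(idx) {
  s2_intersects_matrix(buf[idx], pts)
}, mc.cores = 4L))
# exact refiltering, also embarrassingly parallel
nb <- mclapply(seq_along(cand), function(i) {
  ci <- cand[[i]]
  ci[s2_dwithin(pts[i], pts[ci], distance = dist_m)]
}, mc.cores = 4L)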
Perhaps wait a tiny bit for me to catch up on the additions I mentioned above... the cell union in particular I think will make most of this a moot point! At the very least it is worth trying before delving into the complexities of multicore.
In the embarrassingly parallel cases, mclapply() did help, but you can see the power of indexing in the k-nearest neighbour case at the foot:
> library(sf)
Linking to GEOS 3.10.0dev, GDAL 3.3.0, PROJ 8.0.1
> usc71k_pts <- st_read("usc71k_pts.gpkg")
Reading layer `usc71k_pts' from data source
`/home/rsb/und/uam21/units_2/usc71k_pts.gpkg' using driver `GPKG'
Simple feature collection with 71353 features and 0 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -124.6272 ymin: 24.54924 xmax: -67.01773 ymax: 48.9887
Geodetic CRS: NAD83
> library(spdep)
Loading required package: sp
Loading required package: spData
> set.coresOption(4L)
NULL
# 4 cores for buffering (need to output to a list of 4 `"sfc"` objects as the s2 pointers get lost on return),
# intersects_matrix and dwithin on the points intersecting with the buffers:
> system.time(nb_buf4 <- dnearneigh(usc71k_pts, 0, 96, avoid_s2=FALSE, dwithin=FALSE))
user system elapsed
427.038 7.204 141.189
# single core legacy
> system.time(nb_leg <- dnearneigh(usc71k_pts, 0, 96, avoid_s2=TRUE))
user system elapsed
548.467 0.017 549.419
# 4 core dwithin_matrix, even slower than single core legacy
> system.time(nb_dw4 <- dnearneigh(usc71k_pts, 0, 96, avoid_s2=FALSE, dwithin=TRUE))
user system elapsed
3395.818 5.389 871.282
# buffer/intersects_matrix/dwithin gives same neighbours as dwithin_matrix
> all.equal(nb_buf4, nb_dw4, check.attributes=FALSE)
[1] TRUE
> nb_dw4
Neighbour list object:
Number of regions: 71353
Number of nonzero links: 83035752
Percentage nonzero weights: 1.63095
Average number of links: 1163.732
# single core knn with silly k
> system.time(nb_k1163 <- knearneigh(usc71k_pts, k=1163))
user system elapsed
59.290 0.228 60.006
# single core knn with realistic k
> system.time(nb_k1163 <- knearneigh(usc71k_pts, k=6))
user system elapsed
5.121 0.000 5.139
Both knn runs are far faster than legacy; distances between neighbours are also much faster. So now there is a clear basis for testing changes in s2 - these were run on the repo state as of this morning CEST. Using the input 71k tract boundaries, single core, guiding legacy snapped contiguity detection by sf intersects between bounding boxes to establish candidate lists, with NA crs: 25s. Maybe I should revisit this and assign back an input GEOGCRS before doing the intersects - done, with no particular increase in time using s2_intersects_matrix(), maybe 5s; it does find fewer neighbours, though.
Following up and reading https://github.com/r-spatial/discuss/issues/43, and passing s2_model="closed" to sf::st_intersects.sfc(), I see that the differences occur where, for non-s2 settings, bounding box intersections are found and candidate neighbours listed, but they are missed when the edges of the bounding boxes only touch, as for the small south-eastern tract here: it is a candidate neighbour with an NA or planar crs, but not a candidate neighbour when s2_intersects_matrix() is used in st_intersects():
Reprex: bbs.zip
library(sf)
bbs <- st_read("bbs.gpkg")
(s2_case <- sf::st_intersects(bbs))
s2::s2_intersects_matrix(bbs, bbs)
st_crs(bbs) <- as.character(NA)
(NA_case <- sf::st_intersects(bbs))
The 6th bounding box does not intersect the 1st bounding box with a longlat crs, but does with an NA crs.
> (s2_case <- sf::st_intersects(bbs))
Sparse geometry binary predicate list of length 6, where the predicate
was `intersects'
1: 1, 2, 3, 4, 5
2: 1, 2, 3, 5, 6
3: 1, 2, 3, 4
4: 1, 3, 4, 5
5: 1, 2, 4, 5, 6
6: 2, 5, 6
> st_crs(bbs) <- as.character(NA)
> (NA_case <- sf::st_intersects(bbs))
Sparse geometry binary predicate list of length 6, where the predicate
was `intersects'
1: 1, 2, 3, 4, 5, 6
2: 1, 2, 3, 5, 6
3: 1, 2, 3, 4
4: 1, 3, 4, 5
5: 1, 2, 4, 5, 6
6: 1, 2, 5, 6
Is this just the s2_model - I've tried most of ?s2::s2_snap_level - or am I missing other possibilities?
I'll look properly in a bit, but could it be that the edges of the bounding box are being interpreted as geodesic edges (the case in s2)? You could try a segmentize with NA crs before the intersection to test.
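Something like this, as an untested sketch using the bbs object from your reprex (the dfMaxLength value is just a guess):
library(sf)
bbs <- st_read("bbs.gpkg")
bbs_planar <- st_set_crs(bbs, NA)
# densify the straight box edges while the crs is NA, so that when the
# geodetic crs is restored the geodesic edges built by s2 stay close to
# the original cartesian rectangles
bbs_dens <- st_segmentize(bbs_planar, dfMaxLength = 0.01)
bbs_dens <- st_set_crs(bbs_dens, st_crs(bbs))
st_intersects(bbs_dens)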
Thanks. Further, looking at the bounding box corner points, I tried st_segmentize() with decreasing values of dfMaxLength=, but the gains in intersection accuracy are small and memory use increases rapidly. On balance I think that because many US census tracts have grid edges, they are a very hard case: the bounding boxes do not overlap as much as they would for more irregular geometries. Interesting to try, though.
I looked more carefully, and if no bounding box intersection is used - just st_intersects() on the input polygons - then the problem is resolved in both the planar and the spherical case. So the default will be to use "sfc" polygons (or coerce to them), run st_intersects(), remove the repeated contiguities in the lower triangle, and check for 1 (Queen) or 2 (Rook) shared boundary points within snap=.
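In outline, with three hypothetical unit-square "tracts" standing in for the input polygons:
library(sf)
sq <- function(x0, y0) st_polygon(list(rbind(c(x0, y0), c(x0 + 1, y0),
  c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0))))
pl <- st_sfc(sq(0, 0), sq(1, 0), sq(1, 1), crs = 4326)
cand <- st_intersects(pl)
# keep each candidate contiguity once: drop self matches and the lower triangle
cand_ut <- lapply(seq_along(cand), function(i) cand[[i]][cand[[i]] > i])
# each remaining (i, j) pair is then checked for at least 1 (Queen) or
# 2 (Rook) shared boundary points within snap=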
That's good to know - it makes sense that Cartesian bounding boxes are unlikely to play nicely with s2. An s2 cell union (e.g., s2_buffer_cells()) is a more sphere-friendly way to simplify an arbitrary polygon if you need to do so.
I've opened up some issues for the action items in here...open up other issues if I missed any!
I've started looking at supplementing the code for finding point neighbours within distance thresholds on the sphere in spdep through sf::st_is_within_distance(), which is OK. However, to return distances for the chosen neighbours on the sphere, at line 64 in https://github.com/r-spatial/spdep/blob/master/R/nbdists.R I iterate over subsets of points through lapply() - see also the examples at the foot of https://r-spatial.github.io/spdep/reference/dnearneigh.html - and it would be better to iterate in compiled code rather than use lapply(). Is LinkingTo feasible?
Finding k-nearest neighbours on the sphere looks even more complex for k > 1. Is there any such query in the s2 code itself? I looked at the documentation but couldn't see obvious opportunities, though I think they may be there somewhere, given that indexing is being used.
Current implementations use legacy C code for spdep::dnearneigh(), spdep::nbdists() and spdep::knearneigh(), with Great Circle WGS84 ellipsoid distances and no indexing. Any suggestions would be very welcome!