r-spatial / spdep

Spatial Dependence: Weighting Schemes and Statistics
https://r-spatial.github.io/spdep/

Consider adding jitter option to knearneigh #152

Closed JosiahParry closed 1 month ago

JosiahParry commented 1 month ago

Point data often has its coordinates truncated to 2 or 3 decimal places. As a result, the dataset can contain duplicate points. When there are duplicate points, a KD-tree cannot be used, the results of knearneigh can be a bit odd, and some points' neighbors end up duplicated.

In scenarios like this, it can be safe to assume that the points are approximations of a location. We can apply a very small jitter to the points to make sure that they are not identical points but are similarly proximate.

I think having this as an option in knearneigh could be very handy! For example:

library(sf)
library(spdep)

st_knn <- function(geometry, k = 1, symmetric = FALSE, ...)  {
  ks <- spdep::knearneigh(geometry, k = k, ...)
  nb <- spdep::knn2nb(ks, sym = symmetric)
  nb
}

houses <- readr::read_csv("https://raw.githubusercontent.com/xj-liu/spatial_feature_incorporation/main/houses1990.csv") |> 
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)

locs <- st_geometry(houses)

# notice duplicate entries and warning regarding rbind issue
head(st_knn(locs, 10))

# here we apply very small jittering & now
# there are no warnings & we have similar answers
st_jitter(locs, 0.001) |> 
  st_knn(10) |> 
  head()

rsbivand commented 1 month ago

Please see https://github.com/r-spatial/spdep/commit/a9e435bc8e24c7f293ae22dbef39832d80fb05df and https://github.com/r-spatial/spdep/commit/22f6f5f97a093525da1cfa1ca7e4e1c93321ba79, and try installing the development version:

> head(st_knn(locs, 10))
Error in spdep::knearneigh(geometry, k = k, ...) : 
  increase k; k must be at least as large as the largest count of identical points

The rbind problem arose because s2 returns only k+1 candidates, which may not include the i-th ID itself; that ID is the one that needs to be deleted, so the rbind step failed. The locations are typically apartments with the same front door. It seems generally better to increase k so that all observations at that point are included as neighbours. A jitter means that arbitrary and random points (set.seed is needed for reproducibility) become neighbours even when they are identical in 2D; further, the jitter value would have to be given in the units of the coordinates (1 foot, 0.3 m, 0.0001 degrees??). Jitter is feasible, but not a good idea. 3D knn is possible, but I think not with s2.
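To make the "increase k" suggestion concrete, here is a minimal sketch, not the package's own method. It assumes a plain coordinate matrix of made-up toy points; `coords`, `loc_key`, `max_dup` and `k_wanted` are hypothetical names introduced only for illustration. It counts the largest group of identical points and uses that as a lower bound for k.

```r
# Toy coordinates (hypothetical data): three observations share one location,
# e.g. apartments behind the same front door.
coords <- rbind(
  c(0, 0), c(0, 0), c(0, 0),
  c(1, 0), c(0, 1), c(1, 1), c(2, 2)
)

# Count how many observations sit at each distinct location.
loc_key <- paste(coords[, 1], coords[, 2])
max_dup <- max(table(loc_key))

# Choose k no smaller than the largest count of identical points.
k_wanted <- 2
k <- max(k_wanted, max_dup)

# Only run the spdep step if the package is installed.
if (requireNamespace("spdep", quietly = TRUE)) {
  # Identical points may still raise a warning; it is suppressed here
  # because the duplicates in this toy example are deliberate.
  ks <- suppressWarnings(spdep::knearneigh(coords, k = k))
  nb <- spdep::knn2nb(ks)
  print(nb)
}
```

For an sfc object such as `locs` above, the same count could probably be obtained from `max(lengths(sf::st_equals(locs)))`, though that has not been checked against this dataset.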