r-spatial / spdep

Spatial Dependence: Weighting Schemes and Statistics
https://r-spatial.github.io/spdep/
122 stars 26 forks source link

nb print method with large `nb` objects crashes R #160

Closed JosiahParry closed 4 weeks ago

JosiahParry commented 3 months ago

I'm unsure what specifically about the nb class print method is causing this, but I've noticed it previously and am encountering it again today.

With somewhat large neighbor matrices, spdep will hang in computing for quite some time leading to RStudio reporting that R is not responding.

housing <- read.csv("https://raw.githubusercontent.com/xj-liu/spatial_feature_incorporation/main/houses1990.csv") |> 
  sf::st_as_sf(coords = c("longitude", "latitude"), crs = 4326) |> 
  sf::st_jitter(0.0001)

# calculate the maximum distance for K = 1 as the distance band
upper_bound <- sfdep::critical_threshold(housing$geometry)

nb <- spdep::dnearneigh(
    housing$geometry,
    0, upper_bound
)

# printing here will cause R to hang
nb
JosiahParry commented 3 months ago

I think it may be this part: {n} disjoint connected subgraphs

rsbivand commented 3 months ago

https://github.com/r-spatial/spdep/blob/main/R%2Fsummary.nb.R#L65 was added to declare the subgraph count. This is done regularly in CARBayes and other downstream modelling packages because if the weights matrix is block diagonal, there could be multiple measures, one for each block. See also the end of https://r-spatial.org/book/14-Areal.html#contiguous-neighbours, where n.comp.nb is discussed. Maybe igraph might be faster. Don't trust RStudio about session non-responsiveness. I'll try to check how the function performs both with large n and with dense neighbours. Maybe run under debug to check the location of the slow-running line, or run n.comp.nb on the neighour object.

rsbivand commented 3 months ago

The implementation by Nicholas Lewin-Koh from 2001 uses Depth First Search; https://en.wikipedia.org/wiki/Depth-first_search gives its time complexity as order of node count plus edge count. The time complexity for igraph::components seems to be the same: https://igraph.org/c/doc/igraph-Structural.html#igraph_connected_components. igraph::is_connected has the same time complexity, unfortunately.

In: , I added an igraph option to m.comp.nb, house has ~ 25000 points:

> library(spdep)
Loading required package: spData
Loading required package: sf
Linking to GEOS 3.12.2, GDAL 3.9.0, PROJ 9.4.1; sf_use_s2() is TRUE
> data(house)
> house <- sf::st_as_sf(house)
> dim(house)
[1] 25357    25
> k6 <- knn2nb(knearneigh(house, k=6))
> system.time(o <- n.comp.nb(k6, igraph=FALSE))
   user  system elapsed 
  1.617   0.004   1.634 
> system.time(o <- n.comp.nb(k6, igraph=TRUE))
   user  system elapsed 
  2.986   0.058   3.055 
> k60 <- knn2nb(knearneigh(house, k=60))
> system.time(o <- n.comp.nb(k60, igraph=FALSE))
   user  system elapsed 
 12.218   0.020  12.307 
> system.time(o <- n.comp.nb(k60, igraph=TRUE))
   user  system elapsed 
 16.574   0.089  16.727 
> k600 <- knn2nb(knearneigh(house, k=600))
> system.time(o <- n.comp.nb(k600, igraph=FALSE))
   user  system elapsed 
275.825   0.107 277.011 
> system.time(o <- n.comp.nb(k600, igraph=TRUE))
   user  system elapsed 
322.273   1.327 325.135 

with |V| always 25357, but here |E| k=6: 183834, k=60: 1776536; k=600: 18267908. So the edge count is driving most of the time complexity. However, the nb objects are made symmetric inside n.comp.nb, and make.sym.nb takes most of the time, which needs to be handled. I'm looking at that now.

rsbivand commented 3 months ago
> library(spdep)
Loading required package: spData
Loading required package: sf
Linking to GEOS 3.12.2, GDAL 3.9.0, PROJ 9.4.1; sf_use_s2() is TRUE
> data(house)
> house <- sf::st_as_sf(house)
> k6 <- knn2nb(knearneigh(house, k=6))
> system.time(o <- n.comp.nb(k6, igraph=FALSE))
   user  system elapsed 
  1.603   0.002   1.618 
> system.time(o <- n.comp.nb(k6, igraph=TRUE))
   user  system elapsed 
  1.522   0.059   1.596 
> k60 <- knn2nb(knearneigh(house, k=60))
> system.time(o <- n.comp.nb(k60, igraph=FALSE))
   user  system elapsed 
 11.925   0.006  11.979 
> system.time(o <- n.comp.nb(k60, igraph=TRUE))
   user  system elapsed 
  6.589   0.128   6.736 
> k600 <- knn2nb(knearneigh(house, k=600))
> system.time(o <- n.comp.nb(k600, igraph=FALSE))
   user  system elapsed 
267.720   0.081 268.524 
> system.time(o <- n.comp.nb(k600, igraph=TRUE))
   user  system elapsed 
 73.950   1.633  75.786 

Commit (including some other correction of igraph use): https://github.com/r-spatial/spdep/commit/80262d0af413a3151de65de4ec35caa1ec4b1da5. Also added in the igraph=FALSE setting the possibility to interrupt the C function doing the depth-first search.

rsbivand commented 3 months ago

I dropped the igraph= argument, and in commit https://github.com/r-spatial/spdep/pull/161/commits/8e8fdbeadf2c8c990e32bf935910f6d42d682d31 detect igraph and spatialreg in n.comp.nb and use them if available. The symmetry attribute in the nb object is used. When nb objects are created, an "ncomp" attribute is added then used subsequently where needed, controlled by get.SubgraphOption (default TRUE) and get.SubgraphCeiling (default 10000) being less than |N|+|E|.

JosiahParry commented 3 months ago

Brilliant thank you so much! I think this will be an improvement for users!

rsbivand commented 3 months ago

Running reverse dependency checks now, should be OK, but I'll also contact packages maintainers using n.comp.nb to alert them to the changes, and to ask for input before moving forward.

rsbivand commented 3 months ago

These use n.comp.nb:

The changes do not cause errors, but maybe they could use attr(, "ncomp") instead of recalculating the output of n.comp.nb, as for example now in spatialreg::isCyclical: https://github.com/r-spatial/spatialreg/blob/d0ce698e65113ad2aecbabd80006e5d20f3039d1/R/cyclical.R#L8-L9