r-spatial / spdep

Spatial Dependence: Weighting Schemes and Statistics
https://r-spatial.github.io/spdep/
116 stars 26 forks source link

Deprecate permute_listw() #98

Closed rsbivand closed 1 year ago

rsbivand commented 1 year ago

@JosiahParry Let's try to replace permute_listw() before actually releasing a revised package, because Luc Anselin suggests that 0.005 should be the preferred criterion for an "interesting" p-value, which you need a large nsim to reach. Using the Columbus example from localmoran_bv():

> system.time(res_xy <- localmoran_bv(x, y, listw, nsim=99))
   user  system elapsed 
  0.046   0.000   0.047 
> system.time(res_xy <- localmoran_bv(x, y, listw, nsim=999))
   user  system elapsed 
  0.471   0.000   0.473 
> system.time(res_xy <- localmoran_bv(x, y, listw, nsim=9999))
   user  system elapsed 
  4.833   0.004   4.851 
> system.time(res_xy <- localmoran_bv(x, y, listw, nsim=99999))
   user  system elapsed 
 51.106   0.045  51.300 
> system.time(res_x <- localmoran_perm(x, listw, nsim=99))
   user  system elapsed 
  0.007   0.000   0.007 
> system.time(res_x <- localmoran_perm(x, listw, nsim=999))
   user  system elapsed 
  0.022   0.000   0.022 
> system.time(res_x <- localmoran_perm(x, listw, nsim=9999))
   user  system elapsed 
  0.191   0.000   0.191 
> system.time(res_x <- localmoran_perm(x, listw, nsim=99999))
   user  system elapsed 
  2.268   0.031   2.310 

So permute_listw() is linear in nsim and also sensitive to n, and actually doesn't permit parallelization I think. localmoran_perm() isn't parallelized here - with 6 cores:

> get.mcOption()
[1] TRUE
> get.coresOption()
NULL
> set.coresOption(6L)
NULL
> get.coresOption()
[1] 6
> system.time(res_x <- localmoran_perm(x, listw, nsim=99999))
   user  system elapsed 
  2.209   0.245   0.573 

It might make sense to factor out the mechanisms in the lisa_perm.R file, which are copied in the two LISAs, and in localC_perm_calc() in localC.R, which is the closest model because it already has a multivariate extension. The logic in the parallelizable versions is that lapply() is run on subsets of 1:n (#subsets == #cores), permuting all but i. The weights are not permutated.

For now there is no alternative= in the new tests.

What do you think?

JosiahParry commented 1 year ago

Yeah, I think this is definitely something that can be done. I honestly have not been to completely understand the approach taken in localC. I have not spent time looking at lisa_perm.R yet. Would the approach be better described as the "lookup-table" approach to p-value calculation as described in the libgeoda paper?

If there is a paper that qualifies the differences between these two approaches I haven't come across it yet 😅

rsbivand commented 1 year ago

No, it is based on creating a matrix of permutated draws and computing the simulated tests on the whole matrix at once, rather than separately for each simulation. I'll try to see what might be done.

If you'd like to add \dontrun{} blocks to examples doing the same task in rgeoda (shorter than what is in ?localC but showing that the two packages do the same thing), that would help. R will not be as fast as the C++ code there, but at least the proof of concept will be exposed.

rsbivand commented 1 year ago

About p-values, perhaps see: https://doi.org/10.1111/gean.12319, there is a discussion about the folded p-value problem. The resolution is I think to go for Luc Anselin's "interesting" replacing "significant" for all LISA "clusters" with a shift in cut-offs from (0.1, 0.05, 0.01) to (0.01, 0.005, 0.001) - which means more draws are essential. Maybe also see: https://r-spatial.org/book/15-Measures.html in WIP SDSR (in ch 15, the data input and construction of weights is described in ch 14, so look there first; we're still trying to stop Quarto right-trimming code blocks; sources in https://github.com/rsbivand/sdsr/tree/quarto).

rsbivand commented 1 year ago

@JosiahParry Please see https://github.com/r-spatial/spdep/pull/99.

With the new version:

> library(spdep)
Loading required package: sp
Loading required package: spData
Loading required package: sf
Linking to GEOS 3.11.0, GDAL 3.5.2, PROJ 9.1.0; sf_use_s2() is TRUE
> packageVersion("spdep")
[1] '1.2.7'
> columbus <- st_read(system.file("shapes/columbus.shp", package="spData"))
Reading layer `columbus' from data source 
  `/home/rsb/lib/r_libs/spData/shapes/columbus.shp' using driver `ESRI Shapefile'
Simple feature collection with 49 features and 20 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
CRS:           NA
> nb <- poly2nb(columbus)
> listw <- nb2listw(nb)
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 99))
   user  system elapsed 
  0.004   0.000   0.004 
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 999))
   user  system elapsed 
  0.016   0.002   0.018 
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 9999))
   user  system elapsed 
  0.154   0.019   0.174 
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 99999))
   user  system elapsed 
  1.922   0.125   2.051 
> save(res, file="~/tmp/res127.rda")
> require(parallel, quietly=TRUE)
> nc <- detectCores(logical=FALSE)-1L
> nc
[1] 5
> invisible(set.coresOption(nc))
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 99999))
   user  system elapsed 
  2.382   0.257   0.579 

With main branch:

> library(spdep)
Loading required package: sp
Loading required package: spData
Loading required package: sf
Linking to GEOS 3.11.0, GDAL 3.5.2, PROJ 9.1.0; sf_use_s2() is TRUE
> packageVersion("spdep")
[1] '1.2.6'
> columbus <- st_read(system.file("shapes/columbus.shp", package="spData"))
Reading layer `columbus' from data source 
  `/home/rsb/lib/r_libs/spData/shapes/columbus.shp' using driver `ESRI Shapefile'
Simple feature collection with 49 features and 20 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
CRS:           NA
> nb <- poly2nb(columbus)
> listw <- nb2listw(nb)
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 99))
   user  system elapsed 
  0.044   0.000   0.044 
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 99))
   user  system elapsed 
  0.044   0.000   0.044 
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 999))
   user  system elapsed 
  0.460   0.007   0.469 
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 9999))
   user  system elapsed 
  4.640   0.006   4.659 
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 99999))
   user  system elapsed 
 50.730   0.057  50.933
> res126 <- res
> load("~/tmp/res127.rda")
> str(res126)
'data.frame':   49 obs. of  2 variables:
 $ Ib   : num  -0.8578 0.1803 0.0119 -0.0164 -0.5038 ...
 $ p_sim: num  0.149 0.429 0.5 0.413 0.035 ...
> str(res)
 'localmoran' num [1:49, 1:7] -0.8578 0.1803 0.0119 -0.0164 -0.5038 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:7] "Ibvi" "E.Ibvi" "Var.Ibvi" "Z.Ibvi" ...
> all.equal(res126$Ib, res[,"Ibvi"]) 
[1] TRUE
> plot(res126$p_sim, res[,7])

image

rsbivand commented 1 year ago

@JosiahParry The article is open access, so should be available. It took a little time to set it OA as the Wiley agreement was not finalised when the article was accepted. Please also see #124 for further developments.

@rsbivand https://github.com/rsbivand do you have a pre-print of https://doi.org/10.1111/gean.12319 available?