Closed rsbivand closed 1 year ago
Yeah, I think this is definitely something that can be done. I honestly have not been able to completely understand the approach taken in `localC`. I have not spent time looking at lisa_perm.R yet. Would the approach be better described as the "lookup-table" approach to p-value calculation, as described in the libgeoda paper?
If there is a paper that qualifies the differences between these two approaches, I haven't come across it yet 😅
No, it is based on creating a matrix of permuted draws and computing the simulated tests on the whole matrix at once, rather than separately for each simulation. I'll try to see what might be done.
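To make that contrast concrete, here is a minimal base-R sketch of the matrix-of-draws idea (hypothetical variable names, full rather than conditional permutation, and not spdep's actual internals): all permutations are drawn into one matrix, so every simulated local statistic comes out of a single matrix product instead of a loop over simulations.

```r
set.seed(1)
n <- 49; nsim <- 999
x <- scale(rnorm(n))[, 1]          # stand-in for the variable of interest
W <- matrix(runif(n * n), n, n)    # stand-in for a dense spatial weights matrix
diag(W) <- 0
W <- W / rowSums(W)                # row-standardise

# One matrix of permuted draws: each column is a full permutation of x
perm <- replicate(nsim, sample(x)) # n x nsim

# Simulated local statistics for all permutations at once:
# row i of (W %*% perm) holds the spatial lag at i under each permutation
sim <- x * (W %*% perm)            # n x nsim matrix of simulated I_i values
obs <- x * as.numeric(W %*% x)     # observed local statistics

# One-sided (upper tail) rank-based pseudo p-values
p_sim <- (rowSums(sim >= obs) + 1) / (nsim + 1)
```

The per-simulation alternative would call `sample()` and compute the lag `nsim` times inside a loop; the matrix form trades memory (an `n * nsim` block) for vectorised arithmetic.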
If you'd like to add `\dontrun{}` blocks to examples doing the same task in rgeoda (shorter than what is in `?localC`, but showing that the two packages do the same thing), that would help. R will not be as fast as the C++ code there, but at least the proof of concept will be exposed.
About p-values, perhaps see https://doi.org/10.1111/gean.12319, which discusses the folded p-value problem. The resolution is, I think, to follow Luc Anselin in replacing "significant" with "interesting" for all LISA "clusters", with a shift in cut-offs from (0.1, 0.05, 0.01) to (0.01, 0.005, 0.001), which means more draws are essential. Maybe also see https://r-spatial.org/book/15-Measures.html in the WIP SDSR (ch. 15; the data input and construction of weights are described in ch. 14, so look there first; we're still trying to stop Quarto right-trimming code blocks; sources in https://github.com/rsbivand/sdsr/tree/quarto).
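One way to see why the cut-off shift demands more draws (a sketch of rank-based pseudo p-values and folding in base R, not spdep's exact code): a one-tailed pseudo p-value can be "folded" by taking the smaller tail, and its smallest attainable value is `1 / (nsim + 1)`, so cut-offs like 0.001 are simply unreachable with small `nsim`.

```r
set.seed(42)
nsim <- 999
obs <- 2.1                   # a hypothetical observed local statistic
sim <- rnorm(nsim)           # stand-in for the simulated statistics

# One-tailed rank-based pseudo p-value ("greater")
rnk <- sum(sim >= obs)
p_greater <- (rnk + 1) / (nsim + 1)

# "Folded" pseudo p-value: take the smaller of the two tails
p_folded <- (min(rnk, nsim - rnk) + 1) / (nsim + 1)

# The floor is 1/(nsim + 1): with nsim = 999 the minimum is 0.001,
# so classifying at cut-offs (0.01, 0.005, 0.001) needs nsim well beyond 999
p_floor <- 1 / (nsim + 1)
```

With `nsim = 99`, the floor is 0.01, making the 0.005 and 0.001 classes empty by construction; this is the sense in which "more draws are essential".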
@JosiahParry Please see https://github.com/r-spatial/spdep/pull/99.
With the new version:
> library(spdep)
Loading required package: sp
Loading required package: spData
Loading required package: sf
Linking to GEOS 3.11.0, GDAL 3.5.2, PROJ 9.1.0; sf_use_s2() is TRUE
> packageVersion("spdep")
[1] '1.2.7'
> columbus <- st_read(system.file("shapes/columbus.shp", package="spData"))
Reading layer `columbus' from data source
`/home/rsb/lib/r_libs/spData/shapes/columbus.shp' using driver `ESRI Shapefile'
Simple feature collection with 49 features and 20 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
CRS: NA
> nb <- poly2nb(columbus)
> listw <- nb2listw(nb)
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 99))
user system elapsed
0.004 0.000 0.004
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 999))
user system elapsed
0.016 0.002 0.018
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 9999))
user system elapsed
0.154 0.019 0.174
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 99999))
user system elapsed
1.922 0.125 2.051
> save(res, file="~/tmp/res127.rda")
> require(parallel, quietly=TRUE)
> nc <- detectCores(logical=FALSE)-1L
> nc
[1] 5
> invisible(set.coresOption(nc))
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 99999))
user system elapsed
2.382 0.257 0.579
With main branch:
> library(spdep)
Loading required package: sp
Loading required package: spData
Loading required package: sf
Linking to GEOS 3.11.0, GDAL 3.5.2, PROJ 9.1.0; sf_use_s2() is TRUE
> packageVersion("spdep")
[1] '1.2.6'
> columbus <- st_read(system.file("shapes/columbus.shp", package="spData"))
Reading layer `columbus' from data source
`/home/rsb/lib/r_libs/spData/shapes/columbus.shp' using driver `ESRI Shapefile'
Simple feature collection with 49 features and 20 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
CRS: NA
> nb <- poly2nb(columbus)
> listw <- nb2listw(nb)
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 99))
user system elapsed
0.044 0.000 0.044
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 99))
user system elapsed
0.044 0.000 0.044
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 999))
user system elapsed
0.460 0.007 0.469
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 9999))
user system elapsed
4.640 0.006 4.659
> system.time(res <- localmoran_bv(columbus$CRIME, columbus$INC, listw, nsim = 99999))
user system elapsed
50.730 0.057 50.933
> res126 <- res
> load("~/tmp/res127.rda")
> str(res126)
'data.frame': 49 obs. of 2 variables:
$ Ib : num -0.8578 0.1803 0.0119 -0.0164 -0.5038 ...
$ p_sim: num 0.149 0.429 0.5 0.413 0.035 ...
> str(res)
'localmoran' num [1:49, 1:7] -0.8578 0.1803 0.0119 -0.0164 -0.5038 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:7] "Ibvi" "E.Ibvi" "Var.Ibvi" "Z.Ibvi" ...
> all.equal(res126$Ib, res[,"Ibvi"])
[1] TRUE
> plot(res126$p_sim, res[,7])
@rsbivand do you have a pre-print of https://doi.org/10.1111/gean.12319 available?
@JosiahParry The article is open access, so it should be available. It took a little time to set it OA, as the Wiley agreement was not finalised when the article was accepted. Please also see #124 for further developments.
@JosiahParry Let's try to replace `permute_listw()` before actually releasing a revised package, because Luc Anselin suggests that 0.005 should be the preferred criterion for an "interesting" p-value, which you need a large `nsim` to reach. Using the Columbus example from `localmoran_bv()` shown above: `permute_listw()` is linear in `nsim` and also sensitive to `n`, and I think it doesn't actually permit parallelization. `localmoran_perm()` isn't parallelized here, even with 6 cores.

It might make sense to factor out the mechanisms in the lisa_perm.R file, which are copied in the two LISAs, and in `localC_perm_calc()` in localC.R, which is the closest model because it already has a multivariate extension. The logic in the parallelizable versions is that `lapply()` is run on subsets of `1:n` (#subsets == #cores), permuting all but `i`. The weights are not permuted.

For now there is no `alternative=` in the new tests. What do you think?
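A base-R sketch of that conditional scheme, using only the parallel package (toy neighbour structure and hypothetical helper names, not spdep's code): `1:n` is split into one chunk per core, and for each `i` the remaining `n - 1` values are permuted while `x[i]` and the weights stay fixed.

```r
library(parallel)

set.seed(7)
n <- 49; nsim <- 199
x <- scale(rnorm(n))[, 1]
# toy neighbour lists: each i gets 4 random neighbours (stand-in for an nb object)
nbs <- lapply(seq_len(n), function(i) sample(setdiff(seq_len(n), i), 4))
wts <- lapply(nbs, function(nb) rep(1 / length(nb), length(nb)))

# conditional permutation for one observation: hold x[i] fixed, draw the
# neighbour values from the other n - 1 observations; weights are NOT permuted
cond_perm_i <- function(i) {
  others <- x[-i]
  card <- length(nbs[[i]])
  draws <- replicate(nsim, sample(others, card))   # card x nsim
  sim <- x[i] * colSums(draws * wts[[i]])          # nsim simulated I_i values
  obs <- x[i] * sum(x[nbs[[i]]] * wts[[i]])        # observed I_i
  (sum(sim >= obs) + 1) / (nsim + 1)
}

# split 1:n into chunks (#subsets == #cores) and run each chunk in parallel
nc <- 2L
cl <- makeCluster(nc)
clusterSetRNGStream(cl, 7)
clusterExport(cl, c("x", "nbs", "wts", "nsim", "cond_perm_i"))
chunks <- split(seq_len(n), cut(seq_len(n), nc, labels = FALSE))
p_sim <- unlist(parLapply(cl, chunks, function(idx) vapply(idx, cond_perm_i, 0)))
stopCluster(cl)
```

Because each observation's draws are independent of every other observation's, the chunks embarrassingly parallelize, which is what makes this scheme preferable to permuting the listw object itself.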