Closed JosiahParry closed 1 year ago
A rough implementation of this can be seen below. It is somewhat inefficient, however, because run_perm() runs on all observations, and for the local join count not all observations will have a p-value. So it may be useful to modify run_perm() to take a vector of indexes as an alternative to the number of observations in the vector. In the example below the new implementation is significantly faster. A p-value is required for only 23 observations, roughly 46% of the total. So we could theoretically cut the computation roughly in half again in this use case by modifying run_perm().
local_jc_uni <- function(x, listw, nsim, alternative = "two.sided") {
  xj <- find_xj(x, listw$neighbours)

  # identify which observations should have a p-value:
  # x_i must be 1 and at least one neighbour must be 1
  x_index <- which(x == 1L)
  xj_index <- which(vapply(xj, function(v) any(v == 1L), logical(1)))
  index <- intersect(xj_index, x_index)

  obs <- x * lag.listw(listw, x)
  crd <- card(listw$neighbours)
  lww <- listw$weights

  env <- new.env()
  assign("crd", crd, envir = env)   # cardinality
  assign("lww", lww, envir = env)   # weights
  assign("nsim", nsim, envir = env) # number of simulations
  assign("xi", x, envir = env)      # x col
  assign("obs", obs, envir = env)   # observed values
  varlist <- ls(envir = env)

  # rank of the observed statistic for observation i among nsim permutations
  permBB_int <- function(i, env) {
    crdi <- get("crd", envir = env)[i]
    x <- get("xi", envir = env)
    x_i <- x[-i]
    w_i <- get("lww", envir = env)[[i]]
    nsim <- get("nsim", envir = env)
    obs <- get("obs", envir = env)
    sx_i <- matrix(sample(x_i, size = crdi * nsim, replace = TRUE),
                   ncol = crdi, nrow = nsim)
    res_i <- x[i] * (sx_i %*% w_i)
    rank(c(res_i, obs[i]))[(nsim + 1)]
  }

  probs <- probs_lut("BB", nsim, alternative)
  if (alternative == "two.sided") probs <- probs / 2

  # still runs over all observations; see the note above about passing indexes
  p_ranks <- run_perm(permBB_int, length(x), env, NULL, varlist)

  # look up the p-value for each rank, then keep only the indexed observations
  ps <- probs[p_ranks]
  p_res <- rep(NA_real_, length(x))
  p_res[index] <- ps[index]

  res <- data.frame(obs, p_res)
  colnames(res) <- c("BB", attr(probs, "Prname"))
  res
}
data(oldcol)
x <- ifelse(COL.OLD$CRIME < 35, 0L, 1L)
listw <- nb2listw(COL.nb, style = "B")
bm <- bench::mark(
  og = local_joincount_uni(x, listw, nsim),
  new = local_jc_uni(x, listw, nsim),
  check = FALSE
)
#> # A tibble: 2 × 13
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list>
#> 1 og 219.14ms 229.91ms 4.41 179.52MB 7.35 3 5 680ms <NULL>
#> 2 new 8.28ms 8.48ms 110. 7.41MB 7.98 55 4 501ms <NULL>
ggplot2::autoplot(bm)
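To make the "vector of indexes" idea concrete, here is a minimal base R sketch that does not use spdep at all. The toy data, the neighbour list nb, the weights wts, and the helper perm_rank() are all assumptions made for illustration; this is not how run_perm() is implemented. It only shows that the permutation step can be restricted to the observations that actually need a p-value.

```r
# Toy data (assumed for illustration): 6 observations on a line, binary weights
x <- c(1L, 1L, 0L, 1L, 1L, 0L)
nb <- list(2L, c(1L, 3L), c(2L, 4L), c(3L, 5L), c(4L, 6L), 5L)
wts <- lapply(nb, function(j) rep(1, length(j)))
nsim <- 99L

# Observed local join count: x_i * sum_j w_ij * x_j
obs <- x * vapply(seq_along(x),
                  function(i) sum(wts[[i]] * x[nb[[i]]]),
                  numeric(1))

# Only observations with x_i == 1 and at least one neighbour equal to 1
# need a p-value; with positive weights that is the same as obs > 0
index <- which(x == 1L & obs > 0)

# Hypothetical helper: rank of the observed value among nsim permutations
perm_rank <- function(i) {
  crdi <- length(nb[[i]])
  sx_i <- matrix(sample(x[-i], size = crdi * nsim, replace = TRUE),
                 nrow = nsim, ncol = crdi)
  res_i <- x[i] * (sx_i %*% wts[[i]])
  rank(c(res_i, obs[i]))[nsim + 1L]
}

set.seed(1)
ranks <- rep(NA_real_, length(x))
# Loop over the index vector only, rather than over 1:length(x)
ranks[index] <- vapply(index, perm_rank, numeric(1))
```

Observations 3 and 6 are never permuted here, which is where the additional saving would come from.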
https://github.com/r-spatial/spdep/commit/757412c1cf2dac77de48daa542b30358a546fe5f splits indices (1:n in standard cases). idx can be a vector of indices (do we need to check that the indices are unique?).
I don't think it would hurt to add the check. I've not come across a scenario where having the same index multiple times would be useful.
https://github.com/r-spatial/spdep/commit/36cdb1fc8b1fba781d12ca6e378b5b594705430d adds a hard test for unique indices.
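For readers who want a feel for what such a check might look like, here is a small sketch. The function name check_idx() and its error messages are hypothetical and are not taken from the commit; it only illustrates rejecting duplicated or out-of-range indices before permuting.

```r
# Hypothetical validator; name and messages are assumptions, not spdep code
check_idx <- function(idx, n) {
  if (!is.numeric(idx) || anyNA(idx)) {
    stop("idx must be a numeric vector without NA")
  }
  if (any(idx < 1 | idx > n | idx != floor(idx))) {
    stop("idx must contain whole numbers in 1:n")
  }
  if (anyDuplicated(idx) > 0L) {
    stop("idx must not contain duplicated indices")
  }
  invisible(as.integer(idx))
}

check_idx(c(1, 4, 5), n = 6)  # passes silently

# A repeated index is rejected; capture the message instead of stopping
msg <- tryCatch(check_idx(c(1, 4, 4), n = 6),
                error = function(e) conditionMessage(e))
```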
From #99 you write
"The local join count sketches are much poorer than the existing global measures - should take factors not logical/integer, and should accommodate a choice of level rather than impose 0/1 only."
Anselin 2019 defines the local join count, both bivariate and univariate, in the context of 1s and 0s, e.g., "for binary variables, coded as 0 and 1, the global spatial autocorrelation statistic of choice is the join-count statistic." For the local univariate join count I think it would be okay to utilize a factor. But how would we identify what would be considered presence or absence (1 and 0 respectively)? When you say "choice of level", is that what you mean? For example, we could let the user specify an argument observed or something like that. If not specified, it could utilize the least frequent level.
data(oldcol)
fx <- cut(COL.OLD$CRIME, breaks=c(0,35,80), labels=c("low","high"))
# identify the least frequently observed class
(observed_level <- names(which.min(table(fx))))
#> [1] "high"
It is less clear to me how we would use factors in the bivariate case, as we need to identify presence (1s) in two separate variables. The co-presence of 1 and 1 is used in the CLC case. In the BJC case we can infer it, because it requires that when observation xi has the value 1, zi is 0. Do you have thoughts on how we could handle this using two factors?
Yes, I thought of the chosen level as TRUE, all others as FALSE.
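A minimal sketch of that mapping, assuming a hypothetical helper as_presence() with a chosen argument that defaults to the least frequent level (the name and the default are illustrative, not an agreed interface):

```r
fx <- factor(c("low", "high", "high", "low", "low"),
             levels = c("low", "high"))

# Hypothetical helper: the chosen level becomes 1 (presence), all others 0
as_presence <- function(fx, chosen = names(which.min(table(fx)))) {
  stopifnot(is.factor(fx), chosen %in% levels(fx))
  as.integer(fx == chosen)
}

x_default <- as_presence(fx)               # "high" is the least frequent level
x_low <- as_presence(fx, chosen = "low")   # explicit choice of level
```

The resulting 0/1 vector could then be passed to the existing integer-based code unchanged.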
I thought of interaction between factors, such as https://rdrr.io/r/base/interaction.html
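As a rough illustration of the interaction idea, the sketch below crosses two made-up factors with base R's interaction() and then picks one combined level as the co-presence case. The factor names, their levels, and the choice of "high.yes" are assumptions for the example only.

```r
fx <- factor(c("high", "low", "high", "low", "high"))
fz <- factor(c("yes", "yes", "no", "no", "yes"))

# One combined factor with a level for every pairing of the two inputs
fxz <- interaction(fx, fz)
levels(fxz)

# Co-presence (the CLC case): both chosen levels occur at the same observation
clc <- as.integer(fxz == "high.yes")
```

Choosing a single level of the combined factor reduces the bivariate case to the same "chosen level versus all others" coding as the univariate one.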
In relation to #98 and #99.
The current permute_listw() approach to calculating simulated p-values is inefficient. #99 introduces a better approach to accomplishing this. A similar approach should be taken for the local join counts that were introduced in #94 and #97.