Refactor local join count to utilize new permutation approach

JosiahParry commented 2 years ago

In relation to #98 and #99.

The current permute_listw() approach to calculating simulated p-values is inefficient. #99 introduces a better approach to accomplishing this. A similar approach should be taken for local join counts that were introduced in #94 and #97

JosiahParry commented 2 years ago

A rough implementation of this can be seen below.

This is somewhat inefficeint, however, because run_perm() runs on all observations. For the local join count not all observations will have a p-value. So it may be useful to modify run_perm() to take a vector of indexes as an alternative to the number of observations in the vector. In the below example the new implementation is significantly faster. The number of observations a p-value is required for is 23 ~46% of the observations. So we could theoretically cut the computation further in half in this use case by modifying run_perm().

local_jc_uni <- function(x, listw, nsim, alternative = "two.sided") {

  xj <- find_xj(x, listw$neighbours)
  # identify which observations should have a p-value
  x_index <- which(x == 1L)
  xj_index <- which(unlist(lapply(xj, function(x) any(x == 1L))) == TRUE)
  index <- intersect(xj_index, x_index)

  obs <- x * lag.listw(listw, x)

  crd <- card(listw$neighbours)
  lww <- listw$weights

  env <- new.env()

  assign("crd", crd, envir = env) # cardinality
  assign("lww", lww, envir = env) # weights
  assign("nsim", 999, envir=env) # weights
  assign("xi", x, envir = env) # x col
  assign("obs", obs, envir = env) # observed values
  varlist = ls(envir = env)

  permBB_int <- function(i, env) {

    crdi <- get("crd", envir = env)[i]
    x <- get("xi", envir = env)
    x_i <- x[-i]
    w_i <- get("lww", envir = env)[[i]]
    nsim <- get("nsim", envir = env)
    obs <- get("obs", envir = env)
    sx_i <- matrix(sample(x_i,
                          size = crdi * nsim,
                          replace = TRUE),
                   ncol = crdi,
                   nrow = nsim)

    res_i <- x[i] * (sx_i %*% w_i)
    # sum(res_i)
    rank(c(res_i, obs[i]))[(nsim + 1)]

  }

  probs <- probs_lut("BB", nsim, alternative)

  if (alternative == "two.sided") probs <- probs / 2

  p_ranks <- run_perm(permBB_int, length(x), env, NULL, varlist)

  # two-sided p-value
  ps <- probs
  p_res <- rep(NA_real_, length(x))
  p_res[index] <- ps[index]

  res <- data.frame(obs, p_res)
  colnames(res) <- c("BB", attr(probs, "Prname"))
  res
}

Example Bench Mark

data(oldcol)
x <- ifelse(COL.OLD$CRIME < 35, 0L, 1L)
listw <- nb2listw(COL.nb, style = "B")

bm <- bench::mark(
  og = local_joincount_uni(x, listw, nsim),
  new = local_jc_uni(x, listw, nsim),
  check = FALSE
)

#> # A tibble: 2 × 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
#>  1 og         219.14ms 229.91ms      4.41  179.52MB     7.35     3     5      680ms <NULL>
#>  2 new          8.28ms   8.48ms    110.      7.41MB     7.98    55     4      501ms <NULL>

ggplot2::autoplot(bm)

rsbivand commented 2 years ago

https://github.com/r-spatial/spdep/commit/757412c1cf2dac77de48daa542b30358a546fe5f splits indices (1:n in standard cases). idx can be a vector of indices (do we need to check that indices are unique)?

JosiahParry commented 2 years ago

I don't think it would hurt to add the check. I've not come across a scenario where having the same index multiple times would be useful.

rsbivand commented 2 years ago

https://github.com/r-spatial/spdep/commit/36cdb1fc8b1fba781d12ca6e378b5b594705430d adds a hard test for unique indices.

JosiahParry commented 2 years ago

From #99 you write

"The local join count sketches are much poorer than the existing global measures - should take factors not logical/integer, and should accommodate a choice of level rather than impose 0/1 only."

Anselin 2019 defines the local joincount—both bivariate and univariate—in the context of 1s and 0s e.g "for binary variables, coded as 0 and 1, the global spatial autocorrelation statistic of choice is the join-count statistic." For the local univariate join count I think it would be okay to utilize a factor. But how would be identify what would be considered presence or absense (1 and 0 respectively)? When you say "choice of level" is that what you mean? For example, we could let the user specify an argument observed or something like that. If not specified it could utilize the least frequent level.

data(oldcol)
fx <- cut(COL.OLD$CRIME, breaks=c(0,35,80), labels=c("low","high"))

# identify the least frequently observed class
(observed_level <- names(which.min(table(fx))))
#> [1] "high"

It is less clear to me how we would use factors in the bivariate case as we need to identify presence (1s) in two separate variables. The co-presence of 1 and 1 is used in the CLC case. In BJC case we can infer it because it necessitates that when when observation xi has the value 1 then zi is 0. Do you have thoughts on how we could handle this using two factors?

rsbivand commented 2 years ago

Yes, I though of chosen level as TRUE, all other as FALSE.
I thought of interaction between factors, such as https://rdrr.io/r/base/interaction.html

r-spatial / spdep

Refactor local join count to utilize new permutation approach #100

Example Bench Mark