r-spatial / spdep

Spatial Dependence: Weighting Schemes and Statistics
https://r-spatial.github.io/spdep/
116 stars 26 forks source link

permutations in localG_perm #124

Closed rosieluain closed 1 year ago

rosieluain commented 1 year ago

Thank you for the very prompt fix #123 - the probabilities look more reasonable to me now. However, the G values are not as expected, which led me to pick up on another couple of potential issues:

  1. For both localG (line 49 in localG.R) and localG_perm (line 346 in lisa_perm.R), the final G values are calculated by: res <- res / sqrt(VG) For localG, VG is calculated from the dataset, but in localG_perm, VG is the variance of the expected value from the permutations. I am not sure why the variance in the permutations should affect the values of G. Should VG be calculated as for localG in both functions?

  2. In localG_perm, I see that the sampling is done with replacement. Of course values would need to be replaced within each column, but I think that replacement should be FALSE within each row (so you do not have any duplicate values/observations within the lagged sample).

Thank you again, and sorry for all the trouble.

rsbivand commented 1 year ago

Thanks for very careful reading of the code and well-motivated points. I've added 1) to code not yet committed/pushed to github. With regard to 2), an R-only implementation is an order of magnitude slower - because it iterates over nsim, and sampling without replacement is in any case slower. I'm loking at implementing in C to speed this up a little. Should I commit/push what I have already done? May I ask whether I could add you as a contributor to the package, in which case a name and optionally ORCiD would be needed.

rsbivand commented 1 year ago

Thanks for very careful reading of the code and well-motivated points. I've added 1). With regard to 2), an R-only implementation is an order of magnitude slower - because it iterates over nsim, and sampling without replacement is in any case slower. I've added an implementation in C to speed this up a little (2x not 12x case with replacement). May I ask whether I could add you as a contributor to the package, in which case a name and optionally ORCiD would be needed?

JosiahParry commented 1 year ago

Good looks @rosieluain! To confirm sampling is now done without replacement at $i$ (which makes sense a geography can't occur twice)?.

Regarding 1) this actualy came up in a convo with a colleague yesterday regarding the Local Moran. I'd love to understand this better. My understanding in using the permutation sample variance you are scaling by the expected variance under spatial randomness which to me feels a bit better.

Though in practice they're essentially identical

data(guerry, package = "sfdep")

library(spdep)
listw <- nb2listw(poly2nb(guerry$geometry))
x <- guerry$crime_pers

G <- localG(x, listw, return_internals = TRUE)
Gv <- attr(G, "internals")[,3]

Gp <- localG_perm(x, listw)
Gpv <- attr(Gp, "internals")[,3]

plot(Gpv, Gv, main = paste0("Corrlation: ", round(cor(Gpv, Gv), 2)))

Created on 2023-02-03 by the reprex package (v2.0.1)

rosieluain commented 1 year ago

@JosiahParry, good thought exercise!

I think that when using the permutation sample variance, you are right that you are scaling by the expected value and the variance of the expected value, for a spatially random dataset. This might be ok if the data were spatially random; however, we don't actually know that the data are random. So I think the expected value and variance need to be estimated from the (non-randomized) dataset itself. And then the purpose of the permutations is just to compare to the dataset to determine if the data are statistically different from spatially random.

What led me to identify this issue in the first place, is that if using the variance of the expected value from permutation, there is an issue when calculating G for a location that does not have any neighbours: The variance is 0 (if sampling without replacement), so the calculation for G fails when trying to divide by 0. If using the variance estimated from the data, a value for G is still possible, but the p value obtained through permutations is 1. This makes sense to me, because the G lets you see how this location compares to rest of the data, but p = 1 indicates there is no spatial structure within distance d (or there is not enough information to determine if there is spatial structure).

rosieluain commented 1 year ago

@rsbivand, Thank you, but no need to add me as a contributor. Thank you for being so attentive and helping me work through things!

rsbivand commented 1 year ago

Submission to CRAN to come 28 February morning CET - please indicate before then if any obvious impediments.

rsbivand commented 1 year ago

"adegraphics" https://github.com/sdray/adegraphics/pull/14 "adespatial" https://github.com/sdray/adespatial/pull/24 "spData" https://github.com/Nowosad/spData/pull/63 "Voyager" https://github.com/pachterlab/voyager/issues/3

Submission: 11:16 CET (230228) Thanks, on its way to CRAN: 14:02 (230228)