r-spatial / spdep

Spatial Dependence: Weighting Schemes and Statistics
https://r-spatial.github.io/spdep/

Issues with mat2listw function to create spatial weights for local Moran calculation #110

Closed. Ramon-88 closed this issue 1 year ago.

Ramon-88 commented 1 year ago

Hi everyone and thanks for developing this amazing package.

My aim is to define a spatial weight matrix using travel time distances in order to calculate the local Moran index.

To do this, I first calculated travel time distances between coordinates and obtained a square matrix; I then inverted the entries so that closer locations get larger weights and vice versa.

Once I had the inverse distance matrix, I used the mat2listw function to create the spatial weights to pass to the localmoran function.

This is the code I used:

ttm # travel time matrix between coordinates

# inverse travel time matrix: closer units get larger weights
matr_dist <- as.matrix(ttm)
matr_inv <- 1/matr_dist
diag(matr_inv) <- 0

# definition of the spatial weights
listw_ttm <- mat2listw(matr_inv, row.names = row.names(matr_inv), style = "W")

When I pass it to localmoran it apparently works fine; the problem is that the results are unusually low, also when I check global Moran's I with moran.test and when I use a distance-based criterion (dnearneigh) to define the spatial weights.

I'm wondering if I'm following the right procedure for defining the spatial weights from the travel time matrix (in which case I need to check and rethink my input data), or if there is something wrong with the way I'm doing it. Thanks in advance for any help or suggestion.

rsbivand commented 1 year ago

Reproducible example please, there is really no way of knowing otherwise. It is very often the case that all-to-all distances and travel times actually tell nothing more than contiguity for polygon objects, so the extra work is pointless. They also run into Smith's finding that dense weights matrices over-smooth spatial processes.
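One quick way to check that claim on one's own data is to compare global Moran's I under simple queen contiguity and under dense inverse-distance weights. A minimal sketch, using the nc data shipping with sf and its BIR74 column purely for illustration:

library(sf)
library(spdep)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
lw_q <- nb2listw(poly2nb(nc), style = "W")        # queen contiguity weights
d <- st_distance(st_centroid(st_geometry(nc)))    # all-to-all centroid distances
di <- 1/unclass(d)                                # drop units, invert so nearer units weigh more
diag(di) <- 0
lw_d <- mat2listw(di, style = "W")                # dense inverse-distance weights
moran.test(nc$BIR74, lw_q)
moran.test(nc$BIR74, lw_d)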

Ramon-88 commented 1 year ago

Thanks for your answer. I made a reproducible example, hoping it makes my question clearer. As you pointed out, temporal distances may not be a useful choice for defining spatial relationships between areal units. Since I am running some tests, I would like to understand whether there is something wrong with the procedure I'm using to define the spatial weights from the travel time matrix, or whether it is just a matter of the input data and the criteria used to define the weights.

library(sf)
library(sp)
library(spdep)

# import shapefile
fname <- system.file("shape/nc.shp", package="sf")
nc <- st_read(fname)

# variable of interest
df <- data.frame(variable = c(-0.001714399, 0.256668055, -0.597164275, -0.566573086, -0.002722550, -0.002733780,
-0.009264913, -0.003621270, 0.371953423, -0.727381311, -0.732099064, -0.549915871,
-0.533963972, -0.787935963, -0.000292474, -0.677922405, -0.765987376, -0.284974522,
-0.764320275, -0.008769582, -0.005902430, -0.000826053, -0.694321497, -0.700700586,
-0.662195753, -0.883479707, -0.735420022, -0.133263436, -0.884552823, -0.004850738,
-0.710551003, -0.829068668, -0.698893260, -0.001848377, -0.000748199, -0.875747815,
-0.009846376, -0.801457592, -0.481369203, -0.712644719, -0.966750169, -0.000653413,
-0.000209122, -0.003707078, -0.002011228, -0.000226881, -0.645245881, -0.000830129,
-0.004492861, -0.004190275, 0.533890900, -0.752579438, -0.266293097, -0.086249522,
0.003788232, 0.000254342, -0.000165757, -0.735689142, -0.008107195, -0.001087987,
-0.003541964, 0.000788232, 0.259767521, 0.006927175, -0.059271680, -0.057473118,
-0.007432981, -0.068429217, -0.031814225, -0.530987616, 0.218492370, -0.097229482,
-0.009933875, -0.074403601, 0.016554854, -0.002876477, -0.042146313, 0.054138943,
0.028527471, -0.055526646, -0.255526646, -0.008721035, -0.264547228, -0.026571293,
-0.031856857, -0.071923958, 0.126946977, -0.245643891, 0.213177432, -0.000066652,
-0.007620088, -0.02177310, 0.045515998, 0.053253643, 0.062845708, -0.010811853,
0.066837906, -0.000366652, 0.000298832, 0.000298843))

# add variable to shapefile
nc$variable <- df$variable

# get centroids
centroids <- st_centroid(nc)
# get spatial points
s_point <- as(centroids, "Spatial")
lon_lat <- coordinates(s_point)
lon_lat <- as.data.frame(lon_lat)
colnames(lon_lat) <- c("x","y")

# simulated travel times
set.seed(659235)
ttm <- matrix(sample(1:80, 10000, replace = TRUE), ncol = 100)

# inverse travel time matrix
matr_inv <- 1/ttm
diag(matr_inv) <- 0

# definition of the spatial weights
listw_ttm <- mat2listw(matr_inv, row.names = row.names(matr_inv), style = "W")

# Moran test
moran.test(nc$variable, listw_ttm)

rsbivand commented 1 year ago

So what is the problem? No errors. Please do not use tigris, which requires internet access (I'm travelling).

Ramon-88 commented 1 year ago

I updated the example code to use a shapefile from the sf package. Thanks for checking the code; my main concern was the correctness of the procedure, but now I know it is all about my input data and the travel matrix I use to create a listw object. I'd like to ask one last thing: when one uses criteria like queen or rook contiguity, it is easy to identify the neighbours of an areal unit, but with an inverse distance matrix it is not, as all units are connected and it is just a question of different weights. I was wondering if there is a way to pick out the units that are considered closer, also in order to represent them in a plot.
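Something like the following sketch is roughly what I have in mind (the 50 km threshold is just a guess for illustration), if it is on the right track:

coords <- st_coordinates(st_centroid(st_geometry(nc)))
nb_d <- dnearneigh(coords, 0, 50, longlat = TRUE)  # neighbours within 50 km
plot(st_geometry(nc), border = "grey")
plot(nb_d, coords, add = TRUE)                     # draw links between "close" units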

rsbivand commented 1 year ago
ttm0 <- st_distance(centroids)
ttm1 <- units::set_units(ttm0, "miles")
ttm2 <- ttm1 / units::as_units(40, "miles/h")
ttm3 <- units::set_units(ttm2, "minutes")
matr_inv <- 1/ttm3
diag(matr_inv) <- 0
listw_ttm <- mat2listw(unclass(matr_inv), row.names = row.names(matr_inv), style = "W")
moran.test(ia_counties$variable,listw_ttm)

Your random travel times are very unhelpful, so here they are replaced by 40 mph estimates; in addition, the use of Iowa is unnecessary when North Carolina ships with sf. It is still very unclear why you view the outcomes you observe from your data as inappropriate. Late posting, crosses your comment.

rsbivand commented 1 year ago

Combining NC, your data and my listw_ttm:

> moran.test(nc$variable,listw_ttm)

    Moran I test under randomisation

data:  nc$variable  
weights: listw_ttm    

Moran I statistic standard deviate = 4.0011, p-value = 3.153e-05
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.0430275158     -0.0101010101      0.0001763212 

Random and asymmetric times may be the problem, but in any case are irrelevant, are they not?
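For completeness, if asymmetry were suspected, one simple option (a sketch, not necessarily the right model for real travel times) is to symmetrize the matrix before inverting it:

ttm_sym <- (ttm + t(ttm))/2       # average the two directions
matr_inv_sym <- 1/ttm_sym
diag(matr_inv_sym) <- 0
listw_sym <- mat2listw(matr_inv_sym, style = "W")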

Ramon-88 commented 1 year ago

Thanks for your further explanation. Moran's I based on travel distance works fine in this example. I was just confused because I got very low results using my real travel time estimates, and as I couldn't find clear online tutorials on using an inverse distance matrix and travel times to create spatial weights, I started to think that I was missing something or doing it wrong. Now I know that I just need to think about the usefulness of a criterion based on travel times (also in light of what you said in your first comment) and perhaps consider others that are more appropriate for capturing spatial processes.

rsbivand commented 1 year ago

You could create a sparser neighbour object, then assign inverse travel times to the links through the glist= argument to nb2listw(). The all-to-all inverse times then emerge implicitly (by powering the matrix). The Smith reference is DOI: 10.1111/j.1538-4632.2009.00758.x.

Ramon-88 commented 1 year ago

I haven't considered it before but it sounds like a good solution, I'll try to learn more and make some tests to see how it works on my data. Thanks for helping and also for the reference. I'll definitely have a look at it.

rsbivand commented 1 year ago

Example using sparse general weights:

library(sf)
library(spdep)

# import file
fname <- system.file("gpkg/nc.gpkg", package="sf")
nc <- st_read(fname) 
#variable of interest
df <- data.frame(variable = c(-0.001714399, 0.256668055, -0.597164275, -0.566573086, -0.002722550, -0.002733780,
-0.009264913, -0.003621270, 0.371953423, -0.727381311, -0.732099064, -0.549915871,
-0.533963972, -0.787935963, -0.000292474, -0.677922405, -0.765987376, -0.284974522,
-0.764320275, -0.008769582, -0.005902430, -0.000826053, -0.694321497, -0.700700586,
-0.662195753, -0.883479707, -0.735420022, -0.133263436, -0.884552823, -0.004850738,
-0.710551003, -0.829068668, -0.698893260, -0.001848377, -0.000748199, -0.875747815,
-0.009846376, -0.801457592, -0.481369203, -0.712644719, -0.966750169, -0.000653413,
-0.000209122, -0.003707078, -0.002011228, -0.000226881, -0.645245881, -0.000830129,
-0.004492861, -0.004190275, 0.533890900, -0.752579438, -0.266293097, -0.086249522,
0.003788232, 0.000254342, -0.000165757, -0.735689142, -0.008107195, -0.001087987,
-0.003541964, 0.000788232, 0.259767521, 0.006927175, -0.059271680, -0.057473118,
-0.007432981, -0.068429217, -0.031814225, -0.530987616, 0.218492370, -0.097229482,
-0.009933875, -0.074403601, 0.016554854, -0.002876477, -0.042146313, 0.054138943,
0.028527471, -0.055526646, -0.255526646, -0.008721035, -0.264547228, -0.026571293,
-0.031856857, -0.071923958, 0.126946977, -0.245643891, 0.213177432, -0.000066652,
-0.007620088, -0.02177310, 0.045515998, 0.053253643, 0.062845708, -0.010811853,
0.066837906, -0.000366652, 0.000298832, 0.000298843))

#add variable to object
nc$variable <- df$variable
nb_q <- poly2nb(nc)
nb_q

#get centroids
centroids <- st_centroid(st_geometry(nc))
nb_dists <- nbdists(nb_q, centroids) # km; at 60 km/h, km double as minutes
summary(unlist(nb_dists))
glist <- lapply(nb_dists, function(x) 1/x)
summary(unlist(glist))
lwB <- nb2listw(nb_q, glist=glist, style="B") # binary general
lwW <-  nb2listw(nb_q, glist=glist, style="W") # row-standardized general
moran.test(nc$variable, lwB)
moran.test(nc$variable, lwW)
lwS <-  nb2listw(nb_q, glist=glist, style="S") # variance stabilizing general
moran.test(nc$variable, lwS)

Examining the powering effect: (I - rho W)^{-1} = I + rho W + (rho W)^2 + ...

library(spatialreg)
B <- as(lwB, "CsparseMatrix")
str(B)
str(B%*%B) # filling in and reducing in value (x component)
range(eigenw(lwB))
rho <- 0.05
image(Diagonal(100) + rho*B)
image(Diagonal(100) + rho*B + rho*rho*B%*%B)
image(Diagonal(100) + rho*B + rho*rho*B%*%B + rho*rho*rho*B%*%B%*%B)

Ramon-88 commented 1 year ago

I tried your code and got good insights into how it works on my data. Actually, I think that relying on sparse general weights makes even more sense in my case, as I'm working on a shapefile of about 1400 regular grid cells, so many connections are quite irrelevant; I could just use 1st- or 2nd-order queen contiguity. It was very helpful advice, many thanks.

As I would like to use my own travel time estimates, is there any way to pass them to the glist= argument of nb2listw()? Recalling the above example code:

ttm0 <- st_distance(centroids)
ttm1 <- units::set_units(ttm0, "miles")
ttm2 <- ttm1 / units::as_units(40, "miles/h")
ttm3 <- units::set_units(ttm2, "minutes")
matr_inv <- 1/ttm3
diag(matr_inv) <- 0

Is there any way to take the matr_inv object and pass it to the glist= argument?

lwW <- nb2listw(nb_q, glist=glist, style="W") # row-standardized general

I made some attempts but without success so far.

rsbivand commented 1 year ago

No, the object has to be of the same shape as nb; the nb list of integer vectors is the sparse representation of a graph, and glist is a list containing the general weight values for observations defined as neighbours in the nb object. nbdists() is provided to calculate distances only for observations defined as neighbours. It does, however, drop the units attributes: what you get are distances in the metric returned by st_distance(), or, for unprojected coordinates, in kilometres for backwards-compatibility reasons.

nb_dists <- nbdists(nb_q, centroids) # km
nb_dists_miles <- lapply(nb_dists, function(x) units::set_units(units::set_units(x, "km"), "miles"))
nb_dists_hours <- lapply(nb_dists_miles, function(x) x / units::as_units(40, "miles/h"))
nb_dists_minutes <- lapply(nb_dists_hours, function(x) units::set_units(x, "minutes"))
matr_inv <- lapply(nb_dists_minutes, function(x) 1/unclass(x))
lwW <- nb2listw(nb_q, glist=matr_inv, style="W")
lwW

should be OK.

Ramon-88 commented 1 year ago

Everything is clear now. I tried to use a sparse matrix to define the spatial weight structure, and it works fine, also in light of its methodological implications for detecting spatial processes. Once again, thanks for your explanations and all the help you gave me, I really appreciated it.

Ramon-88 commented 1 year ago

Dear Roger, this is probably not the right place to ask this kind of question, but since the issue I opened is still open, I will try, hoping not to bother you. Perhaps this might help users of the spdep package to use some of its functions with greater awareness. I am trying to find a satisfactory answer to the following question, but so far I have not succeeded: what is the range of local Moran values? Is there any minimum or maximum value (as the -1 and 1 often quoted for the global Moran index)? From a close look at the literature, scholars refer to its range just by saying that it can take positive or negative values, but nothing is said about the minimum and maximum values it can assume to help its interpretation. In addition, from what I understand, the choice of weights should further affect the values it can return and its range. Any insights on this question, or some references where I can find the answer, would be much appreciated; thanks in advance.

rsbivand commented 1 year ago

The range of global Moran's I is also unknown, and is not (-1, 1) in general. Please refer to ?localmoran.sad and ?localmoran.exact for references to articles:

Tiefelsdorf, M. 2002 The Saddlepoint approximation of Moran's I and local Moran's Ii reference distributions and their numerical evaluation. Geographical Analysis, 34, pp. 187-206.

Bivand RS, Müller W, Reder M (2009) Power calculations for global and local Moran’s I. Comput Stat Data Anal 53:2859–2872.
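To make the point about the range concrete, here is a minimal sketch (not part of the spdep API) following de Jong, Sprenger and van Veen (1984): the attainable range of global Moran's I is n/S0 times the extreme eigenvalues of the doubly-centred, symmetrized weights matrix. It assumes the lwW object from the earlier example:

W <- listw2mat(lwW)                       # dense matrix from the listw object
n <- nrow(W)
S0 <- sum(W)
M <- diag(n) - matrix(1/n, n, n)          # centring matrix I - 11'/n
Ws <- (W + t(W))/2                        # symmetrize for the quadratic form
ev <- eigen(M %*% Ws %*% M, symmetric = TRUE)$values
(n/S0) * range(ev)                        # attainable (min, max) of Moran's I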

Ramon-88 commented 1 year ago

Thanks for the references; I have read both papers, and what I understood, in short, is that the range of the local Moran statistic is determined by the minimum and maximum eigenvalues of the spatial weights matrix, so different coding schemes will give different ranges. What I'm still missing, no doubt due to my own limitations, is how to interpret the local Moran values. We know that a positive value of the local Moran statistic indicates that location i has neighbouring locations with similarly high or low values, while a negative value tells us that location i is surrounded by units with dissimilar values. But what is the difference between an areal unit reporting a value of, let's say, 7.009 and an areal unit with a local Moran value of 1.082? Both values tell us that the units have neighbouring locations with similarly high or low values, but 1.082 is not the same as 7.009. Does the higher value tell us something about the strength of the relationship with the other units, or is it somehow related to the variance of the variable under study?

rsbivand commented 1 year ago

You seem to mean inference. For variables with something like a symmetric distribution, you may use the analytical standard deviates, but do adjust for multiple comparisons. The analytical standard deviates compare the difference between the observed and expected local Ii, standardised by the square root of the variance of the statistic. You may also use conditional permutations, the saddlepoint approximation, or indeed exact standard deviate estimates.
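As a minimal sketch of two of these routes (assuming nc$variable and lwW from the earlier examples; the fdr adjustment is just one defensible choice):

# analytical standard deviates, with a multiple-comparison adjustment
lm_an <- localmoran(nc$variable, lwW)
p_adj <- p.adjust(lm_an[, 5], method = "fdr")  # fifth column holds the p-values
# conditional permutation alternative
lm_perm <- localmoran_perm(nc$variable, lwW, nsim = 999)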

Ramon-88 commented 1 year ago

Dear Roger, thanks for the explanation and the reference you sent me, I had a chance to look into them a few days ago, and they were really helpful in clarifying the inference process of Moran's local statistics. I have finally come to the conclusion of the analysis, but there is something that does not make sense to me and I would like to kindly ask for your opinion.

After running the local Moran on my dataset with conditional = TRUE and using a matrix of inverse distances with specific thresholds as weights, I tried mapping the results, following the procedure you suggested in this tutorial https://rsbivand.github.io/ECS530_h21/ECS530_211117.html, but looking at the map I noticed some unusual results. Basically, there are grid cells in high-low clusters surrounded by cells in high-high clusters and, vice versa, cells in low-high clusters surrounded by cells in low-low clusters. I was wondering whether such results make sense, as one usually expects high-high clusters to be surrounded (if anything) by low-high clusters and vice versa; but this is not what happens with my data. In that case, would a high-low cell surrounded by high-high cells (referring to the over-representation of a social group) indicate an extreme case of over-representation of that social group, and vice versa with low-high cells?

I attach a screenshot of the map; I know it is not good practice, but I really can't figure out how to reproduce these results with example code. [screenshot: map of local Moran cluster categories]

As always, thanks in advance for any feedbacks or suggestions.

rsbivand commented 1 year ago

From my point of view, please never try to interpret maps of this kind. Without first removing the effects of missing variables, wrongly defined entities, and underlying global or regional spatial processes, any interpretation would be largely meaningless. That is, exploration is OK; interpretation, rather less so.
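A sketch of one way to act on this advice (BIR74 is only a stand-in covariate here, and lwW comes from the earlier examples): fit a model that absorbs trend or covariate effects first, then explore the local Moran of its residuals:

fit <- lm(variable ~ BIR74, data = nc)   # BIR74 is a placeholder; use substantive covariates
res_li <- localmoran(residuals(fit), lwW)

The localmoran.sad and localmoran.exact functions mentioned above are designed to work from a fitted model object directly; see their help pages.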