r-spatial / spatialreg

spatialreg: spatial models estimation and testing
https://r-spatial.github.io/spatialreg/

An error occurred while fitting the spatial model: Not yet able to subset general weights lists: SAC SAR LM and lw matrixes #54

Open Sixto233 opened 1 month ago

Sixto233 commented 1 month ago

I've been working on fitting a spatial autoregressive model (sacsarlm) to property price data in Córdoba, Argentina. While the model runs smoothly on smaller datasets, I've encountered an error when using larger samples, specifically a 4K-observation dataset.

Here’s a breakdown of what happened, what I suspect the issue is, and some potential fixes.

The Datasets

For context, I’m working with three versions of the Cordoba dataset:

- Cordoba Full Dataset: the complete dataset with all observations.
- Cordoba Non-Working Sample (4K observations): a subset that consistently throws the error.
- Cordoba Working Sample (1.5K observations): a smaller subset that runs without issues.

Here are the versions: https://drive.google.com/drive/folders/1VoJ-Rbb9uwh68q-lHQajh7qubeD9olAH?usp=sharing

str() of Cordoba

Geometry:

- Type: sfc_POINT
- Description: each entry in the dataset is represented by a point geometry, capturing the precise location of a property in Córdoba.
- CRS: WGS 84
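
For reference, the CRS can be confirmed directly (a quick sketch, run after loading the dataset):

library(sf)
# sketch: confirm the points are stored in WGS 84 (EPSG:4326), i.e. degrees
st_crs(Cordoba)$epsg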

The Error

The issue arises when I attempt to fit the sacsarlm model with the 4K-observation subset.

The libraries loaded for this code:

library("tidyverse")     # For data manipulation and visualization
library("ggplot2")       # For advanced plotting
library("sf")            # For handling spatial data
library("spatialreg")    # For spatial regression models
library("spdep")         # For spatial weights and dependencies
library("car")           # For Variance Inflation Factor (VIF) and other diagnostics
library("lmtest")        # For diagnostic tests on linear models
library("mapview")       # For interactive mapping and visualization

This is the spatial model: it first fits a linear regression, then passes it to sacsarlm, building a weights list from a 0-500 neighbour catchment band and weighting the neighbours by inverse distance. The error appears when calling sac <- sacsarlm(model, listw = lw, data = data, zero.policy = TRUE).

fit_spatial_model <- function(data, has_river = FALSE, has_coastline = FALSE) {
    # Initialize the base formula with the main predictors for the model.
    base_formula <- "log(pm2) ~ property_type + log(surface_total) + log(dist_hospital) + 
    log(dist_school) + log(dist_university) + 
    log(dist_parque) + count_restaurants + 
    nearest_park_area + log(dist_alcaldia) + log(dist_alcaldia):before_after_covid + 
    before_after_covid:log(dist_parque) + before_after_covid:count_restaurants + 
    before_after_covid:nearest_park_area"

    # If the city has a river, add the distance to the river as a predictor in the formula.
    if (has_river) {
        base_formula <- paste(base_formula, "+ log(dist_rio)")
    }

    # If the city has a coastline, add the distance to the coast as a predictor in the formula.
    if (has_coastline) {
        base_formula <- paste(base_formula, "+ log(dist_costa)")
    }

    # Convert the base formula (string) to an actual formula object in R.
    formula_obj <- as.formula(base_formula)

    # Fit a linear model using the specified formula and dataset.
    model <- lm(formula = formula_obj, data = data)

    # Print the summary of the linear model, which includes coefficients, standard errors, etc.
    print(summary(model))

    # Calculate Variance Inflation Factor (VIF) values to check for multicollinearity among predictors.
    vif_values <- vif(model)

    # Print the VIF values to identify potential multicollinearity issues.
    print(vif_values)

    # Extract the coordinates from the spatial data's geometry column.
    cord <- st_coordinates(data$geometry)

    # Create a neighborhood structure (spatial neighbors) for points within 500 units of distance.
    d <- dnearneigh(cord, 0, 500)

    # Calculate distances between neighboring points.
    dlist <- nbdists(d, coordinates(cord))

    # Create inverse distance weights for each neighbor relationship.
    idlist <- lapply(dlist, function(x) 1/x)

    # Convert the neighborhood structure and weights into a spatial weights list object.
    lw <- nb2listw(d, glist=idlist, style="W", zero.policy = TRUE)

    # Fit a spatial autoregressive model (SAR) using the linear model and the spatial weights.
    sac <- sacsarlm(model, listw = lw, data = data, zero.policy = TRUE)

    # Generate a summary of the SAR model, including the Nagelkerke pseudo R-squared.
    sac_summary <- summary(sac, Nagelkerke = TRUE)

    # Print the summary of the SAR model to review model performance and diagnostics.
    print(sac_summary)

    # Calculate and print the impacts (direct, indirect, total) of the SAR model.
    impacts <- impacts(sac, listw = lw, zero.policy = TRUE)
    print(impacts)

    # Return a list containing summaries of the linear model, SAR model, and impacts.
    return(list(model_summary = summary(model), sac_summary = sac_summary, impacts = impacts))
}

The error message I get is:

Error in subset.listw(listw, subset, zero.policy = zero.policy) :
  Not yet able to subset general weights lists

Traceback:

5. stop("Not yet able to subset general weights lists")
4. subset.listw(listw, subset, zero.policy = zero.policy)
3. subset(listw, subset, zero.policy = zero.policy)
2. sacsarlm(model, listw = lw, data = data, zero.policy = TRUE)
1. fit_spatial_model(Cordoba, has_river = TRUE)
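
A quick diagnostic for why subset.listw is invoked at all is to compare the number of observations lm() actually used with the number of points in the weights list (a sketch using the objects from the function above):

# sketch: if lm() silently dropped NA rows, these two counts will differ,
# and sacsarlm() will then try to subset the general weights list
nrow(model$model)               # observations used by lm()
length(lw$neighbours)           # observations in the spatial weights list
attr(model$model, "na.action")  # rows dropped by lm(), if any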

Cordoba Full is itself subset from a bigger Argentina-wide dataset in the following way.

analyze_location <- function(lat, lon, dataset, distance_threshold = 55000, sample_size = 4000, jitter_amount = 2 / 111320, pm2_min = 1, pm2_max = 8000, Coastline1 = FALSE) {

  # Create Central Business District (CBD) point
  location_cbd <- st_as_sf(data.frame(lat = lat, lon = lon), coords = c("lon", "lat"), crs = 4326)

  # Filter dataset by distance from CBD
  location_data <- dataset %>%
    mutate(dist_location = as.numeric(st_distance(geometry, location_cbd))) %>%
    filter(dist_location < distance_threshold)

  # Randomly sample the filtered data
  location_sample <- sample_n(location_data, sample_size)

  # Calculate bounding box with buffer for the sampled data
  location_sample_bb <- st_bbox(st_buffer(st_as_sf(st_as_sfc(st_bbox(location_sample))), 3000))

  # Calculate price per square meter (pm2)
  location_sample <- location_sample %>%
    mutate(price = as.numeric(price), 
           surface_covered = as.numeric(surface_covered),
           surface_total = as.numeric(surface_total)) %>%
    filter(property_type %in% c("Casa", "Lote", "Departamento")) %>%
    mutate(pm2 = case_when(
      property_type %in% c("Casa", "Departamento") ~ price / if_else(is.na(surface_covered) | surface_covered == 0, surface_total, surface_covered),
      property_type == "Lote" ~ price / surface_total,
      TRUE ~ NA_real_)) %>%
    distinct(geometry, pm2, .keep_all = TRUE)

  # Calculate distances and apply jitter
  location_sample <- calculadora(location_sample, alcaldia = location_cbd, Coastline = Coastline1) %>%
    st_jitter(amount = jitter_amount)

  # Filter by pm2 range and non-empty surface_total
  location_sample <- location_sample %>%
    filter(pm2 > pm2_min, pm2 < pm2_max, surface_total != "")

  return(location_sample)
}

# Example usage:
Cordoba <- analyze_location(lon = -64.188441, lat = -31.419912, dataset = Argentina, jitter_amount = 2 / 111320, sample_size = 1500)

One can easily make the Cordoba subsets by replacing dataset = Argentina with dataset = Cordoba:

Cordoba_1500 <- analyze_location(lon = -64.188441, lat = -31.419912, dataset = Cordoba, jitter_amount = 2 / 111320, sample_size = 1500)

Key steps:

- Removed NAs: filters were added to remove rows with NA values in pm2, surface_total, and geometry, to prevent issues when fitting the spatial model.
- Jittering: jitter was applied to the spatial data to prevent infinite values in the weight matrices that could cause problems during spatial regression.
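
As a sanity check on the jittering step, one can verify that no two points share exact coordinates, since a zero neighbour distance would give an infinite inverse-distance weight (a minimal sketch):

# sketch: duplicated coordinates produce a zero distance, and 1/0 = Inf
# would corrupt the inverse-distance weights
any(duplicated(st_coordinates(Cordoba)))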

So the pipeline for the formulas is as follows: first we analyze the location, then we run the regression:

Cordoba <- analyze_location(lon = -64.188441, lat = -31.419912, dataset = Cordoba_Full, jitter_amount = 2 / 111320, sample_size = 1500)
Cordoba_spatial_Model <- fit_spatial_model(Cordoba, has_river = TRUE, has_coastline = FALSE)

If anybody could help, I would be really thankful.

rsbivand commented 1 month ago

Please add the output of sessionInfo to show package versions and platform.

Sixto233 commented 1 month ago

> Please add the output of sessionInfo to show package versions and platform.

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Argentina.utf8  LC_CTYPE=Spanish_Argentina.utf8
[3] LC_MONETARY=Spanish_Argentina.utf8 LC_NUMERIC=C
[5] LC_TIME=Spanish_Argentina.utf8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] mapview_2.11.2 corrplot_0.92 caret_6.0-94
[4] lattice_0.22-5 lmtest_0.9-40 zoo_1.8-12
[7] car_3.1-2 carData_3.0-5 DT_0.31
[10] spatstat_3.0-7 spatstat.linnet_3.1-3 spatstat.model_3.2-8
[13] rpart_4.1.19 spatstat.explore_3.2-5 nlme_3.1-162
[16] spatstat.random_3.2-2 spatstat.geom_3.2-8 spatstat.data_3.0-4
[19] sp_2.1-1 spdep_1.2-8 spatialreg_1.2-9
[22] Matrix_1.5-4.1 spData_2.3.0 rmarkdown_2.25
[25] osmdata_0.2.5 sf_1.0-14 lubridate_1.9.3
[28] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3
[31] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
[34] tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0

loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 jsonlite_1.8.7 rstudioapi_0.15.0
[4] wk_0.8.0 magrittr_2.0.3 TH.data_1.1-2
[7] spatstat.utils_3.0-4 vctrs_0.6.4 base64enc_0.1-3
[10] terra_1.7-55 htmltools_0.5.6.1 curl_5.1.0
[13] raster_3.6-26 s2_1.1.4 LearnBayes_2.15.1
[16] pROC_1.18.5 parallelly_1.36.0 KernSmooth_2.23-21
[19] htmlwidgets_1.6.2 httr2_0.2.3 plyr_1.8.9
[22] sandwich_3.0-2 uuid_1.1-1 lifecycle_1.0.3
[25] iterators_1.0.14 pkgconfig_2.0.3 R6_2.5.1
[28] fastmap_1.1.1 future_1.33.0 digest_0.6.33
[31] colorspace_2.1-0 tensor_1.5 leafem_0.2.3
[34] crosstalk_1.2.0 fansi_1.0.5 spatstat.sparse_3.0-3
[37] timechange_0.2.0 polyclip_1.10-6 abind_1.4-5
[40] mgcv_1.8-42 compiler_4.3.1 proxy_0.4-27
[43] withr_2.5.1 brew_1.0-8 DBI_1.1.3
[46] MASS_7.3-60 lava_1.7.3 rappdirs_0.3.3
[49] leaflet_2.2.0 classInt_0.4-10 ModelMetrics_1.2.2.2
[52] tools_4.3.1 units_0.8-4 future.apply_1.11.1
[55] nnet_7.3-19 goftest_1.2-3 satellite_1.0.4
[58] glue_1.6.2 grid_4.3.1 reshape2_1.4.4
[61] generics_0.1.3 recipes_1.0.9 leaflet.providers_2.0.0
[64] gtable_0.3.4 tzdb_0.4.0 class_7.3-22
[67] data.table_1.14.8 hms_1.1.3 xml2_1.3.5
[70] utf8_1.2.3 foreach_1.5.2 pillar_1.9.0
[73] splines_4.3.1 survival_3.5-5 deldir_1.0-9
[76] tidyselect_1.2.0 knitr_1.44 svglite_2.1.2
[79] stats4_4.3.1 xfun_0.40 expm_0.999-7
[82] leafpop_0.1.0 hardhat_1.3.1 timeDate_4032.109
[85] stringi_1.7.12 yaml_2.3.7 boot_1.3-28.1
[88] evaluate_0.22 codetools_0.2-19 cli_3.6.1
[91] systemfonts_1.0.5 jquerylib_0.1.4 munsell_0.5.0
[94] Rcpp_1.0.11 globals_0.16.2 coda_0.19-4
[97] png_0.1-8 parallel_4.3.1 ellipsis_0.3.2
[100] gower_1.0.1 listenv_0.9.0 mvtnorm_1.2-3
[103] ipred_0.9-14 scales_1.2.1 prodlim_2023.08.28
[106] e1071_1.7-13 rlang_1.1.1 multcomp_1.4-25

rsbivand commented 1 month ago

Please update spdep and spatialreg; neither is current. It may not help, but at least we'll be looking at the same software (R could also be upgraded, but that is less important).
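
For example (a minimal sketch; these are the standard CRAN update commands, not anything specific to this issue):

# update both packages from CRAN, then confirm the installed versions
install.packages(c("spdep", "spatialreg"))
packageVersion("spdep")
packageVersion("spatialreg")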

Sixto233 commented 1 month ago

Updates completed. They have not solved the issue.

rsbivand commented 1 month ago
library(sf)
library(spatialreg)
crd <- st_read("Cordoba_nonworking_subset.geojson")
base_formula <- "log(pm2) ~ property_type + log(surface_total) + log(dist_hospital) + 
    log(dist_school) + log(dist_university) + 
    log(dist_parque) + count_restaurants + 
    nearest_park_area + log(dist_alcaldia) + log(dist_alcaldia):before_after_covid + 
    before_after_covid:log(dist_parque) + before_after_covid:count_restaurants + 
    before_after_covid:nearest_park_area"
formula_obj <- as.formula(base_formula)
model <- lm(formula = formula_obj, data = crd)
attr(model$model, "na.action")
2255 
2255 
attr(,"class")
[1] "omit"
> crd[2255,]
Simple feature collection with 1 feature and 39 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -64.24149 ymin: -31.45279 xmax: -64.24149 ymax: -31.45279
Geodetic CRS:  WGS 84
       ad_id operation  place_l1 place_l2 place_l3 place_l4 place_l5 place_l6
2255 3545773     Venta Argentina  Córdoba  Córdoba                       <NA>
      price currency_id price_usd rooms bedrooms bathrooms surface_total
2255 129900         USD    129900     6        2         2           125
     surface_covered property_type created_on            end_date      pm2
2255             105          Casa       <NA> 2019-08-05 20:45:02 1237.143
     dist_location dist_bank dist_restaurant dist_hospital dist_school
2255      6218.902  2680.334        2662.335      2688.951    543.5828
     dist_university dist_public_transport dist_shopping_center
2255        3521.737              6501.873             5550.379
     dist_police_station dist_fire_station dist_parque dist_tren dist_rio
2255            1755.635          16673.52    418.2562  28612.72 3843.845
     dist_alcaldia count_banks count_restaurants before_after_covid days_covid
2255      6218.902           0                 0               <NA>         NA
     nearest_park_area                    geometry
2255          13889.35 POINT (-64.24149 -31.45279)
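
One way to avoid the mismatch is to drop rows that are incomplete on the model variables before building the neighbours, so that the model frame and the weights list stay aligned (a sketch, reusing formula_obj from above):

# sketch: keep only rows complete on the variables the formula uses,
# so lm() drops nothing and sacsarlm() never needs to subset the listw
vars <- all.vars(formula_obj)
crd <- crd[complete.cases(st_drop_geometry(crd)[, vars]), ]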

In addition to the missing data, the complete analysis is undermined by the next steps:

library(spdep)
d <- dnearneigh(cord, 0, 500)
dlist <- nbdists(d, sp::coordinates(cord))
idlist <- lapply(dlist, function(x) 1/x)
lw <- nb2listw(d, glist=idlist, style="W", zero.policy = TRUE)

while the points are in geographical coordinates. Do you really need distances in degrees? 500 degrees is more than once around the Earth. I suggest rather:

knn12 <- knn2nb(knearneigh(crd, k=12), sym=TRUE)
dlist <- nbdists(knn12, crd)
idlist <- lapply(dlist, function(x) 1/x)
lw <- nb2listw(knn12, glist=idlist, style="W", zero.policy = TRUE)

in which k=12 is chosen to link the two major groups of points together, using the geographical coordinates properly throughout.
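
Alternatively, if a fixed 500 m catchment is really wanted, the points could first be projected to a metric CRS; a sketch, assuming EPSG:32720 (WGS 84 / UTM zone 20S) is appropriate for Córdoba:

# sketch: in a projected CRS the dnearneigh() bounds are metres,
# so 0-500 really means a 500 m band
crd_utm <- st_transform(crd, 32720)
d500 <- dnearneigh(st_coordinates(crd_utm), 0, 500)
dlist <- nbdists(d500, st_coordinates(crd_utm))
idlist <- lapply(dlist, function(x) 1/x)
lw <- nb2listw(d500, glist = idlist, style = "W", zero.policy = TRUE)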

The first argument to sacsarlm could have been formula_obj rather than model, but the missing data problem is of course the same.

I wonder also whether you realise that the default method argument constructs a dense weights matrix to extract the eigenvalues, so that "Matrix" or "LU" using sparse matrices might have been preferable - see the comments in Bivand et al. (2013) and Bivand and Piras (2015). You could then readily analyse all 50K properties in one model.
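
For example (a sketch of that suggestion, reusing formula_obj and the knn-based lw from above):

# sketch: pass the formula directly and use a sparse-matrix method,
# avoiding the dense eigenvalue computation of the default method = "eigen"
sac <- sacsarlm(formula_obj, data = crd, listw = lw,
                zero.policy = TRUE, method = "LU")
summary(sac, Nagelkerke = TRUE)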