Sixto233 opened this issue 1 month ago
Please add the output of sessionInfo to show package versions and platform.
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Argentina.utf8 LC_CTYPE=Spanish_Argentina.utf8
[3] LC_MONETARY=Spanish_Argentina.utf8 LC_NUMERIC=C
[5] LC_TIME=Spanish_Argentina.utf8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mapview_2.11.2 corrplot_0.92 caret_6.0-94
[4] lattice_0.22-5 lmtest_0.9-40 zoo_1.8-12
[7] car_3.1-2 carData_3.0-5 DT_0.31
[10] spatstat_3.0-7 spatstat.linnet_3.1-3 spatstat.model_3.2-8
[13] rpart_4.1.19 spatstat.explore_3.2-5 nlme_3.1-162
[16] spatstat.random_3.2-2 spatstat.geom_3.2-8 spatstat.data_3.0-4
[19] sp_2.1-1 spdep_1.2-8 spatialreg_1.2-9
[22] Matrix_1.5-4.1 spData_2.3.0 rmarkdown_2.25
[25] osmdata_0.2.5 sf_1.0-14 lubridate_1.9.3
[28] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3
[31] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
[34] tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 jsonlite_1.8.7 rstudioapi_0.15.0
[4] wk_0.8.0 magrittr_2.0.3 TH.data_1.1-2
[7] spatstat.utils_3.0-4 vctrs_0.6.4 base64enc_0.1-3
[10] terra_1.7-55 htmltools_0.5.6.1 curl_5.1.0
[13] raster_3.6-26 s2_1.1.4 LearnBayes_2.15.1
[16] pROC_1.18.5 parallelly_1.36.0 KernSmooth_2.23-21
[19] htmlwidgets_1.6.2 httr2_0.2.3 plyr_1.8.9
[22] sandwich_3.0-2 uuid_1.1-1 lifecycle_1.0.3
[25] iterators_1.0.14 pkgconfig_2.0.3 R6_2.5.1
[28] fastmap_1.1.1 future_1.33.0 digest_0.6.33
[31] colorspace_2.1-0 tensor_1.5 leafem_0.2.3
[34] crosstalk_1.2.0 fansi_1.0.5 spatstat.sparse_3.0-3
[37] timechange_0.2.0 polyclip_1.10-6 abind_1.4-5
[40] mgcv_1.8-42 compiler_4.3.1 proxy_0.4-27
[43] withr_2.5.1 brew_1.0-8 DBI_1.1.3
[46] MASS_7.3-60 lava_1.7.3 rappdirs_0.3.3
[49] leaflet_2.2.0 classInt_0.4-10 ModelMetrics_1.2.2.2
[52] tools_4.3.1 units_0.8-4 future.apply_1.11.1
[55] nnet_7.3-19 goftest_1.2-3 satellite_1.0.4
[58] glue_1.6.2 grid_4.3.1 reshape2_1.4.4
[61] generics_0.1.3 recipes_1.0.9 leaflet.providers_2.0.0
[64] gtable_0.3.4 tzdb_0.4.0 class_7.3-22
[67] data.table_1.14.8 hms_1.1.3 xml2_1.3.5
[70] utf8_1.2.3 foreach_1.5.2 pillar_1.9.0
[73] splines_4.3.1 survival_3.5-5 deldir_1.0-9
[76] tidyselect_1.2.0 knitr_1.44 svglite_2.1.2
[79] stats4_4.3.1 xfun_0.40 expm_0.999-7
[82] leafpop_0.1.0 hardhat_1.3.1 timeDate_4032.109
[85] stringi_1.7.12 yaml_2.3.7 boot_1.3-28.1
[88] evaluate_0.22 codetools_0.2-19 cli_3.6.1
[91] systemfonts_1.0.5 jquerylib_0.1.4 munsell_0.5.0
[94] Rcpp_1.0.11 globals_0.16.2 coda_0.19-4
[97] png_0.1-8 parallel_4.3.1 ellipsis_0.3.2
[100] gower_1.0.1 listenv_0.9.0 mvtnorm_1.2-3
[103] ipred_0.9-14 scales_1.2.1 prodlim_2023.08.28
[106] e1071_1.7-13 rlang_1.1.1 multcomp_1.4-25
Please update spdep and spatialreg; neither is current. It may not help, but at least we'll be looking at the same software (R could also be upgraded, but that is less important).
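For example, from a fresh R session (assuming the default CRAN repository):

# Update both packages from CRAN
install.packages(c("spdep", "spatialreg"))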
Updates completed. They have not solved the issue.
library(sf)
library(spatialreg)

# Read the problematic 4K-observation subset
crd <- st_read("Cordoba_nonworking_subset.geojson")

base_formula <- "log(pm2) ~ property_type + log(surface_total) + log(dist_hospital) +
log(dist_school) + log(dist_university) +
log(dist_parque) + count_restaurants +
nearest_park_area + log(dist_alcaldia) + log(dist_alcaldia):before_after_covid +
before_after_covid:log(dist_parque) + before_after_covid:count_restaurants +
before_after_covid:nearest_park_area"
formula_obj <- as.formula(base_formula)

# Fit the aspatial model, then check which rows lm() silently dropped
model <- lm(formula = formula_obj, data = crd)
attr(model$model, "na.action")
> attr(model$model, "na.action")
2255
2255
attr(,"class")
[1] "omit"
> crd[2255,]
Simple feature collection with 1 feature and 39 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -64.24149 ymin: -31.45279 xmax: -64.24149 ymax: -31.45279
Geodetic CRS: WGS 84
ad_id operation place_l1 place_l2 place_l3 place_l4 place_l5 place_l6
2255 3545773 Venta Argentina Córdoba Córdoba <NA>
price currency_id price_usd rooms bedrooms bathrooms surface_total
2255 129900 USD 129900 6 2 2 125
surface_covered property_type created_on end_date pm2
2255 105 Casa <NA> 2019-08-05 20:45:02 1237.143
dist_location dist_bank dist_restaurant dist_hospital dist_school
2255 6218.902 2680.334 2662.335 2688.951 543.5828
dist_university dist_public_transport dist_shopping_center
2255 3521.737 6501.873 5550.379
dist_police_station dist_fire_station dist_parque dist_tren dist_rio
2255 1755.635 16673.52 418.2562 28612.72 3843.845
dist_alcaldia count_banks count_restaurants before_after_covid days_covid
2255 6218.902 0 0 <NA> NA
nearest_park_area geometry
2255 13889.35 POINT (-64.24149 -31.45279)
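A quick way to see which model variables carry the NA for rows like this one (a small helper sketch, using column names from the printout above):

# Count NAs per relevant column, geometry dropped first
colSums(is.na(st_drop_geometry(crd)[, c("pm2", "surface_total",
                                        "before_after_covid", "days_covid")]))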
In addition to the missing data (this row's before_after_covid and days_covid are NA, which is why lm() omits it), the complete analysis is undermined by the next steps:
library(spdep)
# The problematic neighbour construction: a 0-500 band with inverse-distance weights
d <- dnearneigh(crd, 0, 500)
dlist <- nbdists(d, st_coordinates(crd))
idlist <- lapply(dlist, function(x) 1/x)
lw <- nb2listw(d, glist = idlist, style = "W", zero.policy = TRUE)
while the points are in geographical coordinates. Do you really need distances in degrees? 500 degrees is more than once around the Earth. I suggest instead:
# Symmetric k-nearest-neighbour graph; spdep handles the geographical
# coordinates correctly here
knn12 <- knn2nb(knearneigh(crd, k = 12), sym = TRUE)
dlist <- nbdists(knn12, crd)
idlist <- lapply(dlist, function(x) 1/x)
lw <- nb2listw(knn12, glist = idlist, style = "W", zero.policy = TRUE)
in which k=12 is chosen to link the two major groups of points together, using the geographical coordinates properly throughout.
The first argument to sacsarlm could have been formula_obj rather than model, but the missing data are of course the same.
I wonder also whether you realise that the default method argument constructs a dense weights matrix to extract the eigenvalues, so that "Matrix" or "LU", which use sparse matrices, might have been preferable - see the comments in Bivand et al. (2013) and Bivand and Piras (2015). You could then readily analyse all 50K properties in one model.
I've been working on fitting a spatial autoregressive model (sacsarlm) to property price data in Córdoba, Argentina. While the model runs smoothly on smaller datasets, I've encountered an error when using larger samples, specifically a 4K-observation dataset. Here's a breakdown of what happened, what I suspect the issue is, and some potential fixes.
The Datasets
For context, I'm working with three versions of the Cordoba dataset:
Cordoba Full Dataset: the complete dataset with all observations.
Cordoba Non-Working Sample (4K observations): a subset that consistently throws an error.
Cordoba Working Sample (1.5K observations): a smaller subset that runs without issues.
Here are the versions: https://drive.google.com/drive/folders/1VoJ-Rbb9uwh68q-lHQajh7qubeD9olAH?usp=sharing
Str of Cordoba
Geometry:
Type: sfc_POINT
Description: each entry in the dataset is represented by a point geometry, capturing the precise location of properties in Córdoba.
CRS: WGS 84
The Error
The issue arises when I attempt to fit the sacsarlm model with the 4K-observation subset.
My libraries for this code:
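Presumably something like the following, judging from the calls used in this issue (an assumption, not the exact list):

library(tidyverse)   # data wrangling
library(sf)          # spatial vector data
library(spdep)       # neighbour lists and spatial weights
library(spatialreg)  # sacsarlm and other spatial models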
This is the spatial model: a linear regression is fitted first and then passed to sacsarlm, with list weights built from a 0-500 neighbour catchment area and the neighbours weighted by inverse distance. The error appears when trying to run:
sac <- sacsarlm(model, listw = lw, data = data, zero.policy = TRUE)
The error message I get is:
Error in subset.listw(listw, subset, zero.policy = zero.policy) :
  Not yet able to subset general weights lists
5. stop("Not yet able to subset general weights lists")
Cordoba Full is itself subset from a bigger dataset covering Argentina; one can easily make Cordoba_subset by replacing dataset = Argentina with Cordoba.
Key steps
Removed NAs: filters were added to remove rows with NA values in pm2, surface_total, and geometry, to prevent issues with the spatial model fitting.
Jittering: jitter was applied to the spatial data to prevent infinite values in the weight matrices that could cause problems during spatial regression (see the sketch below).
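A minimal sketch of those two steps, assuming dplyr filters and sf's st_jitter (the jitter amount is illustrative, not the value actually used):

library(sf)
library(dplyr)
# Drop rows with NA in the key model variables or with empty geometries
crd <- crd |>
  filter(!is.na(pm2), !is.na(surface_total), !st_is_empty(geometry))
# Nudge coincident points slightly so inverse distances stay finite
crd <- st_jitter(crd, amount = 1e-4)  # degrees under WGS 84; illustrative value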
So the following steps are taken for the formulas: first we analyse the location, and then we run the regression as:
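A sketch of that pipeline, matching the snippet quoted in the reply above (data = crd stands in for whichever subset is loaded):

library(spdep)
library(spatialreg)
# Neighbours within a 0-500 catchment, weighted by inverse distance
d <- dnearneigh(crd, 0, 500)
dlist <- nbdists(d, st_coordinates(crd))
idlist <- lapply(dlist, function(x) 1/x)
lw <- nb2listw(d, glist = idlist, style = "W", zero.policy = TRUE)
# Aspatial fit, then the SAC model; this is the call that errors on the 4K subset
model <- lm(formula_obj, data = crd)
sac <- sacsarlm(model, listw = lw, data = crd, zero.policy = TRUE)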
If anybody could help, I would be really thankful.