Closed lindsayplatt closed 2 months ago
A basic random forest using `randomForest::randomForest()` (no spatial weighting or awareness) was run 3 different times with different seeds. All three times, the results showed the exact same top 6 predictors and the exact same bottom 3 predictors. The remaining 7 predictors in the middle shifted around slightly.
Top 6 predictors for each run:
Bottom 3 predictors for each run:
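The three-seed comparison above can be sketched as follows. The real runs used the project's 16-predictor site table, which is not shown here, so the data frame below is a synthetic stand-in (only a few of the predictor names are real):

```r
library(randomForest)

# Synthetic stand-in for the real 124-site predictor table
set.seed(1)
n <- 124
dat <- data.frame(
  medianFlow     = rnorm(n),
  transmissivity = rnorm(n),
  depthToWT      = rnorm(n),
  otherPred1     = rnorm(n),
  otherPred2     = rnorm(n)
)
dat$site_category_fact <- factor(ifelse(dat$medianFlow + rnorm(n) > 0, "low", "high"))

# Fit the same model under three different seeds and record the
# predictor order by mean decrease in Gini index
ranks <- sapply(c(71, 72, 73), function(seed) {
  set.seed(seed)
  rf <- randomForest(site_category_fact ~ ., data = dat)
  gini <- importance(rf)[, "MeanDecreaseGini"]
  names(sort(gini, decreasing = TRUE))
})
ranks  # one column per seed, most important predictor first
```

Comparing the columns of `ranks` is the same check as comparing the top/bottom predictor lists across the three real runs.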
Next, I tried applying spatial weighting to the random forest modeling by using `SpatialML::grf()`. I tried it both with and without the geo-weighting in order to compare the basic random forest use-case to the `randomForest::randomForest()` approach. Unfortunately, there is some issue with the package and I keep getting failures near the end of the function; however, it does return an importance vector before failing, so I have copied that output to use for this comparison.
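For reference, the `grf()` call took roughly this shape. This is a sketch, not the exact call: `site_attr_data` and `site_coords` are hypothetical stand-ins for the real predictor table and projected site coordinates, the bandwidth value is illustrative, and the argument names follow my reading of the package documentation:

```r
library(SpatialML)

# Sketch of a geographically weighted random forest call.
# site_attr_data: hypothetical data frame, one row per site
# site_coords: hypothetical two-column matrix of projected x/y coordinates
gw_rf <- grf(
  site_category_fact ~ medianFlow + transmissivity + depthToWT,
  dframe = site_attr_data,
  bw     = 20,          # adaptive bandwidth: 20 nearest sites
  kernel = "adaptive",
  coords = site_coords
)

# The global (unweighted) ranger fit and its importance vector appear to be
# produced before the local fitting step, which is where my runs failed
gw_rf$Global.Model$variable.importance
```

The `Global.Model` / local-model split here is what the caveat further down refers to: the importance vector I copied may come only from the global, unweighted fit.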
The new non-spatially weighted random forest output has the exact same order for the top 6 predictors as the 3 runs from the `randomForest::randomForest()` approach. It also has the same bottom predictors, but 14 & 15 are switched.
For the spatially aware random forests, the same top 6 predictors appear again and again. Two of the three tests have the second and third highest importance predictors (`transmissivity` and `depthToWT`) switched, but all others appear in exactly the same order. All of these spatially aware random forests have the exact same bottom 3 predictors, though the order of the bottom 3 differs slightly in each run; all have Gini indices within 0.36 of each other, so they are all very similar.
The final test was to randomly select 90% of the 124 sites from the full set of predictors and then re-run the random forest using `randomForest::randomForest()`. This produced slightly different results from the other tests so far. Each of the top 5 was always one of the original top 6 predictors.
- `medianFlow` was always the top predictor.
- `transmissivity` was always second or third.
- `depthToWT` was always in the top 5.
- `pctDeveloped` always made the top 4.
- `pctForested` bounced around from 7th to 5th to 6th.
- `gwRecharge` bounced around from 5th to 8th to 4th.
- `baseFlowInd` made it to 6th in two of the three runs, when `gwRecharge` and `pctForested` dropped down.

For me, this exercise was confirmation that spatial autocorrelation does not have a huge impact on the results and interpretation of the final random forest models, and I will move forward with only using `randomForest::randomForest()` without any spatial weighting.
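The 90% resampling test looked roughly like this; `dat` is a synthetic stand-in for the real 124-site predictor table:

```r
library(randomForest)

# Synthetic stand-in for the real site predictor table
set.seed(42)
n <- 124
dat <- data.frame(
  medianFlow     = rnorm(n),
  transmissivity = rnorm(n),
  depthToWT      = rnorm(n),
  pctDeveloped   = runif(n)
)
dat$site_category_fact <- factor(ifelse(dat$medianFlow + rnorm(n) > 0, "low", "high"))

# Keep a random 90% of the sites, then refit and re-rank the predictors
keep <- sample(n, size = floor(0.9 * n))
rf_sub <- randomForest(site_category_fact ~ ., data = dat[keep, ])
sort(importance(rf_sub)[, "MeanDecreaseGini"], decreasing = TRUE)
```

Repeating the `sample()` + refit step with different seeds gives the three subsampled runs compared above.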
One caveat to all of this is that I am realizing the spatial weighting may not have worked - it might be erroring before running the actual weighted model (I think the "global model" output is the unweighted fit and the "local model" output would be the spatially weighted one). If so, those runs are just more tests of the regular, unweighted random forest.
The 90% test may have been better, but I am not sure whether it actually randomized sites based on location well enough (green = first test, blue = second test, red = third test):
```r
# This code should go along with the code used to run these tests
library(targets)
library(dplyr)
library(ggplot2)
library(sf)
source("6_DefineCharacteristics/src/visualize_attribute_distributions.R")

sites_sf <- tar_read(p1_nwis_sc_sites_sf)

# Join each test's attribute table to the site geometry and keep only
# the sites that were sampled in that test
rf1_sites_sf <- sites_sf %>%
  left_join(site_attr_data_rf1) %>%
  filter(!is.na(site_category_fact))
rf2_sites_sf <- sites_sf %>%
  left_join(site_attr_data_rf2) %>%
  filter(!is.na(site_category_fact))
rf3_sites_sf <- sites_sf %>%
  left_join(site_attr_data_rf3) %>%
  filter(!is.na(site_category_fact))

# Map the three random site subsets over a CONUS state basemap
ggplot() +
  add_state_basemap(tar_read(p1_conus_state_cds)) +
  geom_sf(data = rf1_sites_sf, fill = 'green',
          alpha = 0.75, shape = 24, size = 2) +
  geom_sf(data = rf2_sites_sf, fill = 'blue',
          alpha = 0.75, shape = 24, size = 2) +
  geom_sf(data = rf3_sites_sf, fill = 'red',
          alpha = 0.75, shape = 24, size = 2) +
  scico::scale_fill_scico_d(begin = 0, end = 0.75) +
  theme_void()
```
Alrighty, I am trying the last step again but randomly sampling from sites in distinct "groups". Groups were created based on sites within 15 km of each other (see figure below). Then, I only used one site per group, randomly sampling which site was used when a group had more than one. It's not a perfect approach, because which sites fall within a certain distance changes depending on which site is being assessed, and the groups are based on exact matches. It should be an OK approximation, though.
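One way to make the grouping described above transitive is to treat "within 15 km" as an edge between sites and take connected components, so chains of nearby sites collapse into a single group. This is a sketch on synthetic points; `sites_sf` stands in for the real site layer and is assumed to be in a projected, meter-based CRS:

```r
library(sf)
library(igraph)

# Synthetic site points standing in for the real site layer
# (coordinates treated as meters)
set.seed(1)
sites_sf <- st_as_sf(
  data.frame(site_id = 1:20, x = runif(20, 0, 1e5), y = runif(20, 0, 1e5)),
  coords = c("x", "y")
)

# All pairs of sites within 15 km of each other (includes self-pairs)
nb <- st_is_within_distance(sites_sf, dist = 15000)

# Treat "within 15 km" as graph edges and take connected components
g <- graph_from_adj_list(nb, mode = "all")
sites_sf$group <- components(g)$membership

# Randomly keep one site per group
keep <- sapply(split(seq_len(nrow(sites_sf)), sites_sf$group),
               function(i) if (length(i) == 1) i else sample(i, 1))
sites_thin <- sites_sf[keep, ]
```

Subsampling `sites_thin` rather than all sites is the idea behind the 89-of-124 runs below.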
This produced slightly different results from the other tests so far. Each test was run with only 89 of the 124 sites. The top 4 was always the same top 4 as the original runs.
- `medianFlow` was always in the top 3.
- `transmissivity` was always 1st or 2nd.
- `depthToWT` was always 3rd or 4th.
- `pctDeveloped` always made the top 4.
- `pctForested` bounced around from 11th to 7th to 13th (not really a factor, as it was in the non-spatially-sampled runs).
- `gwRecharge` bounced around from 10th to 6th and 6th again (somewhat important).
- `pctOpenWater` was 5th for all three runs.
- `roadSaltPerKmSq` was always in the bottom 2 and `pctWetland` was always in the bottom 3.
- `pctAgriculture` bounced up to 9th but was otherwise low, at 12th and 13th.

I think this is a better test for spatial autocorrelation. A future improvement, if I include this in a paper, could be creating a grid across the states and then grouping sites that fall into the same grid cell. Though, I am not sure how much of an issue this is and whether I should continue to explore or just accept some of the bias. First, I should probably evaluate whether spatial autocorrelation exists at all: https://mgimond.github.io/Spatial/spatial-autocorrelation.html.
The point was raised that there may be some bias in the random forest model outputs due to spatial autocorrelation since some of the sites are located very near each other (see the map of sites and the DC area for example). I went through a short exercise to explore this and see if it might be impacting our results. Note that this exercise did not use perfectly optimized random forest models.
The tools: Two R packages were shared with me for running geographically weighted random forest models, `SpatialML` and `spatialRF`, both of which depend on the `ranger` package. I tested a spatially aware random forest using the `SpatialML` package, and I also tried randomly sampling sites before running `randomForest::randomForest()`. I did not use the `spatialRF` package because `spatialRF` is currently only set up for numeric response variables (regression) or, at most, binary categorical response variables. We have categorical response variables ranging from 2 - 4 categories, so that package would not be appropriate to use.

The metric: I am comparing all tests based on the importance rankings of all the variables, using the overall mean decrease in Gini index.
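Concretely, the comparison just turns each run's mean decrease in Gini into an ordered ranking, so runs can be compared by predictor order rather than by raw importance values. The importance numbers below are made up for illustration:

```r
# Made-up mean-decrease-in-Gini vectors for two runs
gini_run1 <- c(medianFlow = 9.1, transmissivity = 7.4, depthToWT = 6.8, pctWetland = 0.9)
gini_run2 <- c(medianFlow = 8.7, transmissivity = 7.9, depthToWT = 6.2, pctWetland = 1.1)

# Rank predictors by importance, most important first
rank_of <- function(gini) names(sort(gini, decreasing = TRUE))

cbind(run1 = rank_of(gini_run1), run2 = rank_of(gini_run2))
```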