Closed lindsayplatt closed 2 months ago
A basic random forest using `randomForest::randomForest()` (no spatial weighting or awareness) was run 3 different times with different seeds. All three times, the results showed the exact same top 6 predictors and the exact same bottom 3 predictors. The remaining 7 predictors in the middle shifted around slightly.
Top 6 predictors for each run:
Bottom 3 predictors for each run:
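The three-seed comparison above can be sketched as follows. The real runs used the project's 16-predictor site table, which is not shown here, so the data frame below is a synthetic stand-in (only a few of the predictor names are real):

```r
library(randomForest)

# Synthetic stand-in for the real 124-site predictor table
set.seed(1)
n <- 124
dat <- data.frame(
  medianFlow     = rnorm(n),
  transmissivity = rnorm(n),
  depthToWT      = rnorm(n),
  otherPred1     = rnorm(n),
  otherPred2     = rnorm(n)
)
dat$site_category_fact <- factor(ifelse(dat$medianFlow + rnorm(n) > 0, "low", "high"))

# Fit the same model under three different seeds and record the
# predictor order by mean decrease in Gini index
ranks <- sapply(c(71, 72, 73), function(seed) {
  set.seed(seed)
  rf <- randomForest(site_category_fact ~ ., data = dat)
  gini <- importance(rf)[, "MeanDecreaseGini"]
  names(sort(gini, decreasing = TRUE))
})
ranks  # one column per seed, most important predictor first
```

Comparing the columns of `ranks` is the same check as comparing the top/bottom predictor lists across the three real runs.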
Next, I tried applying spatial weighting to the random forest modeling by using `SpatialML::grf()`. I tried it both with and without the geo-weighting in order to compare the basic random forest use-case to the `randomForest::randomForest()` approach. Unfortunately, there is some issue with the package and I keep getting failures near the end of the function; however, it does return an importance vector before failing, so I have copied that output to use for this comparison.
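For reference, the `grf()` call took roughly this shape. This is a sketch, not the exact call: `site_attr_data` and `site_coords` are hypothetical stand-ins for the real predictor table and projected site coordinates, the bandwidth value is illustrative, and the argument names follow my reading of the package documentation:

```r
library(SpatialML)

# Sketch of a geographically weighted random forest call.
# site_attr_data: hypothetical data frame, one row per site
# site_coords: hypothetical two-column matrix of projected x/y coordinates
gw_rf <- grf(
  site_category_fact ~ medianFlow + transmissivity + depthToWT,
  dframe = site_attr_data,
  bw     = 20,          # adaptive bandwidth: 20 nearest sites
  kernel = "adaptive",
  coords = site_coords
)

# The global (unweighted) ranger fit and its importance vector appear to be
# produced before the local fitting step, which is where my runs failed
gw_rf$Global.Model$variable.importance
```

The `Global.Model` / local-model split here is what the caveat further down refers to: the importance vector I copied may come only from the global, unweighted fit.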
The new non-spatially weighted random forest output has the exact same order for the top 6 predictors as the 3 runs from the `randomForest::randomForest()` approach. It also has the same bottom predictors, but 14 & 15 are switched.
For the spatially aware random forests, the same top 6 predictors appear again and again. Two of the three tests have the second and third highest importance predictors (`transmissivity` and `depthToWT`) switched, but all others appear in exactly the same order. All of these spatially aware random forests have the exact same bottom 3 predictors, though the order of the bottom 3 differs slightly in each run; all have Gini indices within 0.36 of each other, so they are all very similar.
The final test was to randomly select 90% of the 124 sites from the full set of predictors and then re-run the random forest using `randomForest::randomForest()`. This produced slightly different results from the other tests so far. Each of the top 5 was always one of the original top 6 predictors.
- `medianFlow` was always the top predictor.
- `transmissivity` was always second or third.
- `depthToWT` was always in the top 5.
- `pctDeveloped` always made the top 4.
- `pctForested` bounced around from 7th to 5th to 6th.
- `gwRecharge` bounced around from 5th to 8th to 4th.
- `baseFlowInd` made it to 6th in two of the three runs, when `gwRecharge` and `pctForested` dropped down.

For me, this exercise was confirmation that spatial autocorrelation does not have a huge impact on the results and interpretation of the final random forest models, and I will move forward with only using `randomForest::randomForest()` without any spatial weighting.
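The 90% resampling test looked roughly like this; `dat` is a synthetic stand-in for the real 124-site predictor table:

```r
library(randomForest)

# Synthetic stand-in for the real site predictor table
set.seed(42)
n <- 124
dat <- data.frame(
  medianFlow     = rnorm(n),
  transmissivity = rnorm(n),
  depthToWT      = rnorm(n),
  pctDeveloped   = runif(n)
)
dat$site_category_fact <- factor(ifelse(dat$medianFlow + rnorm(n) > 0, "low", "high"))

# Keep a random 90% of the sites, then refit and re-rank the predictors
keep <- sample(n, size = floor(0.9 * n))
rf_sub <- randomForest(site_category_fact ~ ., data = dat[keep, ])
sort(importance(rf_sub)[, "MeanDecreaseGini"], decreasing = TRUE)
```

Repeating the `sample()` + refit step with different seeds gives the three subsampled runs compared above.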
One caveat to all of this is that I am realizing the spatial weighting may not have worked - it might be erroring before running the actual weighted model (I think the "global model" output is the unweighted fit and the "local model" output would be the spatially weighted one). If so, those runs are just more tests of the regular, unweighted random forest.
The 90% test may have been better, but I am not sure whether it actually randomized sites based on location well enough (green = first test, blue = second test, red = third test):
```r
# This code should go along with the code used to run these tests
library(targets)
library(dplyr)
library(ggplot2)
library(sf)
source("6_DefineCharacteristics/src/visualize_attribute_distributions.R")

sites_sf <- tar_read(p1_nwis_sc_sites_sf)

# Join each test's attribute table to the site geometry and keep only
# the sites that were sampled in that test
rf1_sites_sf <- sites_sf %>%
  left_join(site_attr_data_rf1) %>%
  filter(!is.na(site_category_fact))
rf2_sites_sf <- sites_sf %>%
  left_join(site_attr_data_rf2) %>%
  filter(!is.na(site_category_fact))
rf3_sites_sf <- sites_sf %>%
  left_join(site_attr_data_rf3) %>%
  filter(!is.na(site_category_fact))

# Map the three random site subsets over a CONUS state basemap
ggplot() +
  add_state_basemap(tar_read(p1_conus_state_cds)) +
  geom_sf(data = rf1_sites_sf, fill = 'green',
          alpha = 0.75, shape = 24, size = 2) +
  geom_sf(data = rf2_sites_sf, fill = 'blue',
          alpha = 0.75, shape = 24, size = 2) +
  geom_sf(data = rf3_sites_sf, fill = 'red',
          alpha = 0.75, shape = 24, size = 2) +
  scico::scale_fill_scico_d(begin = 0, end = 0.75) +
  theme_void()
```
Alrighty, I am trying the last step again but randomly sampling from sites in distinct "groups". Groups were created based on sites within 15 km of each other (see figure below). Then, I only used one site per group, randomly sampling which site was used when a group had more than one. It's not a perfect approach, because which sites fall within a certain distance changes depending on which site is being assessed, and the groups are based on exact matches. It should be an OK approximation, though.
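One way to make the grouping described above transitive is to treat "within 15 km" as an edge between sites and take connected components, so chains of nearby sites collapse into a single group. This is a sketch on synthetic points; `sites_sf` stands in for the real site layer and is assumed to be in a projected, meter-based CRS:

```r
library(sf)
library(igraph)

# Synthetic site points standing in for the real site layer
# (coordinates treated as meters)
set.seed(1)
sites_sf <- st_as_sf(
  data.frame(site_id = 1:20, x = runif(20, 0, 1e5), y = runif(20, 0, 1e5)),
  coords = c("x", "y")
)

# All pairs of sites within 15 km of each other (includes self-pairs)
nb <- st_is_within_distance(sites_sf, dist = 15000)

# Treat "within 15 km" as graph edges and take connected components
g <- graph_from_adj_list(nb, mode = "all")
sites_sf$group <- components(g)$membership

# Randomly keep one site per group
keep <- sapply(split(seq_len(nrow(sites_sf)), sites_sf$group),
               function(i) if (length(i) == 1) i else sample(i, 1))
sites_thin <- sites_sf[keep, ]
```

Subsampling `sites_thin` rather than all sites is the idea behind the 89-of-124 runs below.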
This produced slightly different results from the other tests so far. Each test was run with only 89 of the 124 sites. The top 4 was always the same top 4 as the original runs.
- `medianFlow` was always in the top 3.
- `transmissivity` was always 1st or 2nd.
- `depthToWT` was always 3rd or 4th.
- `pctDeveloped` always made the top 4.
- `pctForested` bounced around from 11th to 7th to 13th (not really a factor, as it was in the non-spatially-sampled runs).
- `gwRecharge` bounced around from 10th to 6th and 6th again (somewhat important).
- `pctOpenWater` was 5th for all three runs.
- `roadSaltPerKmSq` was always in the bottom 2 and `pctWetland` was always in the bottom 3.
- `pctAgriculture` bounced up to 9th but was otherwise low, at 12th and 13th.

I think this is a better test for spatial autocorrelation. A future improvement, if I include this in a paper, could be creating a grid across the states and then grouping sites that fall into the same grid cell. Though, I am not sure how much of an issue this is and whether I should continue to explore or just accept some of the bias. First, I should probably evaluate whether spatial autocorrelation exists at all: https://mgimond.github.io/Spatial/spatial-autocorrelation.html.
The point was raised that there may be some bias in the random forest model outputs due to spatial autocorrelation since some of the sites are located very near each other (see the map of sites and the DC area for example). I went through a short exercise to explore this and see if it might be impacting our results. Note that this exercise did not use perfectly optimized random forest models.
The tools: Two R packages were shared with me for running geographically weighted random forest models, `SpatialML` and `spatialRF`, both of which depend on the `ranger` package. I tested a spatially aware random forest using the `SpatialML` package, and I also tried randomly sampling sites before running `randomForest::randomForest()`. I did not use the `spatialRF` package because `spatialRF` is currently only set up for numeric response variables (regression) or, at most, binary categorical response variables. We have categorical response variables ranging from 2 - 4 categories, so that package would not be appropriate to use.

The metric: I am comparing all tests based on the importance rankings of all the variables, using the overall mean decrease in Gini index.
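Concretely, the comparison just turns each run's mean decrease in Gini into an ordered ranking, so runs can be compared by predictor order rather than by raw importance values. The importance numbers below are made up for illustration:

```r
# Made-up mean-decrease-in-Gini vectors for two runs
gini_run1 <- c(medianFlow = 9.1, transmissivity = 7.4, depthToWT = 6.8, pctWetland = 0.9)
gini_run2 <- c(medianFlow = 8.7, transmissivity = 7.9, depthToWT = 6.2, pctWetland = 1.1)

# Rank predictors by importance, most important first
rank_of <- function(gini) names(sort(gini, decreasing = TRUE))

cbind(run1 = rank_of(gini_run1), run2 = rank_of(gini_run2))
```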