juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Spatial resampling for #TidyTuesday and the #30DayMapChallenge | Julia Silge #56


utterances-bot commented 2 years ago

Spatial resampling for #TidyTuesday and the #30DayMapChallenge | Julia Silge

Use spatial resampling to more accurately estimate model performance for geographic data.

https://juliasilge.com/blog/map-challenge/

michaelgaunt404 commented 2 years ago

Great video! I've been waiting for one that specifically deals with spatial data.

Quick question: a lot of your past blog posts deal with datasets that have spatial components to them (the NYC Airbnb prices, the Chicago traffic crashes, etc.). Would you say that the resampling methodology you applied in those analyses was incorrect?

juliasilge commented 2 years ago

@michaelgaunt404 Real-world datasets like those have a lot of structure in them, for sure, including spatial structure. Both of the examples you mention use latitude and longitude directly as predictors, which is different from the situation in this blog post, where we are using data with geographic characteristics (but not the literal location). But as you expected, yep, I just reran the NYC Airbnb analysis and the RMSE is a bit higher when using spatial CV with the same k: ~0.48 on the scale I used in that other blog post.
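For anyone who wants to try that comparison themselves, here is a rough sketch (using the spatialsample API from around the time of this post, and assuming an `nyc_airbnb` data frame with `latitude`, `longitude`, and a `log_price` outcome; those names are illustrative, not from the original analysis):

```r
library(tidymodels)
library(spatialsample)

set.seed(123)
# Same data, same model: two different resampling schemes
regular_folds <- vfold_cv(nyc_airbnb, v = 10)
spatial_folds <- spatial_clustering_cv(nyc_airbnb,
                                       coords = c("latitude", "longitude"),
                                       v = 10)

wf <- workflow() %>%
  add_formula(log_price ~ latitude + longitude) %>%
  add_model(linear_reg())

# Compare the RMSE estimates; spatial CV typically looks a bit
# worse, i.e. it gives a more honest estimate for spatial data:
collect_metrics(fit_resamples(wf, regular_folds))
collect_metrics(fit_resamples(wf, spatial_folds))
```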

Jeffrothschild commented 2 years ago

Thanks for another awesome video! Not sure if this is the place to make 'requests' for future videos, but I'd be very interested to see you show feature selection performed as part of the resampling process (which I was reading about here https://bookdown.org/max/FES/selection-overfitting.html). Thanks!

cornmihkz commented 2 years ago

You're amazing, ma'am. I've been following your channel for a while. Thanks for sharing your skills; I'm inspired to learn more from you. I'm from the Philippines, by the way, where not many people are interested in data analysis. Thanks to you, R and tidy tools are easy! God bless you.

albhasan commented 2 years ago

I've been following your videos for a while and I'm happy to see you're talking about spatial data issues. I just want to draw your attention to this paper, in which they challenge the use of spatial cross-validation to evaluate map accuracy.

Spatial cross-validation is not the right way to evaluate map accuracy https://doi.org/10.1016/j.ecolmodel.2021.109692

Thanks for the great content you're creating!

data-datum commented 2 years ago

Thank you so much for this tutorial Julia!

maxlavoie commented 2 years ago

Hi Julia. Love your blogs and videos! I have a question, just as an example: let's say I run an xgboost model. I tune_grid() with spatial_clustering_cv as above, and then select the best model. But then how would I use last_fit(), which (I think) usually needs a split dataset, one that does not take lon and lat into account? How would you do this? Hope that's clear. Thanks! Martin

juliasilge commented 2 years ago

@maxlavoie I really like the writing that the mlr3 folks have done on this topic. You can read here, and also in the book Geocomputation with R. The main argument is that something like spatial resampling will give you a better estimate of performance than evaluating on a randomly held-out test set.

This does mean that you probably won't want to use last_fit() to fit one time to the whole training set and evaluate one time on the testing set (because we expect that to be too optimistic). Instead you would use your spatial resampling results to estimate your performance and just plain old fit() your best (tuned) model to your training set. You can read about using fit() on workflows.
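To make that concrete, a minimal sketch of that workflow could look like the following (placeholder names throughout: `wf` is a workflow with tunable parameters and `train_data` is the training set; adapt to your own objects):

```r
library(tidymodels)
library(spatialsample)

set.seed(123)
folds <- spatial_clustering_cv(train_data, coords = c("x", "y"), v = 5)

# Tune on the spatial resamples, then finalize with the best parameters:
tuned    <- tune_grid(wf, resamples = folds, grid = 10)
final_wf <- finalize_workflow(wf, select_best(tuned, metric = "rmse"))

# The performance estimate comes from the spatial resamples, not last_fit():
collect_metrics(fit_resamples(final_wf, folds))

# The deployable model comes from plain old fit() on the training set:
final_fit <- fit(final_wf, data = train_data)
```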

goodyonsen commented 2 years ago

Hello Julia,

I’m reaching out because I encountered an interesting problem.

I partitioned the data without a problem via:

good_folds <- spatial_clustering_cv(landslides, coords = c("x", "y"), v = 5)

and:

bad_folds <- vfold_cv(landslides, v = 5, strata = lslpts)

The plotting function itself also ran without a problem.

But when I ran these two lines below (separately or together):

walk(good_folds$splits, plot_splits)
walk(bad_folds$splits, plot_splits)

R brings up one of my older plots, which belongs to a totally different project that has nothing to do with machine learning or clustering, exactly the specified number of times (5 + 5 = 10).

The only thing the split plots and my plot have in common is the word “Analysis”: I titled my plot months ago “Weekday Analysis”.

So I,

  1. changed your mutate() and label to something other than “Analysis”
  2. deleted everything in the environment pane regarding my earlier project
  3. restarted R session and tried all over again

Still no luck.

I added a screenshot of the outputs at the link below, so you can get a sense of what I mean: https://ibb.co/0syKPys

One last thing: I receive a warning when loading your library(silgelib) saying it’s not available for my version of R, which is "R version 4.1.0 (2021-05-18)".

My platform is "x86_64-apple-darwin17.0".

But again, all the other functions and lines work properly except purrr’s walk().

I like your tutorials very much and want to keep following along with the rest of the video to the modeling phase, so I’d appreciate it if you could spare some time to check.

Thank you,

juliasilge commented 2 years ago

Oh my gosh @goodyonsen that sounds BANANAS 🍌

I have never seen this before, but it sounds like something to do with ggplot2 or walk()? If you can reproduce this with a small, simple example, I would recommend posting it on the purrr repo or on RStudio Community.

goodyonsen commented 2 years ago

@juliasilge

Of course!

I will copy my chunks as they are below:

knitr::opts_chunk$set(echo = TRUE, cache = TRUE, cache.lazy = FALSE, 
                      warning = FALSE, message = FALSE, dpi = 180,
                      fig.width = 4, fig.height = 3)

Loading libraries:

library(pacman)
p_load(tidyverse, silgelib, spData, 
       tidymodels, spatialsample, purrr)

# note that "spDataLarge" needs to be installed via:
# install.packages("spDataLarge", 
#                 repos = "https://nowosad.github.io/drat/", 
#                  type = "source")

library(spDataLarge) 

Loading the dataset:


data("lsl", package = "spDataLarge")

landslides <- as.tibble(lsl)
landslides

# ABOUT THE DATASET: 
# information about slope, elevation, waterflow, and spatial characteristics

MAKING EXPLORATORY MAPS:

MAP-1:


ggplot(landslides, aes(x, y)) +
  geom_point(aes(color = lslpts), alpha = 0.8) + 
  coord_fixed()

# OUTPUT: 
# We have landslides packed in the middle 

MAP-2: (we add elevation)

ggplot(landslides, aes(x, y)) +
  stat_summary_hex(aes(z = elev), alpha = 1, bins = 5) +
  geom_point(aes(color = lslpts)) + 
  coord_fixed()

# OUTPUT: 
# It turns out that elevation doesn't play a significant role in landslides:
# regardless of height, the landslides seem to be packed at mid-elevations.
# So maybe slope plays a bigger role.

MAP - 3: (change colors and add labels)


ggplot(landslides, aes(x, y)) +
  stat_summary_hex(aes(z = elev), alpha = 0.7, bins = 10) +
  geom_point(aes(color = lslpts)) + 
  coord_fixed() + 
  scale_fill_viridis_c() + # scale-filling for continuous data 
  scale_color_manual(values = c("red", "midnightblue")) +
  labs(fill = "Elevation", color = "Landslide")

Create spatial resamples:


# IMPORTANT NOTE: 
  # Points that are "close" to each other tend to be "similar" to each other (spatial autocorrelation).
  # This needs to be taken into account when resampling.
  # That's why library(spatialsample) is needed:
  # spatial_clustering_cv() uses k-means on the coordinates to find clusters of points that belong together.

# library(tidymodels)
# library(spatialsample)  - package for spatial resampling

# We make 2 sets of resamples: 

# 1) k-means:
set.seed(123)
good_folds <- spatial_clustering_cv(landslides, 
                                    coords = c("x", "y"), 
                                    v = 5) 
good_folds

# NOTES: 
# spatial_clustering_cv() is designed for spatial cross-validation.
# It splits the data into a defined number of groups (v = 5) using k-means clustering.
# Each fold can contain a different number of observations:

# 2) traditional: 
set.seed(234)
bad_folds <- vfold_cv(landslides, v = 5, strata = lslpts)
bad_folds

# NOTES: 
# We used regular cross-validation, which splits the data into folds of (roughly) equal size.

PLOTTING ALL THE SPLITS WE CREATED:

Writing a function to plot each split:


plot_splits <- function(split) {
  # binding each split's(fold's) analysis set:
  p <- bind_rows(
    analysis(split) %>% 
     # we call that "Analysis" and add as a column:
      mutate(analysis = "Analysis"),
    # we bind these Analysis values together with the assessment set of each split:
    assessment(split) %>% 
      mutate(analysis = "Assessment")
  ) %>% 
    # we pipe all these into a plot:
    ggplot(aes(x, y, color = analysis)) +
    geom_point(size = 1.5, alpha = 0.8) + 
    coord_fixed() +
    labs(color = NULL)
  print(p)
}

Using the function:

# library(purrr)
# we walk through all the splits we have, and plot all of them:

# 1) good_folds:
walk(good_folds$splits, plot_splits)

# 2) bad_folds:
walk(bad_folds$splits, plot_splits)

Thanks again !

juliasilge commented 2 years ago

@goodyonsen If you can create a small reprex demonstrating this problem, I recommend posting it on RStudio Community or the purrr repo, like I linked above. That will be a better way to get some help with this (non-tidymodels) problem.

goodyonsen commented 2 years ago

@juliasilge OK, I never used reprex() before. I tried to do my best; hope it worked. If not, please let me know. By the way, the issue with walk() seems to be gone now. I don’t understand how, but I finally got the graphs back. I didn’t even have to clean the global environment: installing reprex restarted the session, and then I saw all the plots as they are supposed to be.

library(pacman)
p_load(tidyverse, silgelib, spData, tidymodels, spatialsample, purrr)

> Warning: package 'silgelib' is not available for this version of R

>

> A version of this package for your version of R might be available elsewhere,

> see the ideas at

> https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

> Warning in p_install(package, character.only = TRUE, ...):

> Warning in library(package, lib.loc = lib.loc, character.only = TRUE,

> logical.return = TRUE, : there is no package called 'silgelib'

> Warning in p_load(tidyverse, silgelib, spData, tidymodels, spatialsample, : Failed to install/load:

> silgelib

library(spDataLarge)

data("lsl", package = "spDataLarge")

landslides <- as.tibble(lsl)

> Warning: as.tibble() was deprecated in tibble 2.0.0.

> Please use as_tibble() instead.

> The signature and semantics have changed, see ?as_tibble.

> This warning is displayed once every 8 hours.

> Call lifecycle::last_lifecycle_warnings() to see where this warning was generated.

landslides

> # A tibble: 350 × 8

> x y lslpts slope cplan cprof elev log10_carea

>

> 1 713888. 9558537. FALSE 33.8 0.0232 0.00319 2423. 2.78

> 2 712788. 9558917. FALSE 39.4 -0.0386 -0.0172 2052. 4.15

> 3 713408. 9560307. FALSE 37.5 -0.0133 0.00967 1958. 3.64

> 4 714888. 9560237. FALSE 31.5 0.0409 0.00589 1969. 2.27

> 5 715248. 9557117. FALSE 44.1 0.00969 0.00515 3008. 3.00

> 6 714928. 9560777. FALSE 29.9 -0.00905 -0.00574 1737. 3.17

> 7 714288. 9558367. FALSE 31.6 0.0556 0.0218 2584. 2.25

> 8 714148. 9558467. FALSE 53.4 0.00573 0.00102 2522. 2.58

> 9 713718. 9560657. FALSE 32.6 0.0240 -0.0169 1929. 2.84

> 10 713408. 9560307. FALSE 37.5 -0.0133 0.00967 1958. 3.64

> # … with 340 more rows

ggplot(landslides, aes(x, y)) +
  geom_point(aes(color = lslpts), alpha = 0.8) +
  coord_fixed()

ggplot(landslides, aes(x, y)) +
  stat_summary_hex(aes(z = elev), alpha = 1, bins = 5) +
  geom_point(aes(color = lslpts)) +
  coord_fixed()

ggplot(landslides, aes(x, y)) +
  stat_summary_hex(aes(z = elev), alpha = 0.7, bins = 10) +
  geom_point(aes(color = lslpts)) +
  coord_fixed() +
  scale_fill_viridis_c() + # scale-filling for continuous data
  scale_color_manual(values = c("red", "midnightblue")) +
  labs(fill = "Elevation", color = "Landslide")

set.seed(123)

good_folds <- spatial_clustering_cv(landslides, coords = c("x", "y"), v = 5)
good_folds

> # 5-fold spatial cross-validation

> # A tibble: 5 × 2

> splits id

>

> 1 <split [283/67]> Fold1

> 2 <split [310/40]> Fold2

> 3 <split [299/51]> Fold3

> 4 <split [242/108]> Fold4

> 5 <split [266/84]> Fold5

set.seed(234)
bad_folds <- vfold_cv(landslides, v = 5, strata = lslpts)
bad_folds

> # 5-fold cross-validation using stratification

> # A tibble: 5 × 2

> splits id

>

> 1 <split [280/70]> Fold1

> 2 <split [280/70]> Fold2

> 3 <split [280/70]> Fold3

> 4 <split [280/70]> Fold4

> 5 <split [280/70]> Fold5

plot_splits <- function(split) {
  # binding each split's (fold's) analysis set:
  p <- bind_rows(
    analysis(split) %>%
      # we call that "Analysis" and add it as a column:
      mutate(analysis = "Analysis"),
    # we bind these Analysis rows together with the assessment set of each split:
    assessment(split) %>%
      mutate(analysis = "Assessment")
  ) %>%
    # we pipe all of this into a plot:
    ggplot(aes(x, y, color = analysis)) +
    geom_point(size = 1.5, alpha = 0.8) +
    coord_fixed() +
    labs(color = NULL)
  print(p)
}

walk(good_folds$splits, plot_splits)

mikemahoney218 commented 1 year ago

Hello from 2023!

As of rsample 1.1.1 and spatialsample 0.3.0, the spatial_clustering_cv() function only handles sf objects, with data.frame and tibbles being handled by rsample::clustering_cv(). If you're trying to follow this tutorial, check out the linked comment on GitHub to see two different ways to perform clustered cross-validation with the lsl data: https://github.com/tidymodels/spatialsample/issues/129#issuecomment-1397604726
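As a sketch of the sf-based approach (check the linked comment for the authoritative version; I believe the lsl coordinates are projected in UTM zone 17S, EPSG:32717, but verify that against the spDataLarge documentation):

```r
library(sf)
library(spatialsample)

data("lsl", package = "spDataLarge")

# Promote the plain data frame to an sf object with point geometry;
# the x/y columns are projected coordinates, declared via the CRS:
landslides_sf <- st_as_sf(lsl, coords = c("x", "y"), crs = 32717)

# With an sf object, spatial_clustering_cv() no longer needs `coords`:
set.seed(123)
good_folds <- spatial_clustering_cv(landslides_sf, v = 5)
good_folds
```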

juliasilge commented 1 year ago

Thank you so much @mikemahoney218! 🙌

edperalt commented 1 year ago

Julia, as usual, thanks for the amazing job. I have the following issue: we work with data from wells, and in one specific scenario we might have 500 samples per well but only, let's say, 5 wells. I'd like to create many folds from my training set.

With the spatial splitting, I only get a maximum of one fold per location (i.e. well), and if I request more (via the "v =" argument) it complains that there will be some identical folds.

Any hints?

mikemahoney218 commented 1 year ago

@edperalt not Julia, but if you're looking to do a 90/10 initial split followed by a spatial CV approach, I'd use initial_split() followed by spatial_leave_location_out_cv() (which would also let you use inclusion radii/exclusion buffers, if you want) or group_vfold_cv() (which doesn't do radii/buffers, but otherwise works identically). Sorry if I misunderstood what you're looking to do :smile: