utterances-bot opened 2 years ago
great video! been waiting for one to specifically deal with spatial data.
quick question: a lot of your past blog posts deal with datasets that have spatial components to them - NYC Airbnb, Chicago traffic accidents, etc. - would you say that the resampling methodology you applied in those analyses was incorrect?
@michaelgaunt404 Real-world datasets like those have a lot of structure in them, for sure, including spatial structure. Both of the examples you mention directly use latitude and longitude as predictors, which is different from the situation here in this blog post, where we are using data with geographic characteristics (but not literally the location). But like you expected, yep, I just reran the NYC Airbnb analysis and the RMSE is a bit higher when using spatial CV with the same k, ~0.48 on the scale I used in that other blog post.
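In case it's useful to see the shape of such a comparison, here is a hedged sketch (not the exact code from that post): `wf` stands in for an already-built workflow and `nyc_train` for the Airbnb training data, both hypothetical names.

```r
library(tidymodels)
library(spatialsample)

## Hypothetical objects: `wf` is an existing workflow, `nyc_train` the training data
set.seed(123)
regular_folds <- vfold_cv(nyc_train, v = 10)
spatial_folds <- spatial_clustering_cv(nyc_train,
                                       coords = c("latitude", "longitude"),
                                       v = 10)

## Fit to both sets of resamples and compare the RMSE estimates;
## we expect the spatial estimate to be a bit more pessimistic
collect_metrics(fit_resamples(wf, regular_folds))
collect_metrics(fit_resamples(wf, spatial_folds))
```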
Thanks for another awesome video! Not sure if this is the place to make 'requests' for future videos, but I'd be very interested to see you show feature selection performed as part of the resampling process (which I was reading about here https://bookdown.org/max/FES/selection-overfitting.html). Thanks!
You're amazing, ma'am. I've been following your channel for a while. Thanks for sharing your great skill; I'm inspired to learn more from you. I'm from the Philippines, by the way, where there aren't many people interested in data analysis. Thanks to you, you make R and the tidyverse easy! God bless you.
I've been following your videos for a while and I'm happy to see you're talking about spatial data issues. I just want to draw your attention to this paper, in which they challenge the use of spatial cross-validation to evaluate map accuracy:
Spatial cross-validation is not the right way to evaluate map accuracy https://doi.org/10.1016/j.ecolmodel.2021.109692
Thanks for the great content you're creating!
Thank you so much for this tutorial Julia!
Hi Julia. Love your blogs and videos! I have a question, just as an example: let's say I run an xgboost model. I tune_grid() with spatial_clustering_cv() as above, and then select the best model. But then how would I use last_fit(), which (I think) usually needs a split dataset, one that does not take longitude and latitude into account? How would you do this? Hope it is clear. Thanks! Martin
@maxlavoie I really like the writing that the mlr3 folks have done on this topic. You can read here, and also in the book Geocomputation with R. The main argument is that something like spatial resampling will give you a better estimate of performance than if you evaluate on a randomly held-out test set. This does mean that you probably won't want to use last_fit() to fit one time to the whole training set and evaluate one time on the testing set (because we expect that to be too optimistic). Instead you would use your spatial resampling results to estimate your performance, and just plain old fit() your best (tuned) model to your training set. You can read about using fit() on workflows.
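For what it's worth, that workflow might be sketched roughly like this; `wf`, `folds`, `train_data`, and the rmse metric are all placeholders for whatever you actually tuned, so treat it as an outline rather than the definitive recipe.

```r
library(tidymodels)

## Hypothetical objects: `wf` is a tuneable workflow, `folds` spatial resamples
res <- tune_grid(wf, resamples = folds, grid = 10)

## Estimate performance from the spatial resampling results ...
collect_metrics(res)

## ... then finalize and fit() once to the training data, skipping last_fit()
best <- select_best(res, metric = "rmse")
final_fit <- fit(finalize_workflow(wf, best), data = train_data)
```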
Hello Julia,
I'm reaching out because I encountered an interesting problem.
I partitioned the data without a problem via:
good_folds <- spatial_clustering_cv(landslides, coords = c("x", "y"), v = 5)
and:
bad_folds <- vfold_cv(landslides, v = 5, strata = lslpts)
I also ran the plotting function without a problem. But when I ran these two lines below (separately or together):
walk(good_folds$splits, plot_splits)
walk(bad_folds$splits, plot_splits)
R brings up one of my older plots, belonging to a totally different project that has nothing to do with machine learning or clustering, exactly the specified number of times (5 + 5 = 10). The only thing the split plots and my old plot have in common is the word "Analysis": I titled my plot "Weekday Analysis" months ago.
I tried a few things, but still no luck.
I added a screenshot of the outputs at the link below, so you can get a sense of what I mean: https://ibb.co/0syKPys
One last thing: I get a warning when loading your library(silgelib) saying it's not available for my version of R, which is "R version 4.1.0 (2021-05-18)". My platform is "x86_64-apple-darwin17.0".
But again, all the other functions and lines work properly except purrr's walk(). I like your tutorials very much and want to keep following along with the rest of the video into the modeling phase, so I'd appreciate it if you could spare some time to check.
Thank you,
Oh my gosh @goodyonsen that sounds BANANAS 🍌
I have never seen this before, but it sounds like something to do with ggplot2 or walk()? If you can reproduce this with, say, a simple example, I would recommend posting on the purrr repo or on RStudio Community.
@juliasilge
Of course!
I will copy my chunks as they are below:
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, cache.lazy = FALSE,
warning = FALSE, message = FALSE, dpi = 180,
fig.width = 4, fig.height = 3)
library(pacman)
p_load(tidyverse, silgelib, spData,
tidymodels, spatialsample, purrr)
# note that "spDataLarge" needs to be installed via:
# install.packages("spDataLarge",
# repos = "https://nowosad.github.io/drat/",
# type = "source")
library(spDataLarge)
data("lsl", package = "spDataLarge")
landslides <- as_tibble(lsl)
landslides
# ABOUT THE DATASET:
# information about slope, elevation, waterflow, and spatial characteristics
ggplot(landslides, aes(x, y)) +
geom_point(aes(color = lslpts), alpha = 0.8) +
coord_fixed()
# OUTPUT:
# We have landslides packed in the middle
ggplot(landslides, aes(x, y)) +
stat_summary_hex(aes(z = elev), alpha = 1, bins = 5) +
geom_point(aes(color = lslpts)) +
coord_fixed()
# OUTPUT:
# It turns out that elevation doesn't play a significant role in landslides:
# regardless of height, the landslides seem to be packed at mid-elevations.
# So maybe slope plays a bigger role.
ggplot(landslides, aes(x, y)) +
stat_summary_hex(aes(z = elev), alpha = 0.7, bins = 10) +
geom_point(aes(color = lslpts)) +
coord_fixed() +
scale_fill_viridis_c() + # scale-filling for continuous data
scale_color_manual(values = c("red", "midnightblue")) +
labs(fill = "Elevation", color = "Landslide")
# IMPORTANT NOTE:
# Points that are "close" to each other tend to be "similar" to each other (spatial autocorrelation).
# This needs to be taken into account when resampling.
# That's why library(spatialsample) is needed:
# spatial_clustering_cv() finds clusters of coordinates that look like they belong together via k-means.
# library(tidymodels)
# library(spatialsample) - package for spatial resampling
# We make 2 sets of resamples:
# 1) k-means:
set.seed(123)
good_folds <- spatial_clustering_cv(landslides,
coords = c("x", "y"),
v = 5)
good_folds
# NOTES:
# spatial_clustering_cv() is designed for spatial cross-validation.
# It splits the data into the defined number of groups (v = 5) using k-means clustering.
# It puts a different number of observations in each fold:
# 2) traditional:
set.seed(234)
bad_folds <- vfold_cv(landslides, v = 5, strata = lslpts)
bad_folds
# NOTES:
# We used regular cross-validation, which splits the data into resamples (folds) of even size.
plot_splits <- function(split) {
# binding each split's (fold's) analysis set:
p <- bind_rows(
analysis(split) %>%
# we call that "Analysis" and add as a column:
mutate(analysis = "Analysis"),
# we bind these Analysis values together with the assessment set of each split:
assessment(split) %>%
mutate(analysis = "Assessment")
) %>%
# we pipe all these into a plot:
ggplot(aes(x, y, color = analysis)) +
geom_point(size = 1.5, alpha = 0.8) +
coord_fixed() +
labs(color = NULL)
print(p)
}
# library(purrr)
# we walk through all the splits we have, and plot all of them:
# 1) good_folds:
walk(good_folds$splits, plot_splits)
# 2) bad_folds:
walk(bad_folds$splits, plot_splits)
Thanks again!
@goodyonsen If you can create a small reprex demonstrating this problem, I recommend posting it on RStudio Community or the purrr repo, like I linked above. That will be a better way to get some help with this (non-tidymodels) problem.
@juliasilge OK, I had never used reprex() before. I tried my best; hope it worked, and if not, please let me know. By the way, the issue with walk() seems to have gone away now. I don't understand how, but I finally got the graphs back, and I didn't even have to clean the global environment. Installing reprex backed up and restarted the session, and after that I see all the plots as they are supposed to be.
library(pacman)
p_load(tidyverse, silgelib, spData, tidymodels, spatialsample, purrr)
library(spDataLarge)
data("lsl", package = "spDataLarge")
landslides <- as.tibble(lsl)
#> Warning: `as.tibble()` was deprecated in tibble 2.0.0.
#> Please use `as_tibble()` instead. See `?as_tibble`.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
landslides
ggplot(landslides, aes(x, y)) +
  geom_point(aes(color = lslpts), alpha = 0.8) +
  coord_fixed()

ggplot(landslides, aes(x, y)) +
  stat_summary_hex(aes(z = elev), alpha = 1, bins = 5) +
  geom_point(aes(color = lslpts)) +
  coord_fixed()

ggplot(landslides, aes(x, y)) +
  stat_summary_hex(aes(z = elev), alpha = 0.7, bins = 10) +
  geom_point(aes(color = lslpts)) +
  coord_fixed() +
  scale_fill_viridis_c() + # scale-filling for continuous data
  scale_color_manual(values = c("red", "midnightblue")) +
  labs(fill = "Elevation", color = "Landslide")
set.seed(123)
good_folds <- spatial_clustering_cv(landslides, coords = c("x", "y"), v = 5)
good_folds

set.seed(234)
bad_folds <- vfold_cv(landslides, v = 5, strata = lslpts)
bad_folds
plot_splits <- function(split) {
  p <- bind_rows(
    analysis(split) %>%
      mutate(analysis = "Analysis"),
    # we bind these Analysis values together with the assessment set of each split:
    assessment(split) %>%
      mutate(analysis = "Assessment")
  ) %>%
    ggplot(aes(x, y, color = analysis)) +
    geom_point(size = 1.5, alpha = 0.8) +
    coord_fixed() +
    labs(color = NULL)
  print(p)
}
walk(good_folds$splits, plot_splits)
Hello from 2023!

As of rsample 1.1.1 and spatialsample 0.3.0, the spatial_clustering_cv() function only handles sf objects, with data.frames and tibbles being handled by rsample::clustering_cv(). If you're trying to follow this tutorial, check out the linked comment on GitHub to see two different ways to perform clustered cross-validation with the lsl data:
https://github.com/tidymodels/spatialsample/issues/129#issuecomment-1397604726
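For anyone landing here, the sf-based route might look roughly like this; the CRS value below (EPSG 32717, UTM zone 17S) is an assumption about the lsl data's projection, so check the spDataLarge documentation before relying on it.

```r
library(sf)
library(spatialsample)

data("lsl", package = "spDataLarge")

## Convert the data frame to an sf object; EPSG 32717 is an assumed CRS
landslides_sf <- st_as_sf(lsl, coords = c("x", "y"), crs = 32717)

## Newer spatialsample takes the sf object directly, no coords argument
set.seed(123)
good_folds <- spatial_clustering_cv(landslides_sf, v = 5)
good_folds
```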
Thank you so much @mikemahoney218! 🙌
Julia, as usual, thanks for the amazing job. I have the following issue: we work with data from wells, and in one specific scenario we might have ~500 samples per well but only, let's say, 5 wells. I'd like to create many folds from my training set. With the spatial splitting, I only get a max of one fold per location (i.e. well), and if I request more (via the "v =" argument) it complains that there will be some identical folds. Any hints?
@edperalt Not Julia, but if you're looking to do a 90/10 initial split followed by a spatial CV approach, I'd use initial_split() followed by spatial_leave_location_out_cv() (which would also let you use inclusion radii/exclusion buffers, if you want) or group_vfold_cv() (which doesn't do radii/buffers, but otherwise works identically). Sorry if I misunderstood what you're looking to do :smile:
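A minimal sketch of that suggestion, assuming a hypothetical `wells` data frame with one row per sample and a `well_id` column identifying each well:

```r
library(tidymodels)
library(spatialsample)

## Hypothetical data: `wells` with ~500 samples per well, one `well_id` per well
set.seed(123)
split <- initial_split(wells, prop = 0.9)
well_train <- training(split)

## Leave-location-out CV: each resample holds out all samples from one well
folds <- spatial_leave_location_out_cv(well_train, group = well_id)

## Or the plain grouped version, without radii/buffers
folds2 <- group_vfold_cv(well_train, group = well_id)
```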
Spatial resampling for #TidyTuesday and the #30DayMapChallenge | Julia Silge
Use spatial resampling to more accurately estimate model performance for geographic data.
https://juliasilge.com/blog/map-challenge/