tidymodels / spatialsample

Create and summarize spatial resampling objects 🗺
https://spatialsample.tidymodels.org

`spatial_clustering_cv()` does not seem to work with a tibble #129

Closed RaymondBalise closed 1 year ago

RaymondBalise commented 1 year ago

Hello spatialsample folks. I am trying to replicate Dr. Silge's vlog, which demonstrates spatialsample, and I get this error:

library(spatialsample)
data("lsl", package = "spDataLarge")

library(tidymodels)
landslides <- as_tibble(lsl)

set.seed(123)
good_folds <- spatial_clustering_cv(landslides, coords = c("x", "y"), v = 5)
#> Error in `spatial_clustering_cv()`:
#> ! `spatial_clustering_cv()` currently only supports `sf` objects.
#> ℹ Try converting `data` to an `sf` object via `sf::st_as_sf()`.

#> Backtrace:
#>     ▆
#>  1. └─spatialsample::spatial_clustering_cv(landslides, coords = c("x", "y"), v = 5)
#>  2.   └─spatialsample:::standard_checks(data, "`spatial_clustering_cv()`")
#>  3.     └─spatialsample:::check_sf(data, calling_function, call)
#>  4.       └─rlang::abort(...)

Created on 2023-01-19 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31)
#>  os       macOS Ventura 13.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2023-01-19
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version    date (UTC) lib source
#>  assertthat      0.2.1      2019-03-21 [1] CRAN (R 4.2.0)
#>  backports       1.4.1      2021-12-13 [1] CRAN (R 4.2.0)
#>  broom         * 1.0.2      2022-12-15 [1] CRAN (R 4.2.0)
#>  class           7.3-20     2022-01-16 [1] CRAN (R 4.2.2)
#>  classInt        0.4-8      2022-09-29 [1] CRAN (R 4.2.0)
#>  cli             3.6.0      2023-01-09 [1] CRAN (R 4.2.0)
#>  codetools       0.2-18     2020-11-04 [1] CRAN (R 4.2.2)
#>  colorspace      2.0-3      2022-02-21 [1] CRAN (R 4.2.0)
#>  DBI             1.1.3      2022-06-18 [1] CRAN (R 4.2.0)
#>  dials         * 1.1.0      2022-11-04 [1] CRAN (R 4.2.0)
#>  DiceDesign      1.9        2021-02-13 [1] CRAN (R 4.2.0)
#>  digest          0.6.31     2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr         * 1.0.10     2022-09-01 [1] CRAN (R 4.2.0)
#>  e1071           1.7-12     2022-10-24 [1] CRAN (R 4.2.0)
#>  evaluate        0.20       2023-01-17 [1] CRAN (R 4.2.0)
#>  fansi           1.0.3      2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap         1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  foreach         1.5.2      2022-02-02 [1] CRAN (R 4.2.0)
#>  fs              1.5.2      2021-12-08 [1] CRAN (R 4.2.0)
#>  furrr           0.3.1      2022-08-15 [1] CRAN (R 4.2.0)
#>  future          1.30.0     2022-12-16 [1] CRAN (R 4.2.0)
#>  future.apply    1.10.0     2022-11-05 [1] CRAN (R 4.2.0)
#>  generics        0.1.3      2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2       * 3.4.0      2022-11-04 [1] CRAN (R 4.2.0)
#>  globals         0.16.2     2022-11-21 [1] CRAN (R 4.2.2)
#>  glue            1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  gower           1.0.1      2022-12-22 [1] CRAN (R 4.2.0)
#>  GPfit           1.0-8      2019-02-08 [1] CRAN (R 4.2.0)
#>  gtable          0.3.1      2022-09-01 [1] CRAN (R 4.2.0)
#>  hardhat         1.2.0      2022-06-30 [1] CRAN (R 4.2.0)
#>  highr           0.10       2022-12-22 [1] CRAN (R 4.2.0)
#>  htmltools       0.5.4      2022-12-07 [1] CRAN (R 4.2.0)
#>  infer         * 1.0.4      2022-12-02 [1] CRAN (R 4.2.0)
#>  ipred           0.9-13     2022-06-02 [1] CRAN (R 4.2.0)
#>  iterators       1.0.14     2022-02-05 [1] CRAN (R 4.2.0)
#>  KernSmooth      2.23-20    2021-05-03 [1] CRAN (R 4.2.2)
#>  knitr           1.41       2022-11-18 [1] CRAN (R 4.2.0)
#>  lattice         0.20-45    2021-09-22 [1] CRAN (R 4.2.2)
#>  lava            1.7.1      2023-01-06 [1] CRAN (R 4.2.0)
#>  lhs             1.1.6      2022-12-17 [1] CRAN (R 4.2.0)
#>  lifecycle       1.0.3      2022-10-07 [1] CRAN (R 4.2.0)
#>  listenv         0.9.0      2022-12-16 [1] CRAN (R 4.2.0)
#>  lubridate       1.9.0      2022-11-06 [1] CRAN (R 4.2.0)
#>  magrittr        2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  MASS            7.3-58.1   2022-08-03 [1] CRAN (R 4.2.2)
#>  Matrix          1.5-3      2022-11-11 [1] CRAN (R 4.2.0)
#>  modeldata     * 1.0.1      2022-09-06 [1] CRAN (R 4.2.0)
#>  munsell         0.5.0      2018-06-12 [1] CRAN (R 4.2.0)
#>  nnet            7.3-18     2022-09-28 [1] CRAN (R 4.2.2)
#>  parallelly      1.34.0     2023-01-13 [1] CRAN (R 4.2.0)
#>  parsnip       * 1.0.3      2022-11-11 [1] CRAN (R 4.2.0)
#>  pillar          1.8.1      2022-08-19 [1] CRAN (R 4.2.0)
#>  pkgconfig       2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  prodlim         2019.11.13 2019-11-17 [1] CRAN (R 4.2.0)
#>  proxy           0.4-27     2022-06-09 [1] CRAN (R 4.2.0)
#>  purrr         * 1.0.1      2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache         0.16.0     2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3     1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo            1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils         2.12.2     2022-11-11 [1] CRAN (R 4.2.0)
#>  R6              2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  Rcpp            1.0.9      2022-07-08 [1] CRAN (R 4.2.0)
#>  recipes       * 1.0.4      2023-01-11 [1] CRAN (R 4.2.0)
#>  reprex          2.0.2      2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang           1.0.6      2022-09-24 [1] CRAN (R 4.2.0)
#>  rmarkdown       2.19       2022-12-15 [1] CRAN (R 4.2.0)
#>  rpart           4.1.19     2022-10-21 [1] CRAN (R 4.2.2)
#>  rsample       * 1.1.1      2022-12-07 [1] CRAN (R 4.2.0)
#>  rstudioapi      0.14       2022-08-22 [1] CRAN (R 4.2.0)
#>  scales        * 1.2.1      2022-08-20 [1] CRAN (R 4.2.0)
#>  sessioninfo     1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  sf              1.0-9      2022-11-08 [1] CRAN (R 4.2.0)
#>  spatialsample * 0.3.0      2023-01-17 [1] CRAN (R 4.2.0)
#>  stringi         1.7.12     2023-01-11 [1] CRAN (R 4.2.0)
#>  stringr         1.5.0      2022-12-02 [1] CRAN (R 4.2.0)
#>  styler          1.9.0      2023-01-15 [1] CRAN (R 4.2.0)
#>  survival        3.5-0      2023-01-09 [1] CRAN (R 4.2.0)
#>  tibble        * 3.1.8      2022-07-22 [1] CRAN (R 4.2.0)
#>  tidymodels    * 1.0.0      2022-07-13 [1] CRAN (R 4.2.0)
#>  tidyr         * 1.2.1      2022-09-08 [1] CRAN (R 4.2.0)
#>  tidyselect      1.2.0      2022-10-10 [1] CRAN (R 4.2.0)
#>  timechange      0.2.0      2023-01-11 [1] CRAN (R 4.2.0)
#>  timeDate        4022.108   2023-01-07 [1] CRAN (R 4.2.0)
#>  tune          * 1.0.1      2022-10-09 [1] CRAN (R 4.2.0)
#>  units           0.8-1      2022-12-10 [1] CRAN (R 4.2.0)
#>  utf8            1.2.2      2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs           0.5.1      2022-11-16 [1] CRAN (R 4.2.0)
#>  withr           2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  workflows     * 1.1.2      2022-11-16 [1] CRAN (R 4.2.0)
#>  workflowsets  * 1.0.0      2022-07-12 [1] CRAN (R 4.2.0)
#>  xfun            0.36       2022-12-21 [1] CRAN (R 4.2.0)
#>  yaml            2.3.6      2022-10-18 [1] CRAN (R 4.2.0)
#>  yardstick     * 1.1.0      2022-09-07 [1] CRAN (R 4.2.0)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

The manual page says the data argument can take a data frame, but as you can see, I am getting an error message saying it only takes an sf object. I tried wrapping the data in sf::st_as_sf(), but then it wanted me to use st_set_crs(). I can keep going down that rabbit hole, but I thought it was worth asking if there is an easy fix. Is there still a way to replicate the results in the vlog?

mikemahoney218 commented 1 year ago

Hey @RaymondBalise ! I've got both a proximate and an ultimate answer for your question:

Ultimate: We moved the data.frame version of this function into the new clustering_cv() function in rsample. There are just too many bad things that can happen when you do spatial operations without spatial classes -- the big one is that, if your data used latitude and longitude for coordinates, the data.frame method would return incorrect results (especially if your data covered larger areas -- but we noticed real differences even for the Ames data). You can use clustering_cv() with data.frame inputs just like you used to use spatial_clustering_cv() (but shouldn't here, see the Proximate answer below) -- and you can also provide your own distance and clustering functions to it, which means you can use it for any domain where splitting data based on "distance" makes sense, not just spatial ones.
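To make the latitude/longitude point concrete, here is a minimal base R sketch (not code from spatialsample or rsample). It compares a naive planar distance on raw degree coordinates with a great-circle distance. The helper `haversine_km()` and the 6371 km spherical-Earth radius are assumptions introduced only for illustration.

```r
# Great-circle (haversine) distance in kilometres between lon/lat points.
# Hypothetical helper; assumes a spherical Earth of radius 6371 km.
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(a))
}

# Two pairs of points, each exactly one degree of longitude apart:
# one pair on the equator, one pair at 60 degrees north.
planar_equator <- sqrt((1 - 0)^2 + (0 - 0)^2)    # "distance" in degrees
planar_60n     <- sqrt((1 - 0)^2 + (60 - 60)^2)  # "distance" in degrees

sphere_equator <- haversine_km(0, 0, 1, 0)       # kilometres
sphere_60n     <- haversine_km(0, 60, 1, 60)     # kilometres

# The planar calculation treats both pairs as equally far apart,
# while the great-circle distance at 60N is about half the equatorial one.
print(c(planar_equator = planar_equator, planar_60n = planar_60n))
print(c(sphere_equator = sphere_equator, sphere_60n = sphere_60n))
```

Any clustering built on the planar numbers would treat those two pairs as equally close, which is the kind of distortion that using sf objects with a CRS avoids.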

Proximate: In this situation, because you are using spatial data, it's best to treat it spatially using sf so that things like the units of your coordinates are preserved. Looking at ?spDataLarge::lsl, I see that the CRS information is listed as CRS: UTM zone 17S; EPSG:32717. Because that's a pre-set EPSG code, we can convert lsl to an sf object by setting crs = 32717 in st_as_sf(). That will let you use spatialsample, and will also make plotting a bit easier, as ggplot2 will automatically infer the correct projection.

So, that said, I recommend doing the first method below, but the second also works for this data (but wouldn't, if your coordinates were in latitude and longitude):

library(spatialsample)
data("lsl", package = "spDataLarge")

# When working with spatial data, it's best to use spatial classes
# to preserve metadata and make sure geometry calculations are done correctly;
# 
# we can do that via spatialsample as so:
set.seed(123)
best_way <- sf::st_as_sf(lsl, coords = c("x", "y"), crs = 32717) |> 
  spatial_clustering_cv(v = 5)

# That said, the old method is still entirely do-able via rsample:
library(rsample)
set.seed(123)
old_way <- clustering_cv(
  lsl,
  c("x", "y"),
  v = 5
)

# These produce equivalent folds -- I'm using a bit of a hack here to confirm
# that yes, the same data are assigned to the assessment set for both methods, 
# but because we aren't buffering or anything that also means the same data
# is assigned to the _analysis_ folds for each method, too:
all(
  vapply(
    seq_len(nrow(best_way)),
    \(i) {
      all(
        complement(get_rsplit(old_way, i)) == complement(get_rsplit(best_way, i))
      )
    },
    logical(1)
  )
)
#> [1] TRUE

Created on 2023-01-19 with reprex v2.0.2
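As a rough illustration of the earlier point about providing your own distance function to clustering_cv(), the sketch below uses made-up, non-spatial data and Manhattan distance. The data frame `toy` and the helper `manhattan_dist()` are hypothetical. `distance_function` and `cluster_function` are the argument names given in the rsample documentation for clustering_cv(), so check ?rsample::clustering_cv against your installed version.

```r
library(rsample)

# Hypothetical toy data: two numeric features and an outcome
set.seed(123)
toy <- data.frame(
  x = runif(60),
  y = runif(60),
  outcome = rnorm(60)
)

# Hypothetical custom distance: takes the selected columns and
# returns a stats::dist() object using Manhattan distance
manhattan_dist <- function(x) stats::dist(x, method = "manhattan")

folds <- clustering_cv(
  toy,
  vars = c("x", "y"),
  v = 4,
  distance_function = manhattan_dist,
  cluster_function = "hclust"
)

# Row indices held out in each fold's assessment set
held_out <- lapply(folds$splits, complement)
lengths(held_out)
```

Each row should land in exactly one assessment set, so the fold sizes sum to the number of rows in `toy`. The sizes themselves are usually uneven because they follow the clusters.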

Thanks for pointing out the docs weren't updated -- I'll fix that now.

RaymondBalise commented 1 year ago

Thank you so much for the quick, excellent reply. I will add the tweak into my materials when I teach.
