tidymodels / spatialsample

Create and summarize spatial resampling objects 🗺
https://spatialsample.tidymodels.org

Unable to unnest the splits column #157

Closed anjelinejeline closed 8 months ago

anjelinejeline commented 9 months ago

Hi, I would like to unnest the rsplit object, but I am not able to do it. This is my code:

set.seed(123)
cluster_folds=spatial_clustering_cv(out_drivers_sf_norm, v = 10)

class(cluster_folds)
autoplot(cluster_folds)

cluster_folds |> as.data.frame() |> tidyr::unnest(c(splits))
Error in `list_sizes()`:
! `x[[1]]` must be a vector, not a <spatial_clustering_split/spatial_rsplit/rsplit> object.
Backtrace:
 1. tidyr::unnest(as.data.frame(cluster_folds), c(splits))
 2. tidyr:::unnest.data.frame(as.data.frame(cluster_folds), c(splits))
 3. tidyr::unchop(...)
 4. tidyr:::df_unchop(...)
 5. vctrs::list_sizes(col)

mikemahoney218 commented 9 months ago

Does this go away if you call library(sf) at the top of your script? Sorry, away from a computer so I can't test this myself, but that should work.

EmilHvitfeldt commented 9 months ago

Hello @anjelinejeline 👋

What are you expecting to get back when applying unnest() here? I don't know the rsample packages as much as @mikemahoney218, but I don't see that as something that these packages support.

anjelinejeline commented 9 months ago

@mikemahoney218 no, unfortunately it does not go away. BTW @EmilHvitfeldt, I am trying to unlist the column with the fold data. I am also struggling to create spatial clusters of equal size. I need equal-sized folds to use the predict function of a spatial regression model, as it is not possible to predict on a dataset of a different size. Can you help me with that too? Is there a function in this package I could use?

mikemahoney218 commented 9 months ago

Sorry - Emil was more careful than I was and understood the actual problem better :)

So the key issue here is that there's not really a column that contains "the fold data" as you might expect. If you're interested, I wrote a blog post a while back about the internals of the objects in rsample and spatialsample, but the key thing is that the splits column doesn't actually contain the data assigned to each fold, but rather the row indices of the assessment set for each split of your data. So "unnesting" here doesn't make a ton of sense, because you don't want to unnest those indices; you want (I think!) a record of what row belongs to what assessment set.
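To make the index point concrete, here is a minimal sketch using plain rsample on the built-in mtcars data (both are stand-ins chosen for illustration, not part of the original report). It assumes rsample's `as.integer()` method for rsplit objects, which returns row positions, while `assessment()` materialises the rows themselves on request:

```r
library(rsample)

set.seed(123)
# mtcars is only a stand-in data set for illustration
folds <- vfold_cv(mtcars, v = 4)
first_split <- folds$splits[[1]]

# Row positions (into mtcars) held out for assessment in the first fold
assess_idx <- as.integer(first_split, data = "assessment")

# Row positions used for fitting in the first fold
analysis_idx <- as.integer(first_split, data = "analysis")

# The actual rows are only built when you ask for them
assess_rows <- assessment(first_split)

length(assess_idx)
nrow(assess_rows)
```

Because each element of the splits column is an object like `first_split` rather than a vector or data frame, `tidyr::unnest()` has nothing it knows how to expand.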

So the easiest way to get that, assuming I understand what you're looking for, is to get each assessment set separately, give it an identifier, and then combine those into a single table.

For example, say we've got some rset object that looks like this:

set.seed(123)
library(spatialsample)
nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))

cluster_folds=spatial_clustering_cv(nc, v = 10)

autoplot(cluster_folds)

We could use the following code to pull out what row belongs to what fold (and obviously, drop the ggplot2 code if you just want the output data frame):

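# Build one sf data frame per fold, tag it with the fold number, then stack them
lapply(
  seq_len(nrow(cluster_folds)),
  function(fold) {
    # Rows held out for assessment in this fold
    get_rsplit(cluster_folds, fold) |> 
      assessment() |> 
      dplyr::mutate(fold = fold)
  }
) |> 
  do.call(what = rbind) |> 
  # Optional: map each row, coloured by the fold it is assessed in
  ggplot2::ggplot(ggplot2::aes(fill = factor(fold))) + 
  ggplot2::geom_sf()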

Created on 2024-02-02 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.0 (2023-04-21)
#>  os       macOS Ventura 13.3.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2024-02-02
#>  pandoc   3.1.11 @ /opt/homebrew/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version date (UTC) lib source
#>  class           7.3-22  2023-05-03 [1] CRAN (R 4.3.0)
#>  classInt        0.4-9   2023-02-28 [1] CRAN (R 4.3.0)
#>  cli             3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
#>  codetools       0.2-19  2023-02-01 [1] CRAN (R 4.3.0)
#>  colorspace      2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
#>  curl            5.0.2   2023-08-14 [1] CRAN (R 4.3.0)
#>  DBI             1.1.3   2022-06-18 [1] CRAN (R 4.3.0)
#>  digest          0.6.33  2023-07-07 [1] CRAN (R 4.3.0)
#>  dplyr           1.1.2   2023-04-20 [1] CRAN (R 4.3.0)
#>  e1071           1.7-13  2023-02-01 [1] CRAN (R 4.3.0)
#>  evaluate        0.21    2023-05-05 [1] CRAN (R 4.3.0)
#>  fansi           1.0.4   2023-01-22 [1] CRAN (R 4.3.0)
#>  farver          2.1.1   2022-07-06 [1] CRAN (R 4.3.0)
#>  fastmap         1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>  fs              1.6.3   2023-07-20 [1] CRAN (R 4.3.0)
#>  furrr           0.3.1   2022-08-15 [1] CRAN (R 4.3.0)
#>  future          1.33.0  2023-07-01 [1] CRAN (R 4.3.0)
#>  generics        0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
#>  ggplot2         3.4.2   2023-04-03 [1] CRAN (R 4.3.0)
#>  globals         0.16.2  2022-11-21 [1] CRAN (R 4.3.0)
#>  glue            1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
#>  gtable          0.3.3   2023-03-21 [1] CRAN (R 4.3.0)
#>  highr           0.10    2022-12-22 [1] CRAN (R 4.3.0)
#>  htmltools       0.5.6   2023-08-10 [1] CRAN (R 4.3.0)
#>  KernSmooth      2.23-22 2023-07-10 [1] CRAN (R 4.3.0)
#>  knitr           1.43    2023-05-25 [1] CRAN (R 4.3.0)
#>  lifecycle       1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
#>  listenv         0.9.0   2022-12-16 [1] CRAN (R 4.3.0)
#>  magrittr        2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>  munsell         0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
#>  parallelly      1.36.0  2023-05-26 [1] CRAN (R 4.3.0)
#>  pillar          1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
#>  proxy           0.4-27  2022-06-09 [1] CRAN (R 4.3.0)
#>  purrr           1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache         0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3     1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo            1.25.0  2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils         2.12.2  2022-11-11 [1] CRAN (R 4.3.0)
#>  R6              2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
#>  Rcpp            1.0.11  2023-07-06 [1] CRAN (R 4.3.0)
#>  reprex          2.0.2   2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang           1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
#>  rmarkdown       2.24    2023-08-14 [1] CRAN (R 4.3.0)
#>  rsample         1.1.1   2022-12-07 [1] CRAN (R 4.3.0)
#>  rstudioapi      0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
#>  s2              1.1.4   2023-05-17 [1] CRAN (R 4.3.0)
#>  scales          1.2.1   2022-08-20 [1] CRAN (R 4.3.0)
#>  sessioninfo     1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>  sf              1.0-14  2023-07-11 [1] CRAN (R 4.3.0)
#>  spatialsample * 0.5.1   2023-11-08 [1] CRAN (R 4.3.1)
#>  styler          1.10.1  2023-06-05 [1] CRAN (R 4.3.0)
#>  tibble          3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyr           1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
#>  tidyselect      1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
#>  units           0.8-3   2023-08-10 [1] CRAN (R 4.3.0)
#>  utf8            1.2.3   2023-01-31 [1] CRAN (R 4.3.0)
#>  vctrs           0.6.3   2023-06-14 [1] CRAN (R 4.3.0)
#>  withr           2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
#>  wk              0.7.3   2023-05-06 [1] CRAN (R 4.3.0)
#>  xfun            0.40    2023-08-09 [1] CRAN (R 4.3.0)
#>  xml2            1.3.5   2023-07-06 [1] CRAN (R 4.3.0)
#>  yaml            2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

Let me know if that isn't what you're trying to accomplish, but I think this is how you get what you're looking for.
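If all you need is a lookup of which row lands in which fold, an index-based variant avoids stacking sf data frames. This is only a sketch under some assumptions: it relies on rsample's `as.integer()` method for rsplit objects applying to spatialsample splits as well, and `fold_of_row` and `fold_table` are names made up for illustration.

```r
library(spatialsample)

set.seed(123)
nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
cluster_folds <- spatial_clustering_cv(nc, v = 10)

# fold_of_row[i] will hold the fold whose assessment set contains row i of nc
fold_of_row <- rep(NA_integer_, nrow(nc))

for (fold in seq_len(nrow(cluster_folds))) {
  this_split <- get_rsplit(cluster_folds, fold)
  # Row positions held out in this fold; no rows or geometries are copied
  held_out <- as.integer(this_split, data = "assessment")
  fold_of_row[held_out] <- fold
}

# Plain data frame: one row per observation, without the geometry column
fold_table <- data.frame(row = seq_len(nrow(nc)), fold = fold_of_row)

# Number of observations assessed in each fold
table(fold_table$fold)
```

The counts from `table()` also show how uneven the cluster-based folds are, which is relevant to the equal-size question.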

As for "create spatial clusters with equal size":

This isn't something we currently support in spatialsample directly. Would you be able to link the package (or paper, or so on) that you're using that has this restriction? What happens if the number of data points is a prime number, and so can't be divided evenly into folds?

What you could do is pass a custom function to the cluster_function argument. That custom function can use whatever logic you want in order to enforce that all folds are of equal size. Hopefully the function documentation (especially the Details section) is helpful in describing what that function needs to accept and return -- but let me know if it isn't and if I can help clarify anything.
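As a rough illustration of what such a function might look like, here is a sketch of one possible heuristic. It assumes the interface is as described in the Details section (a `dist` object, the number of folds `v`, and `...` in; one cluster label per row out), so please check that against the documentation for your installed version. `equal_size_clusters` is a hypothetical name, and the approach (ordering points along a single axis from classical multidimensional scaling, then cutting that ordering into equal-count chunks) is just one simple option, not something spatialsample provides.

```r
# Hypothetical helper: assigns points to v clusters whose sizes differ by at most one
equal_size_clusters <- function(dists, v, ...) {
  # Place every point on one axis that approximates the pairwise distances
  axis <- stats::cmdscale(dists, k = 1)[, 1]
  n <- length(axis)
  # Rank points along that axis, then cut the ranking into v contiguous chunks
  ranks <- rank(axis, ties.method = "first")
  as.integer(ceiling(ranks * v / n))
}

# Stand-alone check on random coordinates (made-up data, for illustration only)
set.seed(123)
pts <- matrix(stats::runif(200), ncol = 2)
clusters <- equal_size_clusters(stats::dist(pts), v = 10)
table(clusters)
```

Under the assumption above, this could then be tried as `spatial_clustering_cv(nc, v = 10, cluster_function = equal_size_clusters)`. When the number of rows is not a multiple of `v` (a prime number of rows, for instance), cluster sizes differ by one, so exactly equal folds would still require dropping or padding rows.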

mikemahoney218 commented 8 months ago

I'm going to go ahead and close this out -- please feel free to open a new issue if we didn't wind up fixing the core problem here!