rvalavi / blockCV

The blockCV package creates spatially or environmentally separated training and testing folds for cross-validation to provide a robust error estimation in spatially structured environments. See
https://doi.org/10.1111/2041-210X.13107
GNU General Public License v3.0

train test table and folds_ids report different output #33

Closed Baldl closed 1 year ago

Baldl commented 1 year ago

Hi,

first of all, thank you for this amazing package and the really nice update!

I'm encountering some odd behavior with the function cv_spatial when I pass more than one class in the "column" argument and one or more of the classes is more clustered in space than the other class(es). For example, the column consists of three values: 0, 1, and 2.

The function returns the train and test table, which states that each fold contains at least one data record. However, when I look at folds_ids afterwards, this is not true: fewer folds than reported have been created for the smaller class(es).
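A package-independent way to cross-check a reported train and test table is to rebuild the counts from the fold IDs alone. The following is only a minimal base R sketch: `count_train_test` is a hypothetical helper (not part of blockCV), the toy vectors are made up for illustration, and the `train_`/`test_` column names are an assumption rather than the package's own output format.

```r
# Hypothetical helper (not part of blockCV): rebuild the per-fold train/test
# record counts for every class directly from a vector of fold IDs.
count_train_test <- function(classes, fold_ids, k) {
  cls <- sort(unique(classes))
  # Count the kept records of each class and label them, e.g. "test_0".
  count_by_class <- function(keep, prefix) {
    counts <- vapply(cls, function(cl) sum(keep & classes == cl), integer(1))
    setNames(counts, paste0(prefix, cls))
  }
  rows <- lapply(seq_len(k), function(f) {
    assigned <- !is.na(fold_ids)  # records without a fold count nowhere
    c(count_by_class(assigned & fold_ids != f, "train_"),
      count_by_class(assigned & fold_ids == f, "test_"))
  })
  as.data.frame(do.call(rbind, rows))
}

# Toy data (made up): class 1 only ever lands in fold 1.
toy_class <- c(0, 0, 0, 1, 1, 2, 2, 2, 2)
toy_folds <- c(1, 2, 3, 1, 1, 1, 2, 3, 3)

rebuilt <- count_train_test(toy_class, toy_folds, k = 3)
print(rebuilt)

# Folds in which class 1 has no test records
which(rebuilt$test_1 == 0)
```

With the real objects, the inputs would be data$occ and blocks$folds_ids; any row of the rebuilt table that disagrees with the table printed by cv_spatial points to the discrepancy described above.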

Here is a reproducible example:

library(sf)
# sf_1.0-9

library(blockCV)
# blockCV_3.0-2

set.seed(123)

presence <- sf::st_as_sf(
  data.frame(occ = 1, x = runif(100, -75.4, -74), y = runif(100, 39.6, 41)),
  coords = c("x", "y"), crs = "EPSG:4326"
)

absence <- sf::st_as_sf(
  data.frame(occ = 0, x = runif(100, -75.4, -74), y = runif(100, 39.6, 41)),
  coords = c("x", "y"), crs = "EPSG:4326"
)

background <- sf::st_as_sf(
  data.frame(occ = 2, x = runif(10000, -80.4, -74), y = runif(10000, 39.6, 41)),
  coords = c("x", "y"), crs = "EPSG:4326"
)

data <- rbind(presence, absence, background)
rm(presence, absence, background)

blocks <- blockCV::cv_spatial(x = data, column = "occ", k = 7L, size = 70000)

# Number of distinct folds that each class ended up in
data$folds <- blocks$folds_ids
dplyr::n_distinct(data[data$occ == 0, ]$folds)
dplyr::n_distinct(data[data$occ == 1, ]$folds)
dplyr::n_distinct(data[data$occ == 2, ]$folds)

sessionInfo()


SessionInfo:
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] blockCV_3.0-2 sf_1.0-9

loaded via a namespace (and not attached):
[1] Rcpp_1.0.10 rstudioapi_0.14 magrittr_2.0.3 units_0.8-1 munsell_0.5.0 tidyselect_1.2.0
[7] colorspace_2.1-0 R6_2.5.1 rlang_1.0.6 fansi_1.0.4 s2_1.1.2 dplyr_1.1.0
[13] wk_0.7.1 tools_4.2.2 grid_4.2.2 gtable_0.3.1 KernSmooth_2.23-20 utf8_1.2.3
[19] cli_3.6.0 e1071_1.7-13 DBI_1.1.3 withr_2.5.0 class_7.3-20 tibble_3.1.8
[25] lifecycle_1.0.3 farver_2.1.1 ggplot2_3.4.1 vctrs_0.5.2 glue_1.6.2 proxy_0.4-27
[31] compiler_4.2.2 pillar_1.8.1 scales_1.2.1 generics_0.1.3 classInt_0.4-9 pkgconfig_2.0.3


If I use the function on just one of the datasets (e.g. class 0, 1, or 2), it works fine and all data records are assigned to a fold. The problem only occurs when I pass all of them to the function together.

I'm not sure whether I'm just using the function incorrectly or whether this is an actual issue, but some feedback from you would be much appreciated.

I hope my problem is understandable; please let me know if you need any clarification.

Best, Lisa

rvalavi commented 1 year ago

Hi Lisa, thanks for the report and your interest in using blockCV. I need to check this in detail. I'll update you with the results soon.

Baldl commented 1 year ago

Thank you very much!

rvalavi commented 1 year ago

Hi @Baldl

Thanks again for the report. That was indeed a bug, and I have now fixed it. It was very hard to find, but the fix was easy. Please update to blockCV v3.0.3 and check again.
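Before re-running the example, it may help to confirm that the installed version really is the fixed one. This is only a sketch using base R version comparison; the variable names are illustrative, and the `requireNamespace` guard simply skips the check when blockCV is not installed.

```r
# Version that contains the fix, and the version from the report's session info.
required <- package_version("3.0.3")
reported <- package_version("3.0-2")  # parsed as 3.0.2

# Is the reported version already new enough?
reported_is_fixed <- reported >= required
print(reported_is_fixed)

# Check the locally installed version, if blockCV is available at all.
if (requireNamespace("blockCV", quietly = TRUE)) {
  installed <- utils::packageVersion("blockCV")
  if (installed < required) {
    message("blockCV ", installed, " is older than ", required, "; please update.")
  }
}
```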

library(ggplot2)

ggplot() +
  geom_sf(data = data, aes(col = as.factor(occ), alpha = 1 / (occ + 1))) +
  geom_sf(data = blocks$blocks, fill = NA) +
  geom_sf_text(data = blocks$blocks, aes(label = folds))

dplyr::n_distinct(data[data$folds == 1, ]$occ)
dplyr::n_distinct(data[data$folds == 2, ]$occ)
dplyr::n_distinct(data[data$folds == 3, ]$occ)
dplyr::n_distinct(data[data$folds == 4, ]$occ)
dplyr::n_distinct(data[data$folds == 5, ]$occ)
dplyr::n_distinct(data[data$folds == 6, ]$occ)
dplyr::n_distinct(data[data$folds == 7, ]$occ)
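The same per-fold check can also be done in one step with a cross-tabulation of class against fold, which shows both views (folds per class and classes per fold) at once. This is a minimal base R sketch on made-up toy vectors; with the real objects, data$occ and data$folds would take their place and k would be 7.

```r
# Toy data (made up): class 1 only ever lands in fold 1.
toy_occ   <- c(0, 0, 0, 1, 1, 2, 2, 2, 2)
toy_folds <- c(1, 2, 3, 1, 1, 1, 2, 3, 3)
k <- 3

# Rows are classes, columns are folds; fixing the factor levels keeps
# a column for every fold even if no record was assigned to it.
fold_by_class <- table(
  class = toy_occ,
  fold  = factor(toy_folds, levels = seq_len(k))
)
print(fold_by_class)

# Number of folds in which each class has at least one record
folds_per_class <- rowSums(fold_by_class > 0)
print(folds_per_class)

# Class/fold combinations that contain no records at all
missing_cells <- which(fold_by_class == 0, arr.ind = TRUE)
print(missing_cells)
```

Any class whose entry in folds_per_class is smaller than k is missing from at least one fold.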


Please let me know if there are any other issues.

I'm closing this issue.