rvalavi / blockCV

The blockCV package creates spatially or environmentally separated training and testing folds for cross-validation to provide a robust error estimation in spatially structured environments. See
https://doi.org/10.1111/2041-210X.13107
GNU General Public License v3.0
106 stars 22 forks source link

Reproducibility between v2 and v3 #32

Closed pat-s closed 1 year ago

pat-s commented 1 year ago

I am having trouble achieving reproducible fold IDs between spatialBlock() and cv_spatial().

I haven't found any remarks that both versions/functions would not yield identical fold IDs (at least with specific argument settings). Am I doing something wrong?

library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(blockCV)
#> blockCV 3.0.0

sf_use_s2(use_s2 = FALSE)
#> Spherical geometry (s2) switched off

set.seed(123)
x <- runif(1000, -80.4, -74)
y <- runif(1000, 39.6, 41)

data <- data.frame(
  spp = "test",
  label = factor(round(runif(length(x), 0, 1))),
  x = x,
  y = y
)

data_sf <- sf::st_as_sf(data,
  coords = c("x", "y"),
  crs = "EPSG:4326"
)

sp_block <- suppressWarnings(suppressMessages(
  spatialBlock(
    speciesData = data_sf,
    k = 5,
    rows = 3,
    cols = 4,
    showBlocks = FALSE,
    verbose = FALSE,
    progress = FALSE,
    seed = 42
  )
))

cv_spatial <- suppressWarnings(suppressMessages(
  blockCV::cv_spatial(
    x = data_sf,
    k = 5,
    rows_cols = c(3, 4),
    hexagon = FALSE,
    plot = FALSE,
    verbose = FALSE,
    report = FALSE,
    progress = FALSE,
    seed = 42
  )
))

all.equal(sp_block$foldID, cv_spatial$folds_ids)
#> [1] "Mean relative difference: 0.5878274"

Created on 2023-03-01 with reprex v2.0.2

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.2.2 (2022-10-31) #> os macOS Ventura 13.2.1 #> system aarch64, darwin20 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Zurich #> date 2023-03-01 #> pandoc 3.1 @ /opt/homebrew/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> blockCV * 3.0-0 2023-02-06 [1] CRAN (R 4.2.0) #> class 7.3-20 2022-01-16 [2] CRAN (R 4.2.2) #> classInt 0.4-8 2022-09-29 [1] CRAN (R 4.2.0) #> cli 3.6.0 2023-01-09 [1] CRAN (R 4.2.0) #> codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.2) #> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0) #> digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.0) #> dplyr 1.1.0 2023-01-29 [1] CRAN (R 4.2.0) #> e1071 1.7-13 2023-02-01 [1] CRAN (R 4.2.0) #> evaluate 0.20 2023-01-17 [1] CRAN (R 4.2.0) #> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.0) #> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0) #> fs 1.6.1 2023-02-06 [1] CRAN (R 4.2.0) #> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0) #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0) #> htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.0) #> KernSmooth 2.23-20 2021-05-03 [2] CRAN (R 4.2.2) #> knitr 1.42 2023-01-25 [1] RSPM (R 4.2.2) #> lattice 0.20-45 2021-09-22 [1] CRAN (R 4.2.1) #> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0) #> pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.1) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0) #> proxy 0.4-27 2022-06-09 [1] CRAN (R 4.2.0) #> purrr 1.0.1 2023-01-10 [1] CRAN (R 4.2.0) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.0) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0) #> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0) #> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.2.0) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0) #> raster 3.6-14 2023-01-16 [1] CRAN (R 4.2.0) #> Rcpp 1.0.10 2023-01-22 [1] CRAN (R 4.2.0) #> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0) #> rgdal 1.6-4 2023-01-12 [1] CRAN (R 4.2.2) #> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0) #> rmarkdown 2.20 2023-01-19 [1] CRAN (R 4.2.0) #> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0) #> sf * 1.0-9 2022-11-08 [1] CRAN (R 4.2.0) #> sp 1.6-0 2023-01-19 [1] CRAN (R 4.2.0) #> styler 1.9.0.9000 2023-02-13 [1] Github (r-lib/styler@e6fadbf) #> terra 1.7-3 2023-01-24 [1] CRAN (R 4.2.0) #> tibble 3.1.8 2022-07-22 [1] CRAN (R 4.2.0) #> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0) #> units 0.8-1 2022-12-10 [1] CRAN (R 4.2.0) #> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.0) #> vctrs 0.5.2 2023-01-23 [1] CRAN (R 4.2.0) #> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0) #> xfun 0.37 2023-01-31 [1] CRAN (R 4.2.0) #> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.0) #> #> [1] /Users/pjs/Library/R/arm64/4.2/library #> [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library #> #> ────────────────────────────────────────────────────────────────────────────── ```
rvalavi commented 1 year ago

I have now changed a couple of lines to make the fold numbering also the same, but the issue is that the blocks in v2 are created using the raster package and the v3 using sf. Polygon's indices start from the top-right in the first one and from the bottom-left in the second! Let me check if I can fix it.

pat-s commented 1 year ago

I see. From a reproducibility perspective I'd argue this would be quite important. Otherwise this would be a hard break and users often won't even notice as they will just upgrade and then all blocks will change subsequently.

Maybe there's a way to indicate the starting point in one of the package implementations. Another alternative could be to look into using terra instead of sf for this task and hope for better reproducibility between raster and terra?

If there's no way at all than there should be a top-level message for this in the NEWS file highlighting the incompatibility. What's your view on this and your preferred solution?

I am holding back the mlr3spatiotempcv update until there's a solution for this discussion. NB: there's also a JSS paper (currently in review) attached to it, so appreciated if the discussion could progress actively :)

rvalavi commented 1 year ago

I appreciate your report and encouragement for the reproducibility of the package outputs. I fully support practices to improve reproducibility. This may require some time to test and implement these solutions and check the compatibility of the dependent packages. Unfortunately, my free time is a bit limited in the next couple of weeks. I'll let you with any updates.

rvalavi commented 1 year ago

Using terra for spatial blocks was already in my plan. I think that would be the way to go! I couldn't find a way to reindex the sf object (there might be one, but I could find it).

pat-s commented 1 year ago

@rvalavi Moving forward, I've created a gist which uses {terra} instead of {raster} and the approach from v2 to create the polygons.

https://gist.github.com/pat-s/d2ab388e3c581e6db1a57aeca2ae8e4b

I've followed the rasterNet() function of yours and was able to fully rewrite it to {terra} and check the similarity of the resulting objects. If you still need {sf} down the line, you could parse the {terra} object to {sf} and continue. Yet I would expect the polygons to be compatible with v2 then as they were initially created with the same raster/terra logic (speaking of the indexing issue).

Would that help for you to continue? And maybe push a new version that uses {terra} internally to ensure reproducibility?

Let me know what you think and what your plans are time-wise - a timely reply would help a lot here WRT to communicating with the journal and to users :)

rvalavi commented 1 year ago

Thanks @pat-s for the conversion and testing. I'm free this weekend and I will check your code and match all the results with v2. I was away in the past couple of weeks and did not have much time for it. Having this outside of my current job makes it harder to find free time for changes and updates.

If after the update, the result matches the one with version 2 and is consistent with your package I will submit an update to CRAN soon.

pat-s commented 1 year ago

Thanks @pat-s for the conversion and testing. I'm free this weekend and I will check your code and match all the results with v2.

Thanks, appreciated!

Having this outside of my current job makes it harder to find free time for changes and updates.

Same here, so no worries. We're all in the same boat ;) Thanks for considering my thoughts and the rewrite approach, I know it's additional work for you but I guess quite important in many ways 🙂

rvalavi commented 1 year ago

Hi @pat-s the transition has been made and now the results completely match. Not an easy solution but a better one :D

I also added another parameter named extend that allows enlarging the blocks to make sure all points fall within blocks.

I'm closing this issue and make a submission to CRAN in the next couple of days.

Thanks again for the report!

pat-s commented 1 year ago

That's good news @rvalavi, thanks for your time and understanding! 🎉