rvalavi / blockCV

The blockCV package creates spatially or environmentally separated training and testing folds for cross-validation to provide a robust error estimation in spatially structured environments. See
https://doi.org/10.1111/2041-210X.13107
GNU General Public License v3.0

Reproducibility v2 and v3 - take2 #35

Closed pat-s closed 1 year ago

pat-s commented 1 year ago

(sorry for opening another issue but I can't reopen the old one)

@rvalavi I've done some testing on my side and atm I think there is no reproducibility between v2 and v3 yet. I've created the below reprex to showcase it.

Also I see in the release notes of 3.1.1 that spatialBlock() now uses cv_spatial() internally. This sounds like it uses the new version at its core and only wraps it, rather than executing the function the way it was done in v2.1.4? Arguably, this is even more troublesome for users, as they might assume that spatialBlock() still yields the "old" results while in fact it "only" wraps cv_spatial() and does something else.

The reprex below sets seed = 42 and hexagon=FALSE which are two defaults that have changed in the new version but were not specifically highlighted in the changelog. But even with these I am unable to reproduce the old indices.

Am I doing something wrong? How can I reproduce the v2 results using the toy SF data in the reprex?

remotes::install_version("blockCV", "2.1.4")
#> Trying https://stat.ethz.ch/CRAN/
#> Downloading package from url: https://stat.ethz.ch/CRAN//src/contrib/Archive/blockCV/blockCV_2.1.4.tar.gz
#> Installing package into '/Users/pjs/Library/R/arm64/4.2/library'
#> (as 'lib' is unspecified)

library(blockCV)
packageVersion("blockCV")
#> [1] '2.1.4'
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

set.seed(123)
x <- runif(1000, -80.4, -74)
y <- runif(1000, 39.6, 41)

data <- data.frame(
  spp = "test",
  label = factor(round(runif(length(x), 0, 1))),
  x = x,
  y = y
)

data_sf <- sf::st_as_sf(data,
  coords = c("x", "y"),
  crs = "EPSG:4326"
)

# spatial blocking by specified range and random assignment
sb1 <- spatialBlock(
  speciesData = data_sf,
  theRange = 70000,
  k = 5,
  selection = "random",
  progress = FALSE,
  verbose = FALSE,
  showBlocks = FALSE
)

sb1$foldID[1:10]
#>  [1] 1 5 3 5 5 2 5 1 4 5

detach("package:blockCV", unload = TRUE)

install.packages("blockCV")
#> Installing package into '/Users/pjs/Library/R/arm64/4.2/library'
#> (as 'lib' is unspecified)
#> 
#> The downloaded binary packages are in
#>  /var/folders/nr/x23mhfm55616f3w8xd0lwmdh0000gn/T//RtmpXgoBt1/downloaded_packages
packageVersion("blockCV")
#> [1] '3.1.1'
library(blockCV)
#> Warning in get(method, envir = home): internal error -3 in R_decompress1

#> Warning in get(method, envir = home): internal error -3 in R_decompress1

#> Warning in get(method, envir = home): internal error -3 in R_decompress1

#> Warning in get(method, envir = home): internal error -3 in R_decompress1

#> Warning in get(method, envir = home): internal error -3 in R_decompress1

#> Warning in get(method, envir = home): internal error -3 in R_decompress1

#> Warning in get(method, envir = home): internal error -3 in R_decompress1

#> Warning in get(method, envir = home): internal error -3 in R_decompress1

#> Warning in get(method, envir = home): internal error -3 in R_decompress1

#> Warning in get(method, envir = home): internal error -3 in R_decompress1
#> blockCV 3.1.1

set.seed(123)
x <- runif(1000, -80.4, -74)
y <- runif(1000, 39.6, 41)

data <- data.frame(
  spp = "test",
  label = factor(round(runif(length(x), 0, 1))),
  x = x,
  y = y
)

data_sf <- sf::st_as_sf(data,
  coords = c("x", "y"),
  crs = "EPSG:4326"
)

sb2 <- blockCV::cv_spatial(
  x = data_sf,
  size = 70000,
  k = 5,
  selection = "random",
  progress = FALSE,
  plot = FALSE,
  hexagon = FALSE, # required for reproducibility to old version
  seed = 42 # required for reproducibility to old version
)
#> 
#>   train test
#> 1   821  179
#> 2   846  154
#> 3   758  242
#> 4   791  209
#> 5   784  216

sb2$folds_ids[1:10]
#>  [1] 3 4 3 2 2 4 4 3 1 3

Created on 2023-04-13 with reprex v2.0.2
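
For readers who want to quantify the mismatch rather than eyeball it, here is a minimal base-R sketch. It uses only the ten fold IDs printed in the reprex above. The helper same_partition() is hypothetical (it is not part of blockCV) and is included on the assumption that it helps to distinguish "different fold numbers" from "different grouping of points".

```r
# First ten fold IDs copied from the reprex output above.
v2_ids <- c(1, 5, 3, 5, 5, 2, 5, 1, 4, 5) # sb1$foldID[1:10], blockCV 2.1.4
v3_ids <- c(3, 4, 3, 2, 2, 4, 4, 3, 1, 3) # sb2$folds_ids[1:10], blockCV 3.1.1

# Strict check: is the fold number the same for every point?
identical(v2_ids, v3_ids)

# Share of points that received the same fold number in both versions.
mean(v2_ids == v3_ids)

# Hypothetical helper (not part of blockCV): TRUE when two fold vectors
# group the points identically, even if the fold numbers are permuted.
same_partition <- function(a, b) {
  tab <- table(a, b)
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

same_partition(v2_ids, v3_ids)
```

In practice the full vectors could be written with saveRDS() in the v2 session and read back with readRDS() in the v3 session before running the same checks.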

rvalavi commented 1 year ago

Based on my tests, the results were the same. This shouldn't be a big problem. I'll fix it.

pat-s commented 1 year ago

@rvalavi Just checking in, were you able to do some tests already? Or can you give a rough outline when you might be able to take a look?

rvalavi commented 1 year ago

@pat-s sorry, caught again in a busy time. I'll fix it this week.

rvalavi commented 1 year ago

The difference seems to come from the way set.seed is used internally. The last time I checked against the latest version of the code it was OK, but it seems the code had already been updated since then. I'm checking how I can make the results identical.
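
To illustrate why seeding and iteration counts matter here, the following is a rough base-R sketch. assign_folds() is a hypothetical stand-in and NOT blockCV's actual code. It only assumes that fold assignment draws from R's random number generator once per iteration, so that both the point at which the seed is set and the number of draws affect the final result.

```r
# Hypothetical stand-in for a random fold assignment (not blockCV's code):
# shuffles fold labels `iteration` times and keeps the last shuffle.
assign_folds <- function(n_blocks, k, iteration, seed = NULL) {
  # An internal seed resets the RNG state regardless of the caller's seed.
  if (!is.null(seed)) set.seed(seed)
  folds <- NULL
  for (i in seq_len(iteration)) {
    folds <- sample(rep_len(seq_len(k), n_blocks))
  }
  folds
}

# Same seed and same number of draws give the same assignment.
set.seed(42)
a <- assign_folds(20, 5, iteration = 100)
set.seed(42)
b <- assign_folds(20, 5, iteration = 100)
identical(a, b)

# Seeding inside the function is equivalent to seeding right before the call.
d <- assign_folds(20, 5, iteration = 100, seed = 42)
identical(a, d)

# A different iteration count consumes the RNG stream differently, so the
# final assignment is expected to differ even with the same seed.
set.seed(42)
e <- assign_folds(20, 5, iteration = 50)
identical(a, e)
```

Under these assumptions, matching an older version requires the same seed, the same point at which it is set, and the same number of random draws before the final assignment.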

rvalavi commented 1 year ago

@pat-s after a few attempts, the results finally match. It was not a difficult problem, but it was tricky to find. I also increased the iteration default to 100 to match v2.1.4.

If all is good, I'll go ahead and submit an update to CRAN.

pat-s commented 1 year ago

Thanks! I was just about to check, but it seems #37 needs to be resolved first. It looks like a valid error to me at first glance, but I might be wrong.

pat-s commented 1 year ago

Hi Roozbeh,

I've taken another look and my tests do indeed show reproducibility with v2! I haven't checked all subfunctions yet (only spatialBlock and spatialEnv) but I think it's good now :) Otherwise I'll comment here again, but I think you can go ahead WRT a new release.

Thanks again for your patience and understanding!

rvalavi commented 1 year ago

Hi Patrick,

I'm glad that the new version worked. I'm always up for promoting reproducibility. Thanks again for checking and reporting issues.