rvalavi / blockCV

The blockCV package creates spatially or environmentally separated training and testing folds for cross-validation to provide a robust error estimation in spatially structured environments. See
https://doi.org/10.1111/2041-210X.13107
GNU General Public License v3.0

Error in wk_handle.wk_wkb in cv_spatial #38

Closed zhangzhixin1102 closed 1 year ago

zhangzhixin1102 commented 1 year ago

Dear @rvalavi, Thank you for building such a helpful R package. I want to use spatial cross-validation in my SDMs. However, when I run the blockCV::cv_spatial() function, I encounter an error. Below are my script and the error message.

`library(sf) ## package version 1.0.1
library(blockCV) ## package version 3.0.0

pres.locs.df <- read.csv(file = "species_data.csv")
pres.locs.sf <- sf::st_as_sf(pres.locs.df, coords = c("lon", "lat"), crs = 4326)

scv <- blockCV::cv_spatial(x = pres.locs.sf,
                           selection = "random", # random blocks-to-fold
                           k = 5,
                           iteration = 50,
                           biomod2 = FALSE)`

The error message is as follows:

Error in wk_handle.wk_wkb(wkb, s2_geography_writer(oriented = oriented, :
  Loop 0 is not valid: Edge 0 crosses edge 3
In addition: Warning messages:
1: In st_is_longlat(x) :
  bounding box has potentially an invalid value range for longlat data
2: In st_is_longlat(x) :
  bounding box has potentially an invalid value range for longlat data

I attached my data here. Could you please fix my problem? Thank you for your kind help. Chou

species_data.csv

rvalavi commented 1 year ago

Dear @zhangzhixin1102, thank you for your report and interest in blockCV. The issue is due to the fact that your data is a global dataset and the sf package has difficulty making hexagon polygons for it. One simple solution would be to use rectangular blocks by adding hexagon = FALSE. Let me know if you need the hexagon blocks and I can check in more depth.

zhangzhixin1102 commented 1 year ago

Dear @rvalavi, Thank you so much for the prompt reply. Following your suggestion, I added the hexagon = FALSE argument, and the error was fixed. Thank you again for your kind help. Best, Chou

rvalavi commented 1 year ago

I'm glad that was helpful. For hexagon blocks, you can turn off the spherical geometry of the sf package by sf::sf_use_s2(use_s2 = FALSE) before you run cv_spatial. Check to see which option is sensible in your case. I guess both should be fine...
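A minimal sketch of that second option, not taken from the original thread: it saves the current s2 setting, switches spherical geometry off for the block-creation step, and restores the setting afterwards so that later sf calls are unaffected. The variable name old_s2 is illustrative only.

```r
library(sf)

# Remember the current setting (TRUE by default in sf >= 1.0).
old_s2 <- sf::sf_use_s2()

# Switch to planar geometry before creating hexagon blocks.
sf::sf_use_s2(FALSE)

# ... call blockCV::cv_spatial() with hexagon blocks here ...

# Restore the previous setting once the folds have been created.
sf::sf_use_s2(old_s2)
```

Note that with s2 switched off, sf treats longitude/latitude coordinates as planar, so distances and intersections near the poles or the antimeridian are only approximate.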

zhangzhixin1102 commented 1 year ago

@rvalavi, you are right. I tested the two solutions and both work well, so I am closing this issue. Thank you again for your help.

zhangzhixin1102 commented 1 year ago

Hi @rvalavi, In the previous example the two methods worked fine, so I used sf::sf_use_s2(use_s2 = FALSE) in my study. But after running this for multiple species, it seems that the second approach (hexagon = FALSE) is better. Please check the example below.

`library(sf) ## package version 1.0.1
library(blockCV) ## package version 3.0.0

pres.locs.df <- read.csv(file = "C:/Users/zhang/Downloads/data2.csv")
colnames(pres.locs.df) <- c("lon", "lat") ## change column name
maps::map()
points(pres.locs.df, pch = 20, col = "red")

pres.locs.sf <- sf::st_as_sf(pres.locs.df, coords = c("lon", "lat"), crs = 4326)

sf::sf_use_s2(use_s2 = FALSE)
scv <- blockCV::cv_spatial(x = pres.locs.sf,
                           selection = "random", # random blocks-to-fold
                           k = 5,
                           hexagon = FALSE,
                           iteration = 50,
                           biomod2 = FALSE)`

If we run the script above, we get the following error message:

although coordinates are longitude/latitude, st_intersects assumes that they are planar
Error in blockCV::cv_spatial(x = pres.locs.sf, selection = "random", k = 5, :
  'k' is bigger than the number of spatial blocks: 4.
In addition: Warning message:
In st_is_longlat(x) :
  bounding box has potentially an invalid value range for longlat data

But if we use the second approach, it works fine:

`library(sf) ## package version 1.0.1
library(blockCV) ## package version 3.0.0

pres.locs.df <- read.csv(file = "C:/Users/zhang/Downloads/data2.csv")
colnames(pres.locs.df) <- c("lon", "lat")
maps::map()
points(pres.locs.df, pch = 20, col = "red")

pres.locs.sf <- sf::st_as_sf(pres.locs.df, coords = c("lon", "lat"), crs = 4326)

sf::sf_use_s2(use_s2 = TRUE)
scv <- blockCV::cv_spatial(x = pres.locs.sf,
                           selection = "random", # random blocks-to-fold
                           k = 5,
                           hexagon = FALSE,
                           iteration = 50,
                           biomod2 = FALSE)`

But we get the warning message below:

Warning message:
In blockCV::cv_spatial(x = pres.locs.sf, selection = "random", k = 5, :
  At least 2 of the points are not within the defined spatial blocks

So based on the above results, it seems safer to use hexagon = FALSE when running blockCV for a large number of species. What do you think?

data2.csv

rvalavi commented 1 year ago

Hi @zhangzhixin1102 Thanks for the follow-up check.

Your first error is not due to hexagon blocks! It's probably because the data for that species is sparse and when you use the default setting for cv_spatial with a default rows_cols = c(10, 10), the number of created blocks is less than the number of folds (in your case k = 5). The error message says: 'k' is bigger than the number of spatial blocks: 4.

Choosing the size of blocks is a bit tricky and depends on the study aims and species data. I highly recommend reading blockCV's paper and its cited literature, such as Roberts et al. 2017 Ecography. You can define the block size with the size argument in meters or with rows_cols (the default is rows_cols = c(10, 10)). You can explore this with the cv_block_size function to see what the blocks look like for any specific block size.
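As a rough sketch of the two ways to set the block size (not from the original thread): the synthetic points, the 500 km block size, and the 6 x 6 grid below are illustrative assumptions, not recommendations. Comparing the number of blocks with k is a quick way to anticipate the "'k' is bigger than the number of spatial blocks" error.

```r
library(sf)
library(blockCV)

# Hypothetical presence points over a regional extent (stand-in for real data).
set.seed(1)
pts <- data.frame(lon = runif(200, 0, 20), lat = runif(200, 40, 55))
pts_sf <- sf::st_as_sf(pts, coords = c("lon", "lat"), crs = 4326)

# Option 1: block size given in metres (500 km is an arbitrary choice).
scv_size <- blockCV::cv_spatial(x = pts_sf, k = 5, hexagon = FALSE,
                                size = 500000, selection = "random",
                                iteration = 50, biomod2 = FALSE,
                                plot = FALSE, progress = FALSE)

# Option 2: block size given as a number of rows and columns.
scv_grid <- blockCV::cv_spatial(x = pts_sf, k = 5, hexagon = FALSE,
                                rows_cols = c(6, 6), selection = "random",
                                iteration = 50, biomod2 = FALSE,
                                plot = FALSE, progress = FALSE)

# The number of blocks must be at least k for fold assignment to work.
nrow(scv_size$blocks)
nrow(scv_grid$blocks)
```

For interactive exploration, cv_block_size(x = pts_sf) opens a small app in which the block size can be adjusted and the resulting blocks inspected.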

The hexagon blocks are based on the sf package, which is well aware of spatial coordinate systems. For a dataset that covers the whole globe, it might show issues such as invalid coordinate values, as mentioned in the warning message: In st_is_longlat(x) : bounding box has potentially an invalid value range for longlat data

The second example shows a serious warning: At least 2 of the points are not within the defined spatial blocks. This can result in missing indices in your fold IDs.

My recommendation is:
1- Update blockCV to the latest version, i.e. v3.1-2.
2- Use cv_block_size on your data to define a sensible number of rows and columns for your blocks (for the rows_cols argument).
3- Use cv_spatial with rectangular blocks, i.e. hexagon = FALSE.
4- Use the new argument extend = 2 (or a bit higher) to make sure all your points fall inside the blocks (make sure you don't get the "At least x of the points are not within the defined spatial blocks" warning).
5- If all these don't solve your problem, I recommend using cv_cluster instead to make spatial clusters of points for fold creation. That is less sensitive (or not sensitive) to the coordinate reference system. cv_cluster has two modes: one requires raster files and the other doesn't. Use it WITHOUT raster files. See the example on the main page of the package.

See the vignettes here: https://github.com/rvalavi/blockCV/tree/master#vignettes
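The recommendations above can be sketched roughly as follows. This is an illustration rather than a tested recipe for the attached data: it assumes blockCV v3.1 or later (for the extend argument), uses hypothetical synthetic points in place of the species records, and picks rows_cols = c(6, 6) arbitrarily.

```r
library(sf)
library(blockCV)  # assumed to be v3.1 or later, where `extend` is available

# Hypothetical presence points (stand-in for the real species data).
set.seed(1)
pts <- data.frame(lon = runif(200, 0, 20), lat = runif(200, 40, 55))
pts_sf <- sf::st_as_sf(pts, coords = c("lon", "lat"), crs = 4326)

# Steps 2-4: rectangular blocks, explicit rows/columns, slightly extended extent.
scv <- blockCV::cv_spatial(x = pts_sf, k = 5,
                           hexagon = FALSE,      # rectangular blocks
                           rows_cols = c(6, 6),  # illustrative; tune per dataset
                           extend = 2,           # enlarge extent so edge points fall inside
                           selection = "random", iteration = 50,
                           biomod2 = FALSE, plot = FALSE, progress = FALSE)

# Count records without a fold ID; ideally this is zero.
sum(is.na(scv$folds_ids))

# Step 5 (fallback): cluster the point coordinates directly, without rasters.
scl <- blockCV::cv_cluster(x = pts_sf, k = 5, biomod2 = FALSE)
table(scl$folds_ids)
```

If the count of missing fold IDs is not zero, increasing extend a little or falling back to cv_cluster are the two options suggested above.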