rvalavi / blockCV

The blockCV package creates spatially or environmentally separated training and testing folds for cross-validation to provide a robust error estimation in spatially structured environments. See
https://doi.org/10.1111/2041-210X.13107
GNU General Public License v3.0

`cv_buffer()` returns correct number of folds but not all contain a presence point #42

Closed bcknr closed 7 months ago

bcknr commented 7 months ago

Hello - I am trying to use blockCV to create folds for multiple species with cv_buffer and cv_spatial, which I then use to fit RF and GLM models. When I use cv_buffer (see below), the number of folds returned equals the number of presence points; however, some folds contain no presence locations ('1' in occ_spp), only background records (0s).

Here is the code used to call cv_buffer. It is part of a larger loop that creates folds for many individual species-level models.

      range <- cv_spatial_autocor(rast(sa_path), x = occ_spp,
                                  column = colnames(occ_sf)[i],
                                  plot = FALSE, progress = FALSE)

      folds <- cv_buffer(x = occ_spp, column = colnames(occ_sf)[i], 
                         presence_bg = TRUE, add_bg = TRUE, size = range$range,
                         progress = FALSE, report = FALSE)$folds_list

I'm inclined to assume that I've parameterized something wrong, so I would appreciate guidance. If needed I can share the data for one of the species that shows this behavior.

Thank you!
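A quick way to see which folds are affected is to count presences in each fold. The sketch below is only an illustration in base R: it assumes `folds_list` has the layout blockCV documents (each fold is a list with training indices first and testing indices second), and it uses a small hypothetical `response` vector and hand-made folds in place of the real `occ_spp` data. The thread does not say whether the empty part is the training or the testing set, so both are counted.

```r
# Toy stand-in for cv_buffer(...)$folds_list (assumed layout: each fold is
# list(training indices, testing indices)). All values are hypothetical.
response <- c(1, 1, 0, 0, 0, 1)  # 1 = presence, 0 = background
folds_list <- list(
  list(c(3L, 4L, 5L, 6L), c(1L, 2L)),   # training keeps one presence (index 6)
  list(c(3L, 4L, 5L), c(2L, 1L, 6L)),   # training holds background only
  list(c(1L, 2L, 3L), c(6L, 4L, 5L))
)

# Count presences in the training and testing part of every fold.
count_presences <- function(folds_list, response) {
  data.frame(
    fold = seq_along(folds_list),
    train_presences = vapply(folds_list,
                             function(f) sum(response[f[[1]]] == 1), integer(1)),
    test_presences = vapply(folds_list,
                            function(f) sum(response[f[[2]]] == 1), integer(1))
  )
}

fold_summary <- count_presences(folds_list, response)

# Folds whose training set contains no presence record at all.
empty_train <- fold_summary$fold[fold_summary$train_presences == 0]
print(fold_summary)
```

With real data, `response` would be the presence/background column of `occ_spp`, and folds listed in `empty_train` could be skipped or flagged before model fitting.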

rvalavi commented 7 months ago

Hi @bcknr, What you're doing is correct and the cv_buffer behavior seems correct to me. This is probably the reality of your data. I assume your species records are closer to each other than the range of spatial autocorrelation in your raster data.
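The point about records lying closer together than the autocorrelation range can be checked directly. The following is a minimal sketch, not blockCV's internal code: it assumes planar coordinates in metres, uses hypothetical presence locations, and treats `buffer_size` as a stand-in for `range$range`. For each focal presence it counts how many other presences fall outside the buffer and so could still be used for training.

```r
# Hypothetical presence coordinates in a projected CRS (metres).
presences <- data.frame(
  x = c(0, 4000, 8000),
  y = c(0, 0, 0)
)
buffer_size <- 5000  # stand-in for range$range (assumption)

# Pairwise Euclidean distances between all presence points.
d <- as.matrix(dist(presences[, c("x", "y")]))

# For each focal presence, count the other presences outside its buffer.
# The diagonal is 0, so a point never counts itself.
train_presences <- rowSums(d > buffer_size)

# Focal points whose buffer swallows every other presence record.
no_training_presence <- which(train_presences == 0)
print(train_presences)
```

In this toy layout the middle point has both neighbours inside its buffer, so its fold would be left without presences for training, while the two outer points each keep one.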

A couple of notes:

rvalavi commented 7 months ago

BTW, you can pass the path to your rasters directly to the `r` argument in blockCV, e.g. `range <- cv_spatial_autocor(sa_path, ...)` instead of `range <- cv_spatial_autocor(rast(sa_path), ...)`. Whichever you prefer.

bcknr commented 7 months ago

Thanks, @rvalavi. That all makes sense; the data I am working with has come with quite a few challenges. I appreciate the suggestions! cv_nndm did work better for many of the species that I was having trouble with.

rvalavi commented 7 months ago

No problem, @bcknr. I'm glad that it helped. Check the variograms of your cv_spatial_autocor output. They might give you a sense of why your range is so large. If nothing else solves your issues, my final suggestion would be to reduce the size, which is not ideal but could add more species to your analysis. There is a trade-off here anyway...
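To get a feel for that trade-off, one could tabulate how many presence points keep at least one other presence outside the buffer across a few candidate sizes. This is a rough base-R sketch under the same assumptions as before (planar coordinates in metres, hypothetical points and sizes); it approximates the buffering idea and does not call blockCV.

```r
# Hypothetical presence coordinates in a projected CRS (metres).
presences <- data.frame(x = c(0, 4000, 8000, 9000), y = c(0, 0, 0, 0))
d <- as.matrix(dist(presences))

# Candidate buffer sizes to compare (hypothetical values).
candidate_sizes <- c(2000, 6000, 10000)

# For each size, count focal presences that still have at least one
# other presence outside their buffer.
usable_folds <- vapply(candidate_sizes, function(size) {
  sum(rowSums(d > size) > 0)
}, integer(1))

size_table <- data.frame(size = candidate_sizes, usable_folds = usable_folds)
print(size_table)
```

Smaller sizes keep more folds usable, at the cost of weaker spatial separation between training and testing records.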

Good luck! I'm closing this issue.