`cv_buffer()` returns correct number of folds but not all contain a presence point

rvalavi / blockCV

The blockCV package creates spatially or environmentally separated training and testing folds for cross-validation to provide a robust error estimation in spatially structured environments. See https://doi.org/10.1111/2041-210X.13107

GNU General Public License v3.0


`cv_buffer()` returns correct number of folds but not all contain a presence point #42

Closed bcknr closed 7 months ago

bcknr commented 7 months ago

Hello - I am trying to use blockCV to create folds for multiple species with cv_buffer and cv_spatial, which I'm using to fit RF and GLM models. When I use cv_buffer (see below), the number of folds returned equals the number of presence points; however, some folds contain no presence locations (1s in occ_spp), only background records (0s).

Here is the code used to call cv_buffer. It is part of a larger loop that creates folds for many individual species-level models.

      range <- cv_spatial_autocor(rast(sa_path), x = occ_spp,
                                  column = colnames(occ_sf)[i],
                                  plot = FALSE, progress = FALSE)

      folds <- cv_buffer(x = occ_spp, column = colnames(occ_sf)[i], 
                         presence_bg = TRUE, add_bg = TRUE, size = range$range,
                         progress = FALSE, report = FALSE)$folds_list

I'm inclined to assume that I've parameterized something wrong, so I would appreciate guidance. If needed I can share the data for one of the species that shows this behavior.

Thank you!

rvalavi commented 7 months ago

Hi @bcknr, What you're doing is correct and the cv_buffer behavior seems correct to me. This is probably the reality of your data. I assume your species records are closer to each other than the range of spatial autocorrelation in your raster data.

A couple of notes:

Sometimes the spatial autocorrelation range in raster covariates is very high, especially for interpolated global climate data. So I recommend using something else to set the size here.
An alternative approach would be to use cv_nndm instead of cv_buffer. They are very similar, but in cv_nndm the buffer size is adaptive: it tries to match it to the distances that you're going to predict to. The largest buffer size will still be the size you provide, but there is a higher chance that it solves your problem (unless your species data are too clustered).
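To see which folds are affected, whichever function produced them, one option is to count the presences on each side of every fold. The following is a minimal base R sketch, not part of blockCV: `count_presences` is a hypothetical helper, and it assumes each element of folds_list is a list with the training row indices first and the testing row indices second. The toy data stand in for a real response column.

```r
# Hypothetical helper, not part of blockCV.
# Assumes each fold is a list whose first element holds training row indices
# and whose second element holds testing row indices.
count_presences <- function(folds_list, response) {
  n_pres <- function(idx) sum(response[idx] == 1)
  data.frame(
    fold  = seq_along(folds_list),
    train = vapply(folds_list, function(f) n_pres(f[[1]]), integer(1)),
    test  = vapply(folds_list, function(f) n_pres(f[[2]]), integer(1))
  )
}

# Toy data: rows 1-2 are presences, rows 3-5 are background
response <- c(1, 1, 0, 0, 0)
toy_folds <- list(
  list(c(3, 4, 5), c(1, 2)),  # both presences excluded from training
  list(c(2, 3, 4), c(1, 5))   # one presence left for training
)

counts <- count_presences(toy_folds, response)
counts$fold[counts$train == 0]  # folds whose training set has no presences
```

With the objects from the question, `response` would presumably be the species column of occ_spp and `folds_list` the list returned by cv_buffer or cv_nndm; species with too many empty folds could then be flagged before model fitting.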

rvalavi commented 7 months ago

BTW, you can just use the path to your rasters for the r arguments in blockCV, e.g. range <- cv_spatial_autocor(sa_path, ...) instead of range <- cv_spatial_autocor(rast(sa_path), ...). Whichever you prefer.

bcknr commented 7 months ago

Thanks, @rvalavi. That all makes sense; the data I am working with have come with quite a few challenges. I appreciate the suggestions! cv_nndm did work better for many of the species that I am having trouble with.

rvalavi commented 7 months ago

No problem, @bcknr. I'm glad that it helped. Check the variograms from your cv_spatial_autocor. They might give you a sense of why your range is high. If nothing else solves your issues, my final suggestion would be to reduce the size, which is not ideal but could add more species to your analysis. There is a trade-off here anyway...
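A rough way to gauge that trade-off before re-running the folds is to count, for each presence point, how many other presences fall outside a candidate buffer size. This is only a base R sketch under stated assumptions: `presences_outside_buffer` is a hypothetical helper, the coordinates are assumed to be projected (planar) and in the same units as size, and blockCV's own distance calculation may differ.

```r
# Hypothetical helper, not part of blockCV.
# Assumes `coords` is a two-column matrix of projected coordinates
# in the same units as `size`, and that `size` is positive.
presences_outside_buffer <- function(coords, size) {
  d <- as.matrix(dist(coords))  # pairwise Euclidean distances
  rowSums(d > size)             # diagonal is 0, so a point never counts itself
}

# Toy presences: three clustered points and one distant point
coords <- rbind(c(0, 0), c(100, 0), c(0, 100), c(5000, 5000))

left_small <- presences_outside_buffer(coords, size = 1000)
left_large <- presences_outside_buffer(coords, size = 10000)
```

Any point with a count of zero would have no other presences left for training at that size, so comparing the counts across a few candidate sizes shows how much the buffer has to shrink before a species becomes usable.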

Good luck! I'm closing this issue.