rvalavi / blockCV

The blockCV package creates spatially or environmentally separated training and testing folds for cross-validation to provide a robust error estimation in spatially structured environments. See
https://doi.org/10.1111/2041-210X.13107
GNU General Public License v3.0
106 stars 22 forks

Integration of partitioning functions into mlr #1

Closed pat-s closed 4 years ago

pat-s commented 5 years ago

Hi RV (and all other authors),

Thanks for this great work - a package that is very much needed in the spatial modeling community! It is also great that you published it on GitHub and maintain it openly! I am the maintainer of sperrorest and also an author of mlr - a modeling framework similar to biomod2, which you show as the last example in the vignette. While biomod2 is tailored towards the species distribution community, it would be great to also have the functionality in a framework addressing the whole spatial modeling community, i.e. mlr.

In mlr we have had the "k-means clustering" approach from sperrorest integrated for a few months now: http://mlr-org.github.io/mlr/articles/tutorial/devel/handling_of_spatial_data.html

mlr is, together with caret, the most popular modeling-framework package in R. It would be great if we could join efforts and integrate parts of the functionality into mlr. Combining efforts is also one of the reasons why we decided to deprecate sperrorest and integrate its functionality into a bigger framework that is maintained by more people. What is your opinion on this?

I am very happy to support you in this direction and help with possible issues.

For an example of using mlr in spatial modeling, see also https://arxiv.org/abs/1803.11266.

jannes-m commented 5 years ago

@rvalavi it's a shame that we didn't have time to chat yesterday. I didn't know that you were the creator of the blockCV package. But at least you got a better idea of the mlr approach yesterday. Integrating your approach into mlr as well would have the advantage that it could be used with hundreds of different modeling approaches in a streamlined way.

rvalavi commented 5 years ago

@pat-s Hi Patrick,

Sorry for the late reply. I don't know why I did not receive any notification about this comment! That's a great idea! Indeed, we tried to create blockCV in a way that would be useful for spatial modelling in general.

Yesterday, @jannes-m gave an impressive presentation at the GEOSTAT workshop in Prague. It's interesting how you implemented spatial cross-validation in the mlr package. I completely agree with you: this functionality should be available to a wider range of users.

I would be glad to collaborate on this. I am happy to make any changes needed for this integration.

rvalavi commented 5 years ago

@jannes-m thank you for your great presentations yesterday. Yes, it was an intensive program and there was little time to chat. Great idea! I would be happy to help with integrating the functionality of blockCV into mlr. I will try to add more approaches to the package and maybe, in the future, methods for handling spatio-temporal data.

I also would be glad to hear any comments and new suggestions regarding the current state of the package.

pat-s commented 5 years ago

@rvalavi Great that you also see the value here!

A collection of spatial resampling methods that are easily available to users has been missing for too long already!

Your package makes a great start by providing the methods in the first place. To have even more impact, they should be integrated in frameworks like mlr :)

It's probably easiest to have a voice call about all of this. If you're up for it, just write Jannes and me a mail to schedule one.

A few comments about the methods:

spatial blocking

In mlr we already have "blocking". It means using pre-defined indices in resampling that should not be separated during fold creation. The user can also set a flag that completely uses the pre-defined indices for partitioning.
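To make the idea concrete, here is a minimal sketch of group-wise fold assignment in plain Python. It is not mlr's implementation: the function name `grouped_folds` and the greedy size-balancing rule are assumptions made purely for illustration. The only property it is meant to show is that observations sharing a pre-defined index never end up in different folds.

```python
from collections import defaultdict


def grouped_folds(group_ids, k):
    """Return a fold index (0..k-1) per observation such that all
    observations with the same group id share one fold.

    Hypothetical helper for illustration; not part of mlr or blockCV.
    """
    # Collect the observation indices that belong to each group.
    members = defaultdict(list)
    for idx, gid in enumerate(group_ids):
        members[gid].append(idx)

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest observations.
    fold_sizes = [0] * k
    fold_of = [None] * len(group_ids)
    for gid in sorted(members, key=lambda g: -len(members[g])):
        target = fold_sizes.index(min(fold_sizes))
        for idx in members[gid]:
            fold_of[idx] = target
        fold_sizes[target] += len(members[gid])
    return fold_of


groups = ["a", "a", "b", "c", "c", "c", "d"]
folds = grouped_folds(groups, k=2)
```

Using each group directly as its own fold (the "flag" case mentioned above) would correspond to setting `k` to the number of distinct groups.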

Your "blocking" idea is a bit different as it creates spatial blocks. To integrate it in mlr, we would definitely need to rename it. Not sure how big this problem would be as you refer to this name in your publication most likely.

Also you require raster objects to create the spatial blocks. I see the need for it in your example in the vignette. The implementation will not be so simple as we would also need to pass a raster layer in addition to the coordinates.
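As a rough illustration of what coordinate-only spatial blocks could look like, the sketch below bins points into square grid cells and then assigns cells to folds. It is an assumption-laden simplification, not how blockCV builds its blocks: the names `spatial_block_ids` and `blocks_to_folds`, the grid origin at the minimum coordinates, and the systematic cell-to-fold rule are all hypothetical, and no raster layer is involved.

```python
import math


def spatial_block_ids(coords, block_size):
    """Map each (x, y) point to the (column, row) of a square grid cell.

    The grid origin is the minimum x and y of the points (an assumption;
    a raster-based approach would take the extent from the raster).
    """
    x0 = min(x for x, _ in coords)
    y0 = min(y for _, y in coords)
    return [
        (math.floor((x - x0) / block_size), math.floor((y - y0) / block_size))
        for x, y in coords
    ]


def blocks_to_folds(block_ids, k):
    """Assign whole blocks to folds systematically (block i -> fold i mod k)."""
    unique_blocks = sorted(set(block_ids))
    fold_of_block = {b: i % k for i, b in enumerate(unique_blocks)}
    return [fold_of_block[b] for b in block_ids]


coords = [(0, 0), (1, 1), (12, 0), (13, 2), (0, 11), (11, 12)]
block_ids = spatial_block_ids(coords, block_size=10)
folds = blocks_to_folds(block_ids, k=2)
```

Points in the same cell always share a fold; how cells are distributed over folds (systematic, random, or balanced) is a separate design choice.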

environmental block

The "environmental block" is already implemented in mlr following the idea of Brenning (2012).

I am wondering how much confusion multiple spatial clustering methods with only small differences will trigger in the spatial modeling world. You also require a raster layer for the "environmental block" which makes it a bit more difficult.

> Since variables with wider ranges of values might dominate the clusters and bias the environmental clustering (Hastie et al., 2009), all the input rasters are first standardized within the function.
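The effect of standardizing can be shown with a small sketch in plain Python. The variable names and values (`elevation`, `ndvi`) are made up for illustration and the z-score shown here is one common form of standardization; the source does not say exactly which form blockCV applies.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale values to zero mean and unit (population) standard deviation.

    Assumes the column is not constant, otherwise the division fails.
    """
    m = mean(column)
    s = pstdev(column)
    return [(v - m) / s for v in column]


# Hypothetical covariates on very different scales.
elevation = [100.0, 900.0, 1700.0, 2500.0]
ndvi = [0.2, 0.4, 0.6, 0.8]

# Squared difference between the first two sites, per variable.
raw_elev = (elevation[0] - elevation[1]) ** 2  # elevation dominates
raw_ndvi = (ndvi[0] - ndvi[1]) ** 2

z_elev = standardize(elevation)
z_ndvi = standardize(ndvi)
std_elev = (z_elev[0] - z_elev[1]) ** 2  # comparable after scaling
std_ndvi = (z_ndvi[0] - z_ndvi[1]) ** 2
```

Before scaling, the elevation term is several orders of magnitude larger than the NDVI term, so a distance-based clustering would effectively ignore NDVI. After scaling, both variables contribute equally in this example.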

You do the clustering on the input variables, whereas Brenning (2012) uses the coordinates. Have you ever considered the approach of Brenning (2012) in detail? Is there an advantage in using the covariates instead of the coordinates for the clustering?

buffering

Should be the easiest method to implement. This could be a good starting point.
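A minimal sketch of the buffering idea, in plain Python, under the assumption that it means leave-one-out testing where every training point within a given distance of the test point is dropped. The function name `buffered_loo` and the strict "farther than `radius`" rule are illustrative choices, not blockCV's or mlr's API.

```python
import math


def buffered_loo(coords, radius):
    """Build leave-one-out splits with an exclusion buffer.

    For each test point, the training set holds only the points whose
    Euclidean distance to the test point is greater than `radius`.
    Returns a list of (train_indices, test_indices) tuples.
    """
    splits = []
    for i, (xi, yi) in enumerate(coords):
        train = [
            j
            for j, (xj, yj) in enumerate(coords)
            if j != i and math.hypot(xj - xi, yj - yi) > radius
        ]
        splits.append((train, [i]))
    return splits


coords = [(0, 0), (1, 0), (5, 0), (10, 0)]
splits = buffered_loo(coords, radius=2)
```

Because it needs only coordinates and a distance threshold, this variant does not depend on a raster layer, which is part of why it looks like the simplest one to port.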

pat-s commented 4 years ago

Update: We implemented all of blockCV's resampling functions in mlr3spatiotemporal.

Things are not completely done yet. We still need to add some examples and polish everything. I'll let you know once we are ready. Just FYI, we also support visualization.

Since we would like to release to CRAN at some point, what are your plans regarding this? Since we depend on your pkg, yours would need to go first. Otherwise we would need to copy all of your code to be able to release to CRAN (which I would like to avoid ofc).

rvalavi commented 4 years ago

Hi @pat-s and @be-marc

Thank you for writing the code. The visualisation looks nice. I had a very quick look at the code, looks good, but I might be able to help you improve it.

I plan to push blockCV to CRAN and will try to do this in the next couple of weeks. I also want to update all spatial functions to sf functions. I don't think this will cause any problems for the mlr SpCV functions.

Please let me know if you need any help.

Regards, Roozbeh

pat-s commented 4 years ago

> I had a very quick look at the code, looks good, but I might be able to help you improve it.

Improvements are welcome any time, just open a PR :)

> I plan to push blockCV to CRAN and will try to do this in the next couple of weeks. I also want to update all spatial functions to sf functions. I don't think this will cause any problems for the mlr SpCV functions.

Sounds good, looking forward to it :)

rvalavi commented 4 years ago

Sure! I will keep you updated :)

rvalavi commented 4 years ago

Hi @pat-s and @be-marc

After 10 days of coding, blockCV is finally updated! I rewrote the package almost from scratch. I tried to keep the package output consistent with the previous version.

Could you check the consistency with your mlr code? I will push it to CRAN as soon as you give me feedback.

FYI: the function spatialBlock can now search for evenly distributed records in training and testing folds for binary and multi-class responses.

pat-s commented 4 years ago

Thanks! I'll have it on my list - though I will be busy traveling in the next two weeks so I do not know when I will have time to get to it.

rvalavi commented 4 years ago

No worries! All the outputs are generated in the same format and with the same names. No arguments have changed, so I don't think there will be any inconsistency with mlr.

pat-s commented 4 years ago

For all future readers: {blockCV} functions are supported in https://github.com/mlr-org/mlr3spatiotempcv.