tidymodels / planning

Documents to plan and discuss future development
MIT License
37 stars 4 forks source link

Using tidymodels to predict species distributions #26

Closed jamesgrecian closed 1 year ago

jamesgrecian commented 1 year ago

Species Distribution Models or Ecological Niche Models are a broad range of tools for estimating the distribution of a focal species based on (typically) environmental covariates.

There are a number of specific R packages available to fit these, typically focused on specific statistical methodologies (e.g. dismo).

I can see broad interest from the SDM community for using tidymodels as a route to exploring a broad range of statistical models/ engines within a familiar tidy workflow. For example, I am currently trying to build an ensemble of 4 models (lm, gam, rf and brt) using a tidymodel approach to estimate the distribution of ~30 marine species. One reason for choosing a tidy approach is being able to train models using spatial cross-validation via the spatialsample package.

However, I'm struggling to understand how best to generate predictions from tidymodel workflows that would fit back into the standard SDM workflow.

Typically we check the interpretability of a model through (1) partial dependence plots of specific covariates and (2) by generating spatial predictions from models using covariates from the geographic domain of interest e.g. raster::predict or terra::predict.

I can see a route to do (1) via DALEX following the Mario Kart guide from @juliasilge here but am struggling to find documentation on how to do (2). Can I pass the workflow object directly to predict or should I create an explainer object first?

Here's some pseudo-code for my current workflow for 2:


# fit the final model to the data (after cross-validation and tuning)
final_fit <- final_model |> fit(my_data)

# generate DALEX explainer
rf_explainer <- explain_tidymodels(final_fit, data = my_data, y = my_data$response)

# load a covariate raster stack and convert to xyz tibble
covs <- readRDS("data/covariate_stack.rds") |> 
     rasterToPoints(covs) |> 
     as_tibble()

# generate predictions for each xyz from the covariate stack
spatial_predictions <- predict(rf_explainer, test)
mikemahoney218 commented 1 year ago

Howdy @jamesgrecian -- sorry to miss this originally, this repo doesn't see a ton of activity.

If I understand your question correctly: you can pass workflows and explainers to terra::predict() pretty easily. Depending on your workflow, you might need to write a wrapper function (for the fun argument) that extracts a vector of predictions from the workflow output, or use the index argument to subset the returned data frame:

## Packages and other setup:
set.seed(123)

library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
library(terra)
#> terra 1.7.29
library(DALEXtra)
#> Loading required package: DALEX
#> Welcome to DALEX (version: 2.4.3).
#> Find examples and detailed introduction at: http://ema.drwhy.ai/
library(tidymodels)

## Data prep:
# pak::pkg_install("Nowosad/spDataLarge")
data("lsl", "study_mask", package = "spDataLarge")
ta <- terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
lsl <- lsl |> 
  st_as_sf(coords = c("x", "y"), crs = "EPSG:32717")

## Fit some workflow 
glm_model <- logistic_reg() |> 
  set_engine("glm") |> 
  set_mode("classification")

glm_wflow <- workflow() |> 
  add_formula(lslpts ~ slope + cplan + cprof + elev + log10_carea) |> 
  add_model(glm_model) |> 
  fit(lsl)

glm_explainer <- explain_tidymodels(
  glm_wflow, 
  data = lsl, 
  y = as.logical(as.character(lsl$lslpts))
)
#> Preparation of a new explainer is initiated
#>   -> model label       :  workflow  (  default  )
#>   -> data              :  350  rows  7  cols 
#>   -> target variable   :  350  values 
#>   -> predict function  :  yhat.workflow  will be used (  default  )
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package tidymodels , ver. 1.0.0 , task classification (  default  ) 
#>   -> model_info        :  Model info detected classification task but 'y' is a logical . Converted to numeric.  (  NOTE  )
#>   -> predicted values  :  numerical, min =  0.00233623 , mean =  0.5 , max =  0.9858769  
#>   -> residual function :  difference between y and yhat (  default  )
#>   -> residuals         :  numerical, min =  -0.9858769 , mean =  -7.765776e-10 , max =  0.9780077  
#>   A new explainer has been created!

## Predicting is easy via terra:
terra::predict(ta, glm_explainer) |> plot()


## Though you might need to wrap the prediction function 
## to guarantee you get a vector of predictions back:
try(terra::predict(ta, glm_wflow))
#> Error in out[[i]] <- data.frame(value = 1:length(out[[i]]), label = out[[i]]) : 
#>   more elements supplied than there are to replace
terra::predict(
  ta, 
  glm_wflow, 
  fun = \(model, object) predict(model, object)$.pred_class
) |> 
  plot()

# Or alternatively, use the `index` argument to select which columns from the output should be rasterized. Note that we lost the data type here, though, because `glm_wflow` was returning factors which got auto-converted into integers:
terra::predict(
  ta, 
  glm_wflow,
  index = 1
) |> plot()

Created on 2023-04-26 with reprex v2.0.2

Does that answer what you're asking?

jamesgrecian commented 1 year ago

Thanks @mikemahoney218, this is great!

I was getting confused before because when using explain_tidymodels and model_profile to generate a partial dependence plot the outputs will differ with each call to the functions depending on the number of samples used.

I'm used to simply getting the mean prediction from a model prediction object. Based on the DALEX documentation it wasn't clear what to do in this case.

Thanks again