tidymodels / spatialsample

Create and summarize spatial resampling objects 🗺
https://spatialsample.tidymodels.org
Other
71 stars 5 forks source link

Using spatialsample in mgcv::gam #139

Closed jamesgrecian closed 1 year ago

jamesgrecian commented 1 year ago

Hi @mikemahoney218,

I've been using the spatialsample package to perform spatial cross-validation on an ensemble of species distribution models. However, I've got stuck when trying to fit a GAM using mgcv::gam.

It looks like the tidymodel workflow set up for fitting GAMs is a little different to linear models, RFs of BRTs. So I don't know if this is an issue with passing spatialsample objects to GAMs or an issue with how I've formulated the gam workflow...?

Here's an example attempt to fit a gam to a spatialsample cross validation object using tidy models. You'll see the linear regression model workflow works, but the only way I can get a gam to work is to use lapply and fit.

Thanks,

James

# load libraries
library(tidymodels)
library(spatialsample)

# define a linear model
lm_spec <-
  linear_reg() |>
  set_mode("regression")

# this works with linear model
workflow() |> 
  add_model(lm_spec) |> 
  add_formula(canopy_area_2019 ~ land_area * mean_temp) |> 
  fit_resamples(vfold_cv(spatialsample::boston_canopy))
#> # Resampling results
#> # 10-fold cross-validation 
#> # A tibble: 10 × 4
#>    splits           id     .metrics         .notes          
#>    <list>           <chr>  <list>           <list>          
#>  1 <split [613/69]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]>
#>  2 <split [613/69]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]>
#>  3 <split [614/68]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]>
#>  4 <split [614/68]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]>
#>  5 <split [614/68]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]>
#>  6 <split [614/68]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]>
#>  7 <split [614/68]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]>
#>  8 <split [614/68]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]>
#>  9 <split [614/68]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]>
#> 10 <split [614/68]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]>

# define a gam
gam_spec <- 
  gen_additive_mod() |> 
  set_engine("mgcv") |>
  set_mode("regression")

# this doesn't work with a gam
workflow() |>
  add_model(gam_spec) |>
  add_formula(canopy_area_2019 ~ s(land_area) + s(mean_temp)) |> 
  fit_resamples(vfold_cv(spatialsample::boston_canopy))
#> → A | error:   invalid type (list) for variable 's(land_area)'
#> There were issues with some computations   A: x1
#> There were issues with some computations   A: x10
#> 
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more
#> information.
#> # Resampling results
#> # 10-fold cross-validation 
#> # A tibble: 10 × 4
#>    splits           id     .metrics .notes          
#>    <list>           <chr>  <list>   <list>          
#>  1 <split [613/69]> Fold01 <NULL>   <tibble [1 × 3]>
#>  2 <split [613/69]> Fold02 <NULL>   <tibble [1 × 3]>
#>  3 <split [614/68]> Fold03 <NULL>   <tibble [1 × 3]>
#>  4 <split [614/68]> Fold04 <NULL>   <tibble [1 × 3]>
#>  5 <split [614/68]> Fold05 <NULL>   <tibble [1 × 3]>
#>  6 <split [614/68]> Fold06 <NULL>   <tibble [1 × 3]>
#>  7 <split [614/68]> Fold07 <NULL>   <tibble [1 × 3]>
#>  8 <split [614/68]> Fold08 <NULL>   <tibble [1 × 3]>
#>  9 <split [614/68]> Fold09 <NULL>   <tibble [1 × 3]>
#> 10 <split [614/68]> Fold10 <NULL>   <tibble [1 × 3]>
#> 
#> There were issues with some computations:
#> 
#>   - Error(s) x10: invalid type (list) for variable 's(land_area)'
#> 
#> Run `show_notes(.Last.tune.result)` for more information.

# try writing formula more explicitly and adding variables
workflow() |>
  add_variables(outcomes = c(canopy_area_2019),
                predictors = c(land_area, mean_temp)) |>
  add_model(gam_spec,
            formula = canopy_area_2019 ~ s(land_area) + s(mean_temp)) |> 
  fit_resamples(vfold_cv(spatialsample::boston_canopy))
#> → A | error:   Not all columns of `y` are known outcome types. These columns have unknown types: 'geometry'.
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more
#> information.
#> # Resampling results
#> # 10-fold cross-validation 
#> # A tibble: 10 × 4
#>    splits           id     .metrics .notes          
#>    <list>           <chr>  <list>   <list>          
#>  1 <split [613/69]> Fold01 <NULL>   <tibble [1 × 3]>
#>  2 <split [613/69]> Fold02 <NULL>   <tibble [1 × 3]>
#>  3 <split [614/68]> Fold03 <NULL>   <tibble [1 × 3]>
#>  4 <split [614/68]> Fold04 <NULL>   <tibble [1 × 3]>
#>  5 <split [614/68]> Fold05 <NULL>   <tibble [1 × 3]>
#>  6 <split [614/68]> Fold06 <NULL>   <tibble [1 × 3]>
#>  7 <split [614/68]> Fold07 <NULL>   <tibble [1 × 3]>
#>  8 <split [614/68]> Fold08 <NULL>   <tibble [1 × 3]>
#>  9 <split [614/68]> Fold09 <NULL>   <tibble [1 × 3]>
#> 10 <split [614/68]> Fold10 <NULL>   <tibble [1 × 3]>
#> 
#> There were issues with some computations:
#> 
#>   - Error(s) x10: Not all columns of `y` are known outcome types. These columns hav...
#> 
#> Run `show_notes(.Last.tune.result)` for more information.

# this works but is far from ideal
fit_list <- lapply(vfold_cv(spatialsample::boston_canopy)$splits,
       FUN = function(x) gam_spec |> fit(canopy_area_2019 ~ s(land_area) + s(mean_temp),
                                         analysis(x)))

Created on 2023-05-03 with reprex v2.0.2

mikemahoney218 commented 1 year ago

I'm personally not super familiar with mgcv, so I asked around for help on this one, and @simonpcouch was awesome enough to explain what's going on here for me :smile: That also means I can't give a ton of other details here myself (though I can ask around and try to find out!) as I don't know the internals of what's happening here.

But, basically: when using mgcv you need to provide two formula arguments. The first one in add_model() is the formula that mgcv will see, and that's where you define your smooths. The second formula goes in add_formula(), and that's what functions from recipes (and other pre-processors) will see. Because pre-processors don't understand what to do with objects produced by s(), that formula needs to not include s().

So, to make this work, try the following:

library(tidymodels)
library(spatialsample)

gam_spec <- gen_additive_mod(mode = "regression")

workflow() |> 
  add_model(
    gam_spec,
    formula = canopy_area_2019 ~ s(land_area) + s(mean_temp)
  ) |> 
  add_formula(
    canopy_area_2019 ~ land_area + mean_temp
  ) |> 
  fit_resamples(spatial_clustering_cv(boston_canopy))
#> # Resampling results
#> # 10-fold spatial cross-validation 
#> # A tibble: 10 × 4
#>    splits           id     .metrics         .notes          
#>    <list>           <chr>  <list>           <list>          
#>  1 <split [589/93]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]>
#>  2 <split [645/37]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]>
#>  3 <split [637/45]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]>
#>  4 <split [613/69]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]>
#>  5 <split [598/84]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]>
#>  6 <split [614/68]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]>
#>  7 <split [587/95]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]>
#>  8 <split [615/67]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]>
#>  9 <split [622/60]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]>
#> 10 <split [618/64]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]>

Created on 2023-05-03 with reprex v2.0.2

This is wonky and the team knows it's wonky -- there's a related issue on this over at https://github.com/tidymodels/parsnip/issues/770 , and plans to work on better docs & error handling for mgcv models in the future.

Let me know if that helps :smile:

jamesgrecian commented 1 year ago

Thanks @mikemahoney218

I was getting tripped up with my own dataset as I was converting the response to a factor on the fly which tidy models doesn't seem to like - something along the lines of this:

workflow() |> 
  add_model(
    gam_spec,
    formula = factor(canopy_area_2019) ~ s(land_area) + s(mean_temp)
  ) |> 
  add_formula(
    factor(canopy_area_2019) ~ land_area + mean_temp
  ) |> 
  fit_resamples(spatial_clustering_cv(boston_canopy))

If I go back and reformat the response column, then regenerate the spatialsamples::spatial_block_cv object, this runs fine and returns the expected output.

Thanks again,

James

mikemahoney218 commented 1 year ago

Awesome, glad to see it works :smile:

github-actions[bot] commented 1 year ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.