neural-structured-additive-learning / deepregression

GNU General Public License v3.0

Generating one parameter per `factor` level of categorical predictors #28

Open maarten-jung opened 9 months ago

maarten-jung commented 9 months ago

When trying to set up a deepregression::deepregression model with one parameter per level of a factor, I ran into the problems illustrated below.

Running

library(deepregression)

set.seed(42)

n <- 100
y <- rnorm(n)
x <- gl(10, 10)
d <- data.frame(x)

m_dr_1 <- deepregression(y = y,
                         list_of_formulas = list(
                           location = ~ x,
                           scale = ~ 1),
                         data = d)

seems to work fine and generates a model with 1 intercept and 9 additional coefficients, one for each of the non-reference levels:

coef(m_dr_1)
# $x
# [,1]
# [1,] -0.47599542
# [2,] -0.73255873
# [3,] -0.62333083
# [4,] -0.40610617
# [5,] -0.43092820
# [6,] -0.54474354
# [7,]  0.67287409
# [8,]  0.73756588
# [9,]  0.09474695
# 
# $`(Intercept)`
# [,1]
# [1,] 1.244688

But with the usual ~ 0 + x or ~ -1 + x syntax, we get only 9 coefficients instead of the 10 (one per factor level) that I expected:

m_dr_2 <- deepregression(y = y,
                         list_of_formulas = list(
                           location = ~ 0 + x,
                           scale = ~ 1),
                         data = d)
coef(m_dr_2)
# $x
# [,1]
# [1,] -0.47599542
# [2,] -0.73255873
# [3,] -0.62333083
# [4,] -0.40610617
# [5,] -0.43092820
# [6,] -0.54474354
# [7,]  0.67287409
# [8,]  0.73756588
# [9,]  0.09474695

m_dr_3 <- deepregression(y = y,
                         list_of_formulas = list(
                           location = ~ -1 + x,
                           scale = ~ 1),
                         data = d)
coef(m_dr_3)
# $x
# [,1]
# [1,] -0.47599542
# [2,] -0.73255873
# [3,] -0.62333083
# [4,] -0.40610617
# [5,] -0.43092820
# [6,] -0.54474354
# [7,]  0.67287409
# [8,]  0.73756588
# [9,]  0.09474695
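For reference, the behaviour expected here is what base R does with the same formulas. The following is a minimal sketch using only `model.matrix()` and `lm()` from the stats package; it does not touch deepregression and says nothing about how deepregression builds its design matrices internally.

```r
# Base R reference: dummy coding with and without an intercept.
set.seed(42)

n <- 100
y <- rnorm(n)
x <- gl(10, 10)            # same factor as above: 10 levels, 10 observations each
d <- data.frame(x)

X_with_int <- model.matrix(~ x, data = d)      # intercept + 9 treatment contrasts
X_no_int   <- model.matrix(~ 0 + x, data = d)  # one indicator column per level

colnames(X_with_int)   # "(Intercept)", "x2", ..., "x10"
colnames(X_no_int)     # "x1", "x2", ..., "x10"

# Without an intercept, lm() returns one coefficient per level,
# and each coefficient is the mean of y within that level.
fit <- lm(y ~ 0 + x, data = d)
length(coef(fit))      # 10
all.equal(as.vector(coef(fit)), as.vector(tapply(y, x, mean)))
```

So under the usual R formula semantics, `~ 0 + x` and `~ -1 + x` both yield 10 columns and 10 coefficients for a 10-level factor.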

Setting check_form = FALSE in formula_options (together with precalculate_gamparts = TRUE) leads to the following errors:

m_dr_4 <- deepregression(y = y,
                         list_of_formulas = list(
                           location = ~ 0 + x,
                           scale = ~ 1),
                         formula_options = list(precalculate_gamparts = TRUE,
                                                check_form = FALSE),
                         data = d)

# Error in py_call_impl(callable, call_args$unnamed, call_args$named) : 
#   ValueError: The name "input__Intercept__2" is used 2 times in the model. All layer names should be unique.

m_dr_5 <- deepregression(y = y,
                         list_of_formulas = list(
                           location = ~ -1 + x,
                           scale = ~ 1),
                         formula_options = list(precalculate_gamparts = TRUE,
                                                check_form = FALSE),
                         data = d)

# Error in py_call_impl(callable, call_args$unnamed, call_args$named) : 
#   ValueError: '-1_1/' is not a valid root scope name. A root scope name has to match the following pattern: ^[A-Za-z0-9.][A-Za-z0-9_.\\/>-]*$
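The second error can be checked against the pattern quoted in the message. The sketch below copies that regular expression into R (the doubled backslash in the message is an escaped single backslash) and tests the rejected name. `is_valid_scope` is a hypothetical helper written for this illustration, not part of deepregression or TensorFlow, and `"x_1/"` is a made-up name for comparison. That the literal `-1` term ends up in the scope name is an inference from the error message, not something confirmed in the package code.

```r
# Pattern taken from the error message; "\\\\" in the R string is one
# escaped backslash in the regular expression.
scope_pattern <- "^[A-Za-z0-9.][A-Za-z0-9_.\\\\/>-]*$"

# Hypothetical helper: TRUE if `name` matches the root scope name pattern.
is_valid_scope <- function(name) grepl(scope_pattern, name, perl = TRUE)

is_valid_scope("-1_1/")   # FALSE: "-" is not allowed as the first character
is_valid_scope("x_1/")    # TRUE: made-up name that starts with a letter
```

The name fails only because of its leading `-`, which is consistent with the `-1` from the formula being used verbatim as a name.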

The problems seem to be caused by the formula processing in deepregression:::process_terms and deepregression:::separate_define_relation, but I didn't feel comfortable editing the code in a way that respects the current structure, so I didn't create a pull request.

For completeness, this is the corresponding sessionInfo():

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8    LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] deepregression_2.0.0 keras_2.13.0         tfprobability_0.15.1 tensorflow_2.14.0   

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3    utf8_1.2.4        generics_0.1.3    lattice_0.21-9    hms_1.1.3         magrittr_2.0.3    grid_4.3.2        rprojroot_2.0.4  
 [9] jsonlite_1.8.7    Matrix_1.6-2      processx_3.8.2    progress_1.2.2    whisker_0.4.1     torch_0.11.0      ps_1.7.5          tfruns_1.5.1     
[17] mgcv_1.9-0        fansi_1.0.5       coro_1.0.3        cli_3.6.1         rlang_1.1.2       crayon_1.5.2      bit64_4.0.5       splines_4.3.2    
[25] withr_2.5.2       base64enc_0.1-3   luz_0.4.0         tools_4.3.2       dplyr_1.1.3       zeallot_0.1.0     here_1.0.1        reticulate_1.34.0
[33] vctrs_0.6.4       R6_2.5.1          png_0.1-8         lifecycle_1.0.4   fs_1.6.3          bit_4.0.5         pkgconfig_2.0.3   callr_3.7.3      
[41] pillar_1.9.0      torchvision_0.5.1 glue_1.6.2        Rcpp_1.0.11       tibble_3.2.1      tidyselect_1.2.0  rstudioapi_0.15.0 nlme_3.1-163 

davidruegamer commented 9 months ago

Thanks for pointing this out! I'm still trying to remember why we did it like this, because we had a similar issue in the past with categorical features and no intercept included. Maybe it's related to the case where there is a bias term in the unstructured part / deep network, so that estimating the effects of all categories would cause an identifiability issue. Will think about it a bit more...
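The identifiability concern can be illustrated in base R. This is a minimal sketch in which a constant column stands in for a bias term; it is an assumption made for illustration and not a description of how deepregression combines the structured and deep parts.

```r
x <- gl(10, 10)

X_full <- model.matrix(~ 0 + x)      # 10 indicator columns, one per level
X_bias <- cbind(bias = 1, X_full)    # constant column standing in for a bias term

ncol(X_bias)       # 11 columns
qr(X_bias)$rank    # rank 10: the indicator columns sum to the constant column
```

Because the 10 indicator columns add up to the constant column, the 11 parameters are not jointly identifiable: adding a value to the bias and subtracting it from every level effect gives the same fit.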