tidymodels / multilevelmod

Parsnip wrappers for mixed-level and hierarchical models
https://multilevelmod.tidymodels.org/

New model request: gpboost Tree-Boosting with Gaussian Process and Mixed Effects Models #47

Open schelhorn opened 1 year ago

schelhorn commented 1 year ago

The gpboost package on CRAN by @fabsig describes itself as follows:

Combining Tree-Boosting with Gaussian Process and Mixed Effects Models An R package that allows for combining tree-boosting with Gaussian process and mixed effects models. It also allows for independently doing tree-boosting as well as inference and prediction for Gaussian process and mixed effects models. See https://github.com/fabsig/GPBoost for more information on the software and Sigrist (2020) <arXiv:2004.02653> and Sigrist (2021) <arXiv:2105.08966> for more information on the methodology.

I suggest it would make a nice extension to {multilevelmod} given its ability to model non-linear relationships and to handle high-cardinality categorical data well.

From the paper abstract of the approach:

We introduce a novel way to combine boosting with Gaussian process and mixed effects models. This allows for relaxing, first, the zero or linearity assumption for the prior mean function in Gaussian process and grouped random effects models in a flexible non-parametric way and, second, the independence assumption made in most boosting algorithms. The former is advantageous for prediction accuracy and for avoiding model misspecifications. The latter is important for efficient learning of the fixed effects predictor function and for obtaining probabilistic predictions. Our proposed algorithm is also a novel solution for handling high-cardinality categorical variables in tree-boosting. In addition, we present an extension that scales to large data using a Vecchia approximation for the Gaussian process model relying on novel results for covariance parameter inference. We obtain increased prediction accuracy compared to existing approaches on multiple simulated and real-world data sets.

And the main text of the paper:

In summary, both the linearity assumption in Gaussian process models and the independence assumption in boosting are often questionable. The goal of this article is to relax these restrictive assumptions by combining boosting with Gaussian process and mixed effects models. Specifically, we propose to model the mean function using an ensemble of base learners, such as regression trees (Breiman et al., 1984), learned in a stage-wise manner using boosting, and the second-order structure is modeled using a Gaussian process or mixed effects model. In doing so, the parameters of the covariance function are estimated jointly with the mean function; see Section 2 for more details.
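In symbols, the combined model described above can be sketched roughly as follows for the grouped random effects case (my paraphrase of the paper's setup, not a quote):

```latex
y = F(X) + Zb + \epsilon, \quad b \sim \mathcal{N}(0, \Sigma(\theta)), \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)
```

where $F$ is the tree ensemble learned by boosting (the mean function), $Z b$ are the grouped random effects or Gaussian process, and the covariance parameters $\theta$ are estimated jointly with $F$.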

The paper is very well written and the package is actively developed on GitHub, with the last commit from two months ago. Multiple usage examples are linked here, the most comprehensive being this one. Model hyperparameters are explained here.

From the documentation, I believe it can work with the following response types: regression, regression_l1, huber, binary, lambdarank, multiclass.

fabsig commented 1 year ago

@schelhorn: many thanks for this suggestion!

Just a small clarification: currently, GPBoost supports the following response distributions: gaussian, bernoulli_probit (= binary), bernoulli_logit, poisson, gamma; see here for a list of currently supported likelihoods.
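For illustration, here is a minimal sketch of what fitting one of these models looks like directly in R. The data are simulated, and the argument names follow my reading of the package documentation, so treat this as an assumption-laden sketch rather than a tested recipe:

```r
library(gpboost)

# Simulated grouped data: a non-linear fixed effect plus a group random effect
set.seed(1)
n <- 500
X <- matrix(runif(2 * n), ncol = 2)
group <- sample(1:25, n, replace = TRUE)  # grouping factor for random effects
y <- sin(4 * X[, 1]) + rnorm(25)[group] + rnorm(n, sd = 0.1)

# Random effects model for the second-order structure
# (likelihood = one of the supported response distributions listed above)
gp_model <- GPModel(group_data = group, likelihood = "gaussian")

# Tree-boosting for the mean function; covariance parameters are
# estimated jointly during training
bst <- gpboost(data = X, label = y, gp_model = gp_model,
               nrounds = 50, learning_rate = 0.05, verbose = 0)

# Prediction needs the grouping data for the new observations
pred <- predict(bst, data = X, group_data_pred = group)
```

A parsnip wrapper would presumably expose the boosting hyperparameters (`nrounds`, `learning_rate`, tree depth) as tunable arguments and pass the random effects formula through to `GPModel()`.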

hfrick commented 11 months ago

Thank you for the detailed issue with the references 🙌 It's sitting here until the next round of triaging/implementing new models but it hasn't fallen off the radar.

tdemarchin commented 7 months ago

Hi, upvoting this as I would be very interested to have GPBoost included in the tidymodels family of models.