mlr-org / mlr3extralearners

Extra learners for use in mlr3.
https://mlr3extralearners.mlr-org.com/

[LRNRQ] Add gam from package mgcv #40

Closed. vviers closed this issue 3 years ago.

vviers commented 3 years ago

Algorithm

Generalised Additive Model

Package

mgcv

Supported types

classif, regr, surv
I have checked that this is not already implemented in

Comments

We have started to implement this learner for our own use here: Signaux Faibles - mlr3extralearners - feat/gam_learner and were wondering if you would be interested in getting a PR from us? We know that classif.gamboost essentially fits a GAM too (right?), but we really like the mgcv implementation.

RaphaelS1 commented 3 years ago

We have started to implement this learner for our own use here: Signaux Faibles - mlr3extralearners - feat/gam_learner and were wondering if you would be interested in getting a PR from us?

Hey, just took a quick look at your fork, looks like you've already done the hard work so definitely interested in adding your learner!

We know that classif.gamboost essentially fits a GAM too (right?), but we really like the mgcv implementation.

Well, classif.gamboost is probably the boosted version of what you want to implement. But it is definitely a good idea to have plain GAMs as well, so this is a great addition.

You marked that the learner is available for regr and surv as well, would you be interested in implementing these two? There's no requirement to do so but it's always nice to have the 'complete set' when possible.

If you do want to go ahead and open a PR, please just make sure you've also got the other required files (unit tests, parameter tests, yml, ...) and that you're happy to continue maintaining the learner going forward.

pierrecamilleri commented 3 years ago

Hi, thank you for your hints, I will soon work on this, and would be glad to maintain the learner afterward.

I have a question concerning the implementation: mgcv::gam is mainly parametrized through a formula, which controls the smoothers used for each individual feature, possible interactions, etc.
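For readers less familiar with mgcv, here is a minimal, self-contained illustration of what the formula controls (this is not code from the thread; the simulated data and variable names are arbitrary):

```r
library(mgcv)

# simulated toy data: one non-linear effect, one linear effect
set.seed(1)
dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- sin(2 * pi * dat$x1) + 0.5 * dat$x2 + rnorm(200, sd = 0.2)

# the formula decides which terms are smoothed, with which basis and
# basis dimension, and which terms enter linearly
fit <- mgcv::gam(y ~ s(x1, bs = "cr", k = 10) + x2, data = dat)
summary(fit)
```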

What would be your recommendation for passing this information to mlr3? Is it OK to pass the formula as a ParamUty to the learner? The problem is that the formula mixes information about the task (feature and target names) with information about the learner (smoothing parameters). I checked other learners and haven't found any that use a formula (when available, feature matrices are used instead).

I have seen that there is a formula() method on a task, but it seems to do little more than reconstruct a formula from a list of features, so I don't think it is of any help here. It can even be confusing if the learner uses a different formula than the default one given by task$formula().

I have also found this open issue (which even links to an mlr issue about mgcv::gam), in which a limitation of including the formula in the learner is mentioned: a benchmark with one or multiple learners and multiple tasks would fail because the data set in one task might not be compatible with the formula specified in the learner. Concerns are also raised about tuning the formula.

However, the final thoughts seem to point towards integrating the formula as a parameter of the learner. Can I do it this way?
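To make the question concrete, the user-facing interface being asked about could look roughly like the sketch below. The learner key "classif.gam" and a ParamUty parameter named formula are assumptions at this point in the thread, not an agreed design; the task and column names come from the sonar task shipped with mlr3.

```r
library(mlr3)
library(mlr3extralearners)  # assumed to provide the GAM learner from the fork

task <- tsk("sonar")  # binary classification task shipped with mlr3

# hypothetical: pass the mgcv formula directly as a ParamUty value
learner <- lrn("classif.gam")
learner$param_set$values$formula <- Class ~ s(V1) + s(V2) + V3

learner$train(task)
learner$predict(task)$score()
```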

RaphaelS1 commented 3 years ago

Hi @JazzyPierrot, this is indeed a discussion we've been having internally, though I note that open issue is 1.5 years old with no discussion.

Pinging @adibender and @berndbischl here as we discussed this recently and I want to highlight this learner as another good example for when a separated 'formula' interface may be useful/required in a learner.

For now I can suggest one of the following options:

  1. As suggested by you, use ParamUty for formula and give this the required property
  2. Use the usual mlr3 interface with a task and then add parameters to the learner for smoothing, e.g. for the smoothing function s() have a ParamUty that contains the names of the variables to wrap in s(); internally, say you've named the parameter set ps, that would be mgcv::s(ps$values$s), added to the formula in .train() (see the sketch below).
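A rough sketch of the formula construction that option 2 could perform inside .train(); the helper name, the s_vars argument, and the example task are illustrative assumptions, not part of any existing learner:

```r
library(mlr3)

# hypothetical helper: build an mgcv formula from a task plus a character
# vector of feature names that should be wrapped in s()
build_gam_formula <- function(task, s_vars = character()) {
  feats <- task$feature_names
  rhs <- ifelse(feats %in% s_vars, sprintf("s(%s)", feats), feats)
  stats::reformulate(rhs, response = task$target_names)
}

# e.g. on a regression task, smoothing only two selected features;
# a learner's private .train() could build the formula this way and then
# pass it on to mgcv::gam() together with task$data()
task <- tsk("mtcars")
build_gam_formula(task, s_vars = c("hp", "wt"))
```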
berndbischl commented 3 years ago

Pinging @adibender and @berndbischl here as we discussed this recently and I want to highlight this learner as another good example for when a separated 'formula' interface may be useful/required in a learner.

Thx. If we want this, we need to discuss it in the engineering call with @mllg.

(I meant: the formula interface in general.)

pierrecamilleri commented 3 years ago

Well noted, thank you for your feedback! And good luck with the formula interface.

adibender commented 3 years ago

I think if one cannot specify the model formula directly, this would be a major hindrance to usability. Imagine you want to specify y ~ s(x1, k = 20, bs = "ad") + s(x2, bs = "re") + te(x3, x4, k = 10) + x5 + x6 + x5*x6 + s(x7, by = x8) ... and this is just the tip of the iceberg. There are so many formula specials and options for specifying a model in mgcv that building this via params and pasting it together internally would create more overhead than necessary, and it is impossible to anticipate all possible specifications. For the user, specifying the model would become majorly annoying. In practice, I usually don't want to tune the parameters of a GAM; I am mostly interested in whether my specification of the model (or a couple of them) performs better than other algos. I am not interested in tuning it, outside of maybe method, optimizer, gamma, etc., which are all parameters outside of the formula.
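For comparison, fitting such a specification directly with mgcv is a one-liner, which is the usability gap being described here. The snippet below is illustrative only; gamSim() is mgcv's built-in data simulator and the terms are simpler than the example above.

```r
library(mgcv)

# simulate example data with mgcv's built-in generator
dat <- gamSim(1, n = 400, verbose = FALSE)

# the formula alone carries the model specification: smooths, tensor
# products, basis choices, plain linear terms, ...
fit <- gam(y ~ s(x0, bs = "cr") + te(x1, x2) + x3, data = dat)
summary(fit)
```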

RaphaelS1 commented 3 years ago

Classif and regr closed in #61. Surv in #63