[ENH] `GLM` with multiple `distributions` and `link` function support

ShreeshaM07 commented 6 months ago

Reference Issues/PRs

fixes #383 and closes #230

What does this implement/fix? Explain your changes.

This creates an adapter converting the statsmodels GLM families to skpro equivalent giving GLMs a broader capability of distributions and link functions.

Does your contribution introduce a new dependency? If yes, which one?

No

Did you add any tests for the change?

Yes

PR checklist

For all contributions

[ ] I've added myself to the list of contributors with any new badges I've earned :-) How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release. See here for full badge reference
[x] The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.

For new estimators

[ ] I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
[ ] I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
[ ] If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured dependency isolation, see the estimator dependencies guide.

ShreeshaM07 commented 5 months ago

I have thought of 2 possible ways to pass the offset and exposure parameters while predicting. Design 1 and Design2 .

Design 1

Idea behind this is that I initialize the object of the class with whether offset and exposure is going to be passed while predicting and will contain boolean values. Since we cannot have an argument offset and exposure in the _predict function user must pass additional columns with names offset and exposure which will be removed/dropped while fitting and predict_proba but will be converted to array and passed to statsmodels predict thereby keeping consistency in the number of columns and rows. Pros

It will allow user to keep changing the offset and exposure params even after fitting data.
User doesn't have to keep re initializing the model in case they want to change the offset/exposure values.

Cons

User will have to create 2 extra columns with the exact same name of offset and exposure each time in the exog/X variable.
No way to add a test setting to check this as it cannot add extra columns with the name offset/exposure.

Design 2

Idea here is that I pass the offset/exposure array while initializing the object itself. It must have the same length as the number of rows in the X that will be passed while predicting. Pros

Simple code :)
User doesn't have to struggle having to modify the X while fitting.

Cons

shape of X passed to predict must be known while initializing/constructor the object itself.
If the size of prediction X needs to be changed the offset/exposure must be changed for which it will need to be re-initialized again.
Re-fitting the model will need to happen each time X size changes for predict.
Even for this method there is no way to add a test setting as size of X used for predicting is not known prior.

My Conclusion

Both the designs above are not a foolproof way but they both give correct answers. Since we cannot add a test setting for both these I am not sure I can think of a better way to do it where it can be added to the test setting too.

fkiraly commented 5 months ago

A strong design principle in sklearn-like designs is separating data from model specification, because only that allows reliable model composition.

As Option 2 violates that principle (offset and exposure are part of the data), as your cons imply, I would have a very strong preference towards option 1.

I would vary the idea a little, by adding parameters exposure_var and offset_var, which are pandas index elements or int. If int, it is assumed iloc; otherwise loc. There are sensible defaults, I would say None, which means no exposure/offset is passed (and I believe statsmodels assumes constant exposure and zero offset then).

Re testing, this will require a separate test added to a glm speific test module.

ShreeshaM07 commented 5 months ago

I would vary the idea a little, by adding parameters exposure_var and offset_var, which are pandas index elements or int. If int, it is assumed iloc; otherwise loc. There are sensible defaults, I would say None, which means no exposure/offset is passed (and I believe statsmodels assumes constant exposure and zero offset then).

I am little unsure on what this means. Do you mean to add these 2 parameters exposure_var and offset_var along with the already present exposure and offset bool parameters as I have done in Design1. If we are adding along with these then I think there would be no need to have the extra columns in X that the user has to pass while fitting,predicting etc. This would again mean that we are not separating the model specification from the data.

Could you please elaborate the idea a little as to what exactly the exposure_var and offset_var will contain.

fkiraly commented 5 months ago

I am little unsure on what this means. Do you mean to add these 2 parameters exposure_var and offset_var along with the already present exposure and offset bool parameters as I have done in

I was suggesting to replace these with two more informative variables, concretely:

Replace exposure by exposure_var, type is pandas index element (e.g., string). If exposure_var = None, it behaves like your exposure = False. If exposure_var = "exposure", it behaves like your exposure = True. And exposure_var = "sth_else" can be used for pointing to another variable.

Could you please elaborate the idea a little as to what exactly the exposure_var and offset_var will contain.

The type would be a single pandas index element. For instance, str. If str, then it would point ot the column loc. I also suggest to use integers iloc rather than loc.

This would again mean that we are not separating the model specification from the data.

I think it is fine to pass data schema references to the specification, that is different from the data itself (i.e., the entries of the data frame).

ShreeshaM07 commented 5 months ago

Replace exposure by exposure_var, type is pandas index element (e.g., string). If exposure_var = None, it behaves like your exposure = False. If exposure_var = "exposure", it behaves like your exposure = True. And exposure_var = "sth_else" can be used for pointing to another variable.

Yea this makes more sense and I have completed implementing it that way and also re ordered the new params to the end. Next we will have to work on adding test setting for it.

fkiraly commented 5 months ago

great! btw, if you want to move the position of the parameters later, we should follow the "move parameter position" recipe - we can make the change right away.

ShreeshaM07 commented 5 months ago

I've made the changes based on the review please let me know if anything else needs modification.

sktime / skpro