sktime / skpro

A unified framework for tabular probabilistic regression, time-to-event prediction, and probability distributions in python
https://skpro.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License
250 stars 46 forks source link

[ENH] `GLM` with multiple `distributions` and `link` function support #384

Closed ShreeshaM07 closed 5 months ago

ShreeshaM07 commented 6 months ago

Reference Issues/PRs

fixes #383 and closes #230

What does this implement/fix? Explain your changes.

This creates an adapter converting the statsmodels GLM families to skpro equivalent giving GLMs a broader capability of distributions and link functions.

Does your contribution introduce a new dependency? If yes, which one?

No

Did you add any tests for the change?

Yes

PR checklist

For all contributions
For new estimators
ShreeshaM07 commented 5 months ago

I have thought of 2 possible ways to pass the offset and exposure parameters while predicting. Design 1 and Design2 .

Design 1

Idea behind this is that I initialize the object of the class with whether offset and exposure is going to be passed while predicting and will contain boolean values. Since we cannot have an argument offset and exposure in the _predict function user must pass additional columns with names offset and exposure which will be removed/dropped while fitting and predict_proba but will be converted to array and passed to statsmodels predict thereby keeping consistency in the number of columns and rows. Pros

Cons

Design 2

Idea here is that I pass the offset/exposure array while initializing the object itself. It must have the same length as the number of rows in the X that will be passed while predicting. Pros

Cons

My Conclusion

Both the designs above are not a foolproof way but they both give correct answers. Since we cannot add a test setting for both these I am not sure I can think of a better way to do it where it can be added to the test setting too.

fkiraly commented 5 months ago

A strong design principle in sklearn-like designs is separating data from model specification, because only that allows reliable model composition.

As Option 2 violates that principle (offset and exposure are part of the data), as your cons imply, I would have a very strong preference towards option 1.

I would vary the idea a little, by adding parameters exposure_var and offset_var, which are pandas index elements or int. If int, it is assumed iloc; otherwise loc. There are sensible defaults, I would say None, which means no exposure/offset is passed (and I believe statsmodels assumes constant exposure and zero offset then).

Re testing, this will require a separate test added to a glm speific test module.

ShreeshaM07 commented 5 months ago

I would vary the idea a little, by adding parameters exposure_var and offset_var, which are pandas index elements or int. If int, it is assumed iloc; otherwise loc. There are sensible defaults, I would say None, which means no exposure/offset is passed (and I believe statsmodels assumes constant exposure and zero offset then).

I am little unsure on what this means. Do you mean to add these 2 parameters exposure_var and offset_var along with the already present exposure and offset bool parameters as I have done in Design1. If we are adding along with these then I think there would be no need to have the extra columns in X that the user has to pass while fitting,predicting etc. This would again mean that we are not separating the model specification from the data.

Could you please elaborate the idea a little as to what exactly the exposure_var and offset_var will contain.

fkiraly commented 5 months ago

I am little unsure on what this means. Do you mean to add these 2 parameters exposure_var and offset_var along with the already present exposure and offset bool parameters as I have done in

I was suggesting to replace these with two more informative variables, concretely:

Replace exposure by exposure_var, type is pandas index element (e.g., string). If exposure_var = None, it behaves like your exposure = False. If exposure_var = "exposure", it behaves like your exposure = True. And exposure_var = "sth_else" can be used for pointing to another variable.

Could you please elaborate the idea a little as to what exactly the exposure_var and offset_var will contain.

The type would be a single pandas index element. For instance, str. If str, then it would point ot the column loc. I also suggest to use integers iloc rather than loc.

This would again mean that we are not separating the model specification from the data.

I think it is fine to pass data schema references to the specification, that is different from the data itself (i.e., the entries of the data frame).

ShreeshaM07 commented 5 months ago

Replace exposure by exposure_var, type is pandas index element (e.g., string). If exposure_var = None, it behaves like your exposure = False. If exposure_var = "exposure", it behaves like your exposure = True. And exposure_var = "sth_else" can be used for pointing to another variable.

Yea this makes more sense and I have completed implementing it that way and also re ordered the new params to the end. Next we will have to work on adding test setting for it.

fkiraly commented 5 months ago

great! btw, if you want to move the position of the parameters later, we should follow the "move parameter position" recipe - we can make the change right away.

ShreeshaM07 commented 5 months ago

I've made the changes based on the review please let me know if anything else needs modification.