Closed ShreeshaM07 closed 5 months ago
I have thought of 2 possible ways to pass the `offset` and `exposure` parameters while predicting: Design 1 and Design 2.

Design 1

The idea behind this is that I initialize the object of the class with boolean values for whether `offset` and `exposure` are going to be passed while predicting. Since we cannot have `offset` and `exposure` arguments in the `_predict` function, the user must pass additional columns named `offset` and `exposure`. These are removed/dropped while fitting and in `predict_proba`, but are converted to arrays and passed to the `statsmodels` `predict`, thereby keeping the number of columns and rows consistent.
Pros:
- The `offset` and `exposure` params can be changed even after fitting the data.
- The user directly supplies the `offset`/`exposure` values.

Cons:
- The user has to pass columns with the exact same names `offset` and `exposure` each time in the `exog`/`X` variable, even when not using `offset`/`exposure`.
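To make Design 1 concrete, the column split could look roughly like the sketch below. This is only an illustration of the idea, not the actual implementation; `split_offset_exposure` and its boolean flags are hypothetical names.

```python
import pandas as pd

def split_offset_exposure(X, has_offset=True, has_exposure=True):
    """Sketch of Design 1: offset/exposure travel as extra columns of X
    and are split off before the data reaches fit/predict."""
    X = X.copy()
    # pop removes the column and returns it as a Series
    offset = X.pop("offset").to_numpy() if has_offset else None
    exposure = X.pop("exposure").to_numpy() if has_exposure else None
    return X, offset, exposure

X = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0],
    "offset": [0.1, 0.2, 0.3],
    "exposure": [10.0, 20.0, 30.0],
})
X_clean, offset, exposure = split_offset_exposure(X)
# X_clean keeps only the model columns; offset/exposure would then be
# forwarded to statsmodels' predict(..., offset=..., exposure=...) as arrays.
```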
Design 2

The idea here is that I pass the `offset`/`exposure` array while initializing the object itself. It must have the same length as the number of rows in the `X` that will be passed while predicting.
Pros:
- No extra columns are needed in `X` while fitting.

Cons:
- The length of the `X` passed to `predict` must be known while initializing/constructing the object itself.
- If `X` needs to be changed, the `offset`/`exposure` must be changed too, for which the object will need to be re-initialized.
- Re-fitting the model will need to happen each time the `X` size changes for `predict`.
- The `X` used for predicting is often not known beforehand.

Both of the designs above are not foolproof, but they both give correct answers. Since we cannot add a test setting for both of these, I am not sure I can think of a better way to do it where it can also be added to the test setting.
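To make the constructor-coupling con of Design 2 concrete, here is a toy sketch. The class name, the length check, and the placeholder prediction are all illustrative assumptions, not actual skpro code.

```python
import numpy as np
import pandas as pd

class Design2GLM:
    """Toy illustration of Design 2: offset/exposure are fixed at
    construction time, so their length must match any X passed later."""

    def __init__(self, offset=None, exposure=None):
        self.offset = None if offset is None else np.asarray(offset)
        self.exposure = None if exposure is None else np.asarray(exposure)

    def predict(self, X):
        # the constructor-time arrays must align with the rows of X,
        # which is exactly the con discussed above
        for name, arr in (("offset", self.offset), ("exposure", self.exposure)):
            if arr is not None and len(arr) != len(X):
                raise ValueError(f"{name} length must match the rows of X")
        # a fitted statsmodels model would be called here with
        # offset=self.offset, exposure=self.exposure
        return np.zeros(len(X))  # placeholder prediction

est = Design2GLM(offset=[0.1, 0.2])
est.predict(pd.DataFrame({"x1": [1.0, 2.0]}))  # lengths match: OK
```

Changing the number of prediction rows forces re-constructing (and hence re-fitting) the estimator, which is the main drawback.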
A strong design principle in `sklearn`-like designs is separating data from model specification, because only that allows reliable model composition.
Since Option 2 violates that principle (offset and exposure are part of the data), as your cons imply, I have a very strong preference for Option 1.
I would vary the idea a little by adding parameters `exposure_var` and `offset_var`, which are `pandas` index elements or `int`. If `int`, it is assumed to be `iloc`; otherwise, `loc`. There are sensible defaults, I would say `None`, which means no exposure/offset is passed (and I believe `statsmodels` assumes constant exposure and zero offset then).
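If I read the proposal right, the column lookup could be sketched as follows. `resolve_var` is a hypothetical helper to show the `None`/`int`/label semantics; the real parameter handling in skpro may differ.

```python
import pandas as pd

def resolve_var(X, var):
    """Sketch of the proposed exposure_var/offset_var semantics:
    None -> nothing passed; int -> positional (iloc); else -> label (loc)."""
    if var is None:
        return X, None
    col = X.columns[var] if isinstance(var, int) else var
    values = X[col].to_numpy()
    # the resolved column is removed so it does not enter the design matrix
    return X.drop(columns=[col]), values

X = pd.DataFrame({"x1": [1.0, 2.0], "my_exposure": [5.0, 6.0]})
X_rest, exposure = resolve_var(X, "my_exposure")  # by label (loc)
X_rest2, exposure2 = resolve_var(X, 1)            # by position (iloc)
X_all, nothing = resolve_var(X, None)             # no exposure passed
```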
Re testing, this will require a separate test added to a GLM-specific test module.
> I would vary the idea a little by adding parameters `exposure_var` and `offset_var`, which are `pandas` index elements or `int`. If `int`, it is assumed to be `iloc`; otherwise, `loc`. There are sensible defaults, I would say `None`, which means no exposure/offset is passed (and I believe `statsmodels` assumes constant exposure and zero offset then).
I am a little unsure what this means. Do you mean to add these 2 parameters `exposure_var` and `offset_var` along with the already present `exposure` and `offset` bool parameters, as I have done in Design 1?
If we are adding them alongside these, then I think there would be no need for the extra columns in `X` that the user has to pass while fitting, predicting, etc. This would again mean that we are not separating the model specification from the data.
Could you please elaborate the idea a little as to what exactly `exposure_var` and `offset_var` will contain?
> I am a little unsure what this means. Do you mean to add these 2 parameters `exposure_var` and `offset_var` along with the already present `exposure` and `offset` bool parameters, as I have done in Design 1?
I was suggesting to replace these with two more informative variables, concretely:

Replace `exposure` by `exposure_var`; its type is a `pandas` index element (e.g., a string). If `exposure_var = None`, it behaves like your `exposure = False`. If `exposure_var = "exposure"`, it behaves like your `exposure = True`. And `exposure_var = "sth_else"` can be used to point to another variable.
> Could you please elaborate the idea a little as to what exactly `exposure_var` and `offset_var` will contain.
The type would be a single `pandas` index element, for instance `str`. If `str`, then it would point to the column via `loc`. I also suggest allowing integers, interpreted via `iloc` rather than `loc`.
> This would again mean that we are not separating the model specification from the data.
I think it is fine to pass data schema references to the specification, that is different from the data itself (i.e., the entries of the data frame).
> Replace `exposure` by `exposure_var`; its type is a `pandas` index element (e.g., a string). If `exposure_var = None`, it behaves like your `exposure = False`. If `exposure_var = "exposure"`, it behaves like your `exposure = True`. And `exposure_var = "sth_else"` can be used to point to another variable.
Yeah, this makes more sense. I have completed implementing it that way and have also re-ordered the new params to the end. Next we will have to work on adding a test setting for it.
great! btw, if you want to move the position of the parameters later, we should follow the "move parameter position" recipe - we can make the change right away.
I've made the changes based on the review please let me know if anything else needs modification.
Reference Issues/PRs

Fixes #383 and closes #230.

What does this implement/fix? Explain your changes.

This creates an adapter converting the `statsmodels` GLM families to their `skpro` equivalents, giving `GLM`s a broader range of distributions and link functions.

Does your contribution introduce a new dependency? If yes, which one?

No.

Did you add any tests for the change?

Yes.
PR checklist

For all contributions
- I've added myself to the list of contributors in the `skpro` root directory (not the `CONTRIBUTORS.md`). Common badges: `code` - fixing a bug, or adding code logic; `doc` - writing or improving documentation or docstrings; `bug` - reporting or diagnosing a bug (get this plus `code` if you also fixed the bug in the PR); `maintenance` - CI, test framework, release. See here for the full badge reference.

For new estimators
- I've added the estimator to the API reference in `docs/source/api_reference/taskname.rst`, following the pattern.
- The docstring has an `Examples` section.
- I've set the `python_dependencies` tag and ensured dependency isolation, see the estimator dependencies guide.