Open fkiraly opened 2 months ago
Some points regarding the same:
- `return_std` is not available in the `predict` method of sklearn's `TweedieRegressor`, so we may not be able to find the value of scale in cases where the underlying distribution requires it, e.g. Normal.
- `GLMRegressor`: why can we not just interface the `Tweedie` distribution and then add it to the `family` parameter of `GLMRegressor`? Not really sure where we can interface the distribution from, though.
- A doubt regarding the `TweedieRegressor`: is it not just an interface to possible regressors for different families, e.g. Poisson, Gaussian, Gamma? If so, is there any difference in implementing the `TweedieRegressor` if it is just going to expose these different regressors?
To answer these:
- We have a `GLMRegressor`, that interfaces the GLM from `statsmodels`. The sklearn `TweedieRegressor` is a completely different object. Of course it would be nice to add support for the Tweedie in `statsmodels`; that is a different, useful issue, and may meet the use case of @fsafaro1.
- This `scipy` issue discusses the Tweedie distribution: https://github.com/scipy/scipy/issues/11291#issuecomment-1868256070 and concludes that the `scipy` interface is not general enough because the distribution is of mixed type. `skpro` is general enough, so with the pointers in there we could implement it, either entirely from scratch, or interfacing some of the component functions such as Bessel.

> is it not just an interface to possible regressors for different families for ex Poisson,Gaussian,Gamma

Yes, but for non-integer `p` parameter these are very specific families that are also not available yet. It is a good question whether the distribution should internally decompose into these case distinctions.
> this scipy issue discusses the Tweedie distribution: https://github.com/scipy/scipy/issues/11291#issuecomment-1868256070 and concludes that the scipy interface is not general enough because it is mixed type. skpro is general enough, so with the pointers in there we could implement it, either entirely from scratch, or interfacing some of the component functions such as Bessel.

From the conversation I can infer that we can implement this in `skpro`, as it allows for mixed-type distributions with `pdf` and `pmf` in different intervals. https://lorentzen.ch/index.php/2024/06/17/a-tweedie-trilogy-part-iii-from-wrights-generalized-bessel-function-to-tweedies-compound-poisson-distribution/ seems to be a very informative post explaining the Tweedie distribution. It also gives a code snippet for the point mass at zero and the pdf of the compound Poisson-gamma distribution.

    import numpy as np
    from scipy.special import wright_bessel


    def cpg_pmf(mu, phi, p):
        """Compound Poisson Gamma point mass at zero."""
        return np.exp(-np.power(mu, 2 - p) / (phi * (2 - p)))


    def cpg_pdf(x, mu, phi, p):
        """Compound Poisson Gamma pdf."""
        if not (1 < p < 2):
            raise ValueError("1 < p < 2 required")
        # canonical parameter and cumulant of the Tweedie family
        theta = np.power(mu, 1 - p) / (1 - p)
        kappa = np.power(mu, 2 - p) / (2 - p)
        # alpha < 0 for 1 < p < 2, so -alpha > 0 is passed to wright_bessel
        alpha = (2 - p) / (1 - p)
        t = ((p - 1) * phi / x)**alpha
        t /= (2 - p) * phi
        a = 1 / x * wright_bessel(-alpha, 0, t)
        return a * np.exp((x * theta - kappa) / phi)

This can be used together with the `wright_bessel` function from `scipy.special`.
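As a cross-check that does not need `wright_bessel`, the same density can be sketched directly from the compound Poisson-gamma definition, i.e. a Poisson number of i.i.d. gamma summands. This is not the method of the linked post. The function names, the truncation `n_max` and the integration grid are illustrative assumptions, and the parameter mapping is the standard one for 1 < p < 2.

```python
import math


def cpg_params(mu, phi, p):
    """Map Tweedie (mu, phi, p), 1 < p < 2, to Poisson rate and gamma shape/scale."""
    if not (1 < p < 2):
        raise ValueError("1 < p < 2 required")
    lam = mu ** (2 - p) / (phi * (2 - p))   # Poisson rate of the number of summands
    shape = (2 - p) / (p - 1)               # gamma shape of a single summand
    scale = phi * (p - 1) * mu ** (p - 1)   # gamma scale of a single summand
    return lam, shape, scale


def cpg_pdf_series(x, mu, phi, p, n_max=100):
    """Density at x > 0 as a Poisson-weighted mixture of gamma densities."""
    lam, shape, scale = cpg_params(mu, phi, p)
    total = 0.0
    for n in range(1, n_max + 1):
        k = n * shape  # a sum of n i.i.d. gammas is gamma with shape n * shape
        log_pois = -lam + n * math.log(lam) - math.lgamma(n + 1)
        log_gam = ((k - 1) * math.log(x) - x / scale
                   - k * math.log(scale) - math.lgamma(k))
        total += math.exp(log_pois + log_gam)
    return total


def total_mass(mu, phi, p, upper=40.0, steps=4000):
    """Point mass at zero plus a midpoint-rule integral of the density on (0, upper)."""
    lam, _, _ = cpg_params(mu, phi, p)
    h = upper / steps
    integral = h * sum(cpg_pdf_series((i + 0.5) * h, mu, phi, p) for i in range(steps))
    return math.exp(-lam) + integral
```

For moderate parameters such as `mu=1, phi=1, p=1.5`, `total_mass` should come out very close to 1, which is a quick sanity check that the point mass at zero and the continuous part fit together. For `p` close to 2 the density is singular near zero and this simple grid is not adequate.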
For the sklearn Tweedie regressor, the remaining question is still where to get the scale from. It would not be much of a Tweedie regressor if that were impossible to obtain...
I think there is a very roundabout way to do this: pass the x value to `PoissonRegressor` and `GammaRegressor` separately and find the values of `lambda`, `a` and `b`. The mean is what `predict` returns, and the power parameter `p` is fixed, so we can calculate `phi` (the `scale`) using the formula below. Is it not possible that way?
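For reference, a minimal sketch of the standard relations between the Tweedie parameters and the compound Poisson-gamma parameters for 1 < p < 2, which is presumably the kind of formula meant here: `lambda = mu**(2 - p) / (phi * (2 - p))` and `b = phi * (p - 1) * mu**(p - 1)`, with `b` taken as the gamma scale. Solving either for `phi` gives the two hypothetical helpers below. Whether `lambda`, `a` and `b` can actually be estimated from separate Poisson and gamma fits is a separate question that this sketch does not answer.

```python
def phi_from_rate(mu, lam, p):
    """Solve lam = mu**(2 - p) / (phi * (2 - p)) for phi (assumes 1 < p < 2)."""
    if not (1 < p < 2):
        raise ValueError("1 < p < 2 required")
    return mu ** (2 - p) / (lam * (2 - p))


def phi_from_gamma_scale(mu, b, p):
    """Solve b = phi * (p - 1) * mu**(p - 1) for phi (assumes 1 < p < 2)."""
    if not (1 < p < 2):
        raise ValueError("1 < p < 2 required")
    return b / ((p - 1) * mu ** (p - 1))
```

Both helpers should return the same `phi` when `lam` and `b` come from the same compound Poisson-gamma model, which gives a simple consistency check on any estimates plugged in.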
Some thoughts on the Tweedie distribution:
- Call the pdf of `Normal` when `pw=0`, where `pw` is the power parameter, the pmf of Poisson when `pw=1`, the pdf of Gamma when `pw=2`, and the code snippet in the above comment when `pw` is in (1,2).
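The case distinction above can be sketched as a mapping from `(mu, phi, pw)` to the native parameters of each special case, assuming the usual Tweedie convention that the variance is `phi * mu**pw`. The function name, the family labels and the dict keys are illustrative and not skpro API.

```python
import math


def tweedie_special_case(mu, phi, pw):
    """Return (family label, native parameters) for a Tweedie(mu, phi, pw)."""
    if pw == 0:
        # variance phi * mu**0 = phi
        return "Normal", {"mean": mu, "sd": math.sqrt(phi)}
    if pw == 1:
        # variance phi * mu; this is an ordinary Poisson only when phi == 1
        if phi != 1:
            raise NotImplementedError("pw=1 with phi != 1 is not a plain Poisson")
        return "Poisson", {"rate": mu}
    if pw == 2:
        # gamma with mean shape * scale = mu and variance shape * scale**2 = phi * mu**2
        return "Gamma", {"shape": 1.0 / phi, "scale": mu * phi}
    if 1 < pw < 2:
        # handled by the compound Poisson-gamma pmf/pdf snippet
        return "CompoundPoissonGamma", {"mu": mu, "phi": phi, "p": pw}
    raise NotImplementedError(f"power {pw} is not covered in this sketch")
```

For example, `tweedie_special_case(2.0, 0.5, 2)` maps to a gamma with shape 2 and scale 1, whose mean is 2 and whose variance is 2, matching `phi * mu**2`.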
> From the conversation I can infer that we can implement this in skpro as it allows for mixed type distributions with pdf and pmf in different intervals.

Yes, assuming you mean intervals of the `p` parameter. In places where the distribution is entirely discrete or entirely continuous, the `pdf` or `pmf`, respectively, will return zero.
Further, here's an interesting option, since multiple already implemented distributions figure as special cases:
- implement `Tweedie` as a `_DelegatedDistribution` and delegate to one of the Tweedie ED families depending on `p`.
- change `_DelegatedDistribution` to delegate private, not public methods. This could be done in a separate PR, since the current delegator delegates public methods.

Here is an illustration of the suggested delegator approach: (Tweedie is a delegator compound of Tweedie ED families)
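In plain Python, the delegation idea might look roughly like the following. This is a toy sketch and does not use skpro's actual `_DelegatedDistribution`. All class names are hypothetical, and only the `p=1` (with `phi=1`) and `p=2` cases are covered. The point is that the public `pdf`/`pmf` stay in the base class and only the private `_pdf`/`_pmf` are forwarded to the chosen family.

```python
import math


class BaseDistSketch:
    """Toy base class: public methods do shared work, then call private ones."""

    def pdf(self, x):
        return self._pdf(float(x))

    def pmf(self, x):
        return self._pmf(float(x))

    def _pdf(self, x):
        return 0.0  # default: no continuous part

    def _pmf(self, x):
        return 0.0  # default: no discrete part


class PoissonSketch(BaseDistSketch):
    def __init__(self, rate):
        self.rate = rate

    def _pmf(self, x):
        if x < 0 or x != int(x):
            return 0.0
        k = int(x)
        return math.exp(-self.rate + k * math.log(self.rate) - math.lgamma(k + 1))


class GammaSketch(BaseDistSketch):
    def __init__(self, shape, scale):
        self.shape, self.scale = shape, scale

    def _pdf(self, x):
        if x <= 0:
            return 0.0
        log_pdf = ((self.shape - 1) * math.log(x) - x / self.scale
                   - self.shape * math.log(self.scale) - math.lgamma(self.shape))
        return math.exp(log_pdf)


class TweedieSketch(BaseDistSketch):
    """Delegator: picks a family from p and forwards the private methods to it."""

    def __init__(self, mu, phi, p):
        if p == 1 and phi == 1:
            self._delegate = PoissonSketch(rate=mu)
        elif p == 2:
            self._delegate = GammaSketch(shape=1.0 / phi, scale=mu * phi)
        else:
            raise NotImplementedError("only p=1 (phi=1) and p=2 in this sketch")

    def _pdf(self, x):
        return self._delegate._pdf(x)

    def _pmf(self, x):
        return self._delegate._pmf(x)
```

With this layout, `TweedieSketch(3.0, 1.0, 1).pdf(2)` is zero because the Poisson delegate has no continuous part, and `TweedieSketch(2.0, 0.5, 2).pmf(1.0)` is zero because the gamma delegate has no discrete part.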
Opened new issue on Tweedie distribution here, as that does not seem too straightforward - for further discussion. https://github.com/sktime/skpro/issues/429
We should try to interface `TweedieRegressor` from `sklearn` as an `skpro` regressor. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TweedieRegressor.html

Notes on implementation:
- it does not have a `return_std` interface, but we can use `_prep_skl_df`.
- we need a `Tweedie` distribution in `skpro`, currently it is not implemented.
- `Tweedie` has three parameters: power, location, scale. Power is set fixed in the `sklearn` `TweedieRegressor`, location is returned by `predict`, but it is unclear whether scale can be obtained from it. Perhaps @fsaforo1 has insight on this point.

FYI @ShreeshaM07, this is very similar to your previous work on `statsmodels` GLM!
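On the open question of scale: one common GLM practice, sketched here as an assumption rather than as something `TweedieRegressor` provides, is the Pearson estimate of the dispersion computed from fitted means. The helper name `pearson_dispersion` and the argument `n_params` are illustrative. `mu` would be the output of `predict` on the training data and `power` the regressor's `power` setting.

```python
def pearson_dispersion(y, mu, power, n_params):
    """Pearson estimate of phi: sum((y - mu)**2 / mu**power) / (n - n_params)."""
    n = len(y)
    if n <= n_params:
        raise ValueError("need more observations than fitted parameters")
    # each squared residual is scaled by the Tweedie variance function mu**power
    chi2 = sum((yi - mi) ** 2 / mi ** power for yi, mi in zip(y, mu))
    return chi2 / (n - n_params)
```

A tiny worked case: with `y = [1.0, 3.0]`, `mu = [2.0, 2.0]`, `power = 1` and one fitted parameter, the scaled squared residuals are 0.5 each, so the estimate is 1.0. This gives a single global scale, not a per-sample one, so it would only fit a design in which the scale is assumed constant across samples.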