sktime / skpro

A unified framework for tabular probabilistic regression, time-to-event prediction, and probability distributions in python
https://skpro.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

[BUG] ResidualDouble regressor does not handle transforms correctly. #492

Open meh2135 opened 3 days ago

meh2135 commented 3 days ago

Describe the bug In skpro.regression.residual.ResidualDouble, for the argument residual_trafo="squared", we should be taking the square root of the scale estimate before feeding it into the scale parameter, possibly followed by a further, distribution-specific correction.

For instance, if the error distribution is X ~ t(mu=0, scale=sigma, df=v), then E[|X|] = C(v) * sigma for a (complicated) constant C(v) that depends on the degrees of freedom, while E[X^2] = Var(X) = sigma^2 * v / (v - 2).

derivation from stack exchange
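
To make the squared case concrete, here is a minimal standalone sketch (plain numpy/scipy, illustrative only, not the ResidualDouble internals) of the correction I mean: mean(resid**2) estimates sigma^2 * v / (v - 2), so the plug-in scale should be sqrt(mean(resid**2) * (v - 2) / v), not mean(resid**2) itself.

import numpy as np
from scipy import stats

dof = 3.5
true_scale = 2.0
resid = stats.t(df=dof, loc=0.0, scale=true_scale).rvs(1_000_000, random_state=0)

# value fed into the scale parameter for residual_trafo="squared", per my reading of the code
naive_scale = np.mean(resid ** 2)
# square root plus the t-specific df correction, using E[resid^2] = sigma^2 * dof / (dof - 2)
corrected_scale = np.sqrt(np.mean(resid ** 2) * (dof - 2) / dof)

print(naive_scale)      # roughly sigma^2 * dof / (dof - 2) ~ 9.3, far from 2.0
print(corrected_scale)  # close to the true scale of 2.0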

For arbitrary transformations, this quickly gets tricky, even for those limited distributions. Further, arbitrary transformations can't easily be checked for monotonicity or positivity. I would advise removing that capability.

To Reproduce

Running the following should yield uniform quantiles, but that totally breaks down for the squared transform.

from sklearn.model_selection import KFold
from matplotlib import pyplot as plt
import lightgbm as lgb
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats

from skpro.regression.residual import ResidualDouble

nn = 100_000
x_df = pd.DataFrame(
    {"a": np.random.randn(nn), "b": np.random.randn(nn), "c": np.random.randn(nn)}
).clip(-2, 2)
degrees_of_freedom = 3.5
var_adj = degrees_of_freedom / (degrees_of_freedom - 2)

# DGP: t-distributed noise with covariate-dependent location and scale
loc_param_vec = pd.Series({"a": -1, "b": 1, "c": 0})
log_scale_param_vec = pd.Series({"a": 0, "b": 0.01, "c": 0.5})
loc_vec = x_df.dot(loc_param_vec)
log_scale_vec = x_df.dot(log_scale_param_vec)
dist = stats.t(df=degrees_of_freedom, loc=loc_vec, scale=np.exp(log_scale_vec))
y = pd.DataFrame(dist.rvs((2, nn)).T, index=x_df.index, columns=["y0", "y1"])

abs_reg = ResidualDouble(
    estimator=lgb.LGBMRegressor(verbosity=-1),
    estimator_resid=lgb.LGBMRegressor(verbosity=-1),
    distr_params={"df": degrees_of_freedom},
    distr_type="t",
    residual_trafo="absolute",
    cv=KFold(n_splits=3),
)
sq_reg = ResidualDouble(
    estimator=lgb.LGBMRegressor(verbosity=-1),
    estimator_resid=lgb.LGBMRegressor(verbosity=-1),
    distr_params={"df": degrees_of_freedom},
    distr_type="t",
    residual_trafo="squared",
    cv=KFold(n_splits=3),
)

abs_reg.fit(x_df, y["y0"])
sq_reg.fit(x_df, y["y0"])
abs_pred = abs_reg.predict_proba(x_df)
sq_pred = sq_reg.predict_proba(x_df)

# PIT values: should be approximately uniform if the predicted distributions are calibrated
abs_cdf = abs_pred.cdf(y[["y1"]])["y0"]
sq_cdf = sq_pred.cdf(y[["y1"]])["y0"]
cdf_df = pd.DataFrame({"abs": abs_cdf, "sq": sq_cdf})
sns.histplot(cdf_df, multiple="dodge", bins=10)

Confusion

I don't understand why it does yield uniform quantiles for the absolute transform for the t distribution. I would expect them to deviate from uniform because there is no degrees of freedom correction. Can anyone explain why that works?

Tacking the following code onto the end of the block above, we see that "raw" is the most uniform. That makes no sense to me.


from scipy.special import gamma

def abs_t_correction(dof):
    # ratio of E[|X|] to the standard deviation of a Student t with `dof` degrees of freedom
    return 2 * np.sqrt(dof - 2) * gamma((dof + 1) / 2) / (np.sqrt(np.pi) * (dof - 1) * gamma(dof / 2))

dist = stats.t(df=degrees_of_freedom, loc=loc_vec, scale=np.exp(log_scale_vec))
dist_div = stats.t(df=degrees_of_freedom, loc=loc_vec, scale=np.exp(log_scale_vec) / abs_t_correction(degrees_of_freedom))
dist_mult = stats.t(df=degrees_of_freedom, loc=loc_vec, scale=np.exp(log_scale_vec) * abs_t_correction(degrees_of_freedom))

# PIT values of samples drawn with the raw, scale-divided, and scale-multiplied versions
all_3_cdf = abs_pred.cdf(
    pd.DataFrame(
        {"div": dist_div.rvs(nn), "mult": dist_mult.rvs(nn), "raw": dist.rvs(nn)},
        columns=["div", "mult", "raw"],
    )
)
all_3_cdf.columns = ["div", "mult", "raw"]
sns.histplot(all_3_cdf, multiple="dodge", bins=10)
meh2135 commented 3 days ago

Working on a fix here: https://github.com/meh2135/skpro

meh2135 commented 3 days ago

After thinking about it a bit, I'm even more convinced that there is no good way to reasonably estimate a scale parameter by letting the user specify an arbitrary transform of the residual and modeling that transformed residual.

I'm going to remove that option in my PR.

Re the adjustment for the absolute transform for Student's t, you can see the issue here:

from skpro.distributions import TDistribution
from scipy import stats
from matplotlib import pyplot as plt
import numpy as np
n_samples = 40_000
dof_vec = np.linspace(2.1, 7.0, 20)
ratio_vec = []
for dof in dof_vec:
    # Generate data from a t-distribution
    gen_dist = stats.t(df=dof, loc=0.0, scale=2.0)
    gen_samp = gen_dist.rvs(n_samples)
    # Estimate sigma from the data using the naive approach used in ResidualDouble, i.e. sigma = mean(abs(y))
    naive_sigma_est = np.abs(gen_samp).mean()
    # Plug that into a skpro t-distribution
    naive_skpro_dist = TDistribution(df=dof, mu=0.0, sigma=naive_sigma_est)
    # Sample from that skpro distribution
    naive_skpro_samples = naive_skpro_dist.sample(n_samples)
    # Calculate the variance of the data and the variance of the skpro samples
    gen_var = gen_samp.var()
    naive_skpro_var = naive_skpro_samples[0].var()
    # print(f"Naive sigma estimate: {naive_sigma_est:.4f}")
    # print(f"Naive skpro var: {naive_skpro_var:.4f}")
    # print(f"Ratio of variances: {naive_skpro_var / gen_var:.4f}")
    ratio_vec.append(naive_skpro_var / gen_var)

plt.figure()
plt.plot(dof_vec, ratio_vec)

plt.figure()
plt.plot(dof_vec, np.sqrt(ratio_vec))

I don't see the correction being made anywhere in the ResidualDouble code. That said, the quantiles that come out of it look decent, so maybe I am missing something.
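
For reference, here is the kind of correction I would expect for the absolute transform, sketched with scipy only so it is self-contained (abs_moment_factor is my own illustrative name; it is E[|X|] divided by the t scale parameter, a close relative of abs_t_correction above, which divides by the standard deviation instead):

from scipy import stats
from scipy.special import gamma
import numpy as np

def abs_moment_factor(dof):
    # E[|X|] / scale for a Student t with `dof` degrees of freedom
    return 2 * np.sqrt(dof) * gamma((dof + 1) / 2) / (np.sqrt(np.pi) * (dof - 1) * gamma(dof / 2))

dof = 2.5
gen_dist = stats.t(df=dof, loc=0.0, scale=2.0)
gen_samp = gen_dist.rvs(200_000, random_state=0)

naive_sigma_est = np.abs(gen_samp).mean()                       # naive plug-in, as above
corrected_sigma_est = naive_sigma_est / abs_moment_factor(dof)  # should be close to 2.0

print(stats.t(df=dof, scale=naive_sigma_est).var() / gen_dist.var())      # noticeably above 1
print(stats.t(df=dof, scale=corrected_sigma_est).var() / gen_dist.var())  # close to 1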

FYI: @fkiraly

fkiraly commented 1 day ago

@meh2135, thanks a lot for your report and work on this!

I agree that the inverse transform is missing in _predict - I think we should add this for all cases. This is very astutely observed, really great!
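
To spell out what I mean by adding the inverse transform, a purely illustrative sketch (made-up names, not the actual _predict code):

import numpy as np

def _invert_residual_trafo(resid_pred, residual_trafo):
    # map the prediction of the transformed residual back to the residual scale
    if residual_trafo == "absolute":
        return resid_pred
    if residual_trafo == "squared":
        return np.sqrt(resid_pred)
    # arbitrary callables have no generic inverse, which is part of the problem discussed above
    raise NotImplementedError("no generic inverse for callable residual_trafo")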

However, I would split work on adding back the missing inverse transform from modifications of the algorithm.

There is a simple software engineering reason for that: we should not change anything in the package that may break user code - at least not without issuing warnings for a few months and only then making the change. Here is the change/deprecation management guide: https://www.sktime.net/en/latest/developer_guide/deprecation.html

Removing an option would be such a change, as it could break the code of users who rely on that option.

On the content side, that is, regarding the algorithm itself: I get what you are saying about violated statistical assumptions. We should keep in mind, though, that the algorithm is a simple heuristic and a common "rule of thumb" estimator. Of course there are better ways to estimate the residual distribution under statistical assumptions.

For this, I would recommend discussing the target state more systematically first - perhaps by writing down a mock docstring of what you think the "best end state" should be. If you implement first, we might end up writing unnecessary code, in case we eventually agree on a different docstring.

fkiraly commented 1 day ago

I don't understand why it does yield uniform quantiles for the absolute transform for the t distribution. I would expect them to deviate from uniform because there is no degrees of freedom correction. Can anyone explain why that works?

Regarding this, I think this is simply a model mismatch issue. Your data did not have t-distributed residuals to start with, so it may or may not be that a deviation from an incorrect assumption (a t-distribution with a certain df) leads to a better fit!

That is also related to what I was saying about this algorithm being a heuristic. In real life nothing is ever exactly normally or t-distributed (for single samples, that is; large-sample statistics do approximately follow limit distributions, etc.). So, if you have model mismatch anyway, one or the other version of a heuristic under model mismatch may be better.

The "machine learner" approach would be to take algorithms as they are, not be too concerned with "wrong statistical assumptions", and instead do stringent statistics for empirical performance evaluation.