sktime / skpro

A unified framework for tabular probabilistic regression and probability distributions in python
https://skpro.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

[BUG] `NGBoostRegressor` failing when `dist="TDistribution"` #291

Open ShreeshaM07 opened 4 months ago

ShreeshaM07 commented 4 months ago

Describe the bug

In the `gradient_boosting` module, where skpro wraps ngboost's `NGBRegressor` as `NGBoostRegressor`, fitting with `dist="TDistribution"` does not run as expected. It raises errors like:

    raise LinAlgError("Singular matrix")
numpy.linalg.LinAlgError: Singular matrix
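
For context, NumPy raises this exception when `numpy.linalg.solve` or `numpy.linalg.inv` is given an exactly singular matrix. Below is a minimal sketch that reproduces the error in isolation. The matrix `A`, the vector `b`, and the helper `safe_solve` are made up for illustration, and the least-squares fallback is only one possible mitigation, not something ngboost or skpro is known to do.

```python
import numpy as np

# Hypothetical 2x2 matrix: the second row is twice the first, so it is singular.
A = np.array([[1.0, 2.0], [2.0, 4.0]])
b = np.array([1.0, 2.0])


def safe_solve(A, b):
    """Solve A x = b, falling back to least squares if A is singular."""
    try:
        return np.linalg.solve(A, b), "solve"
    except np.linalg.LinAlgError:
        # lstsq returns a minimum-norm least-squares solution instead of raising
        x, *_ = np.linalg.lstsq(A, b, rcond=None)
        return x, "lstsq"


x, method = safe_solve(A, b)
print(method, x)
```

Whether a fallback like this is appropriate depends on where the singular matrix comes from, which is the open question in this issue.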

To Reproduce

With both sklearn's diabetes dataset and the breast_cancer dataset, the same Singular matrix error is raised. To reproduce:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from skpro.regression.gradient_boosting import NGBoostRegressor

# step 1: data specification
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, y)
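# step 2: fit with the t-distribution
# note: this calls the private _fit/_predict methods directly, not the public fit/predict
ngb = NGBoostRegressor(dist="TDistribution")._fit(X_train, Y_train)
Y_preds = ngb._predict(X_test)

# step 3: predicted distributions, first via the private helper, then via the public API
Y_dists = ngb._pred_dist(X_test)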

print(Y_dists)
Y_pred_proba = ngb.predict_proba(X_test)
print(Y_pred_proba)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_test)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)

Expected behavior

The expected output should look something like this:

[iter 0] loss=5.7260 val_loss=0.0000 scale=1.0000 norm=62.6096
[iter 100] loss=5.3862 val_loss=0.0000 scale=1.0000 norm=44.7994
[iter 200] loss=5.1347 val_loss=0.0000 scale=2.0000 norm=70.8354
[iter 300] loss=4.9709 val_loss=0.0000 scale=1.0000 norm=31.4283
[iter 400] loss=4.8448 val_loss=0.0000 scale=2.0000 norm=57.8725
<ngboost.distns.t.TDistribution object at 0x7a306649f010>
TDistribution(columns=Index(['target'], dtype='object'),
       index=Index([394,  76, 398, 154, 164, 409,  86,  57, 248, 252,
       ...
       337,  16, 115, 134, 158, 256, 315,   7, 292, 119],
      dtype='int64', length=111),
       mu=              0
0    204.242902
1    159.767290
2    180.299182
3    157.156834
4    132.029658
..          ...
106  207.598136
107  111.282266
108  142.690431
109   82.266164
110  144.789344

[111 rows x 1 columns],
       sigma=             0
0    22.784403
1    26.722443
2    41.334656
3    32.130065
4    23.862477
..         ...
106  31.425179
107  33.441920
108  24.632183
109  26.791969
110  34.908296

[111 rows x 1 columns])
Test MSE 4077.414567879142
Test NLL 6.473540253400317

Environment

Python 3.11.8, ngboost 0.5.1

Additional context

The open question is whether the problem lies in the interfacing, i.e. the skpro API, or whether it is genuinely a bug in the ngboost TDistribution itself.

julian-fong commented 3 months ago

I am encountering Singular matrix errors in CI checks for other PRs and am wondering whether this is related. These are the tests that are failing in #370:

FAILED skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_does_not_overwrite_hyper_params[RandomizedSearchCV-2-ProbaRegressorSurvival] - numpy.linalg.LinAlgError: Singular matrix
FAILED skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_updates_state[GridSearchCV-2-ProbaRegressorSurvival] - numpy.linalg.LinAlgError: Singular matrix
FAILED skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_returns_self[RandomizedSearchCV-2-ProbaRegressorSurvival] - numpy.linalg.LinAlgError: Singular matrix
FAILED skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_does_not_overwrite_hyper_params[GridSearchCV-2-ProbaRegressorSurvival] - numpy.linalg.LinAlgError: Singular matrix
FAILED skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_updates_state[RandomizedSearchCV-2-ProbaRegressorSurvival] - numpy.linalg.LinAlgError: Singular matrix
FAILED skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_returns_self[GridSearchCV-2-ProbaRegressorSurvival] - numpy.linalg.LinAlgError: Singular matrix

fkiraly commented 3 months ago

Hm, I think this is due to the CoxPH used in parameter set 2, which is not robust when used on a small dataset.
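
As a rough illustration of why small data can produce singular matrices: when there are fewer samples than features, the Gram matrix `X.T @ X` that appears in many Newton-type fits is rank deficient. The sketch below uses made-up random data and hypothetical shapes; it is not the CoxPH code itself, and the ridge term is just one common stabilisation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 3, 5  # hypothetical: fewer rows than columns
X = rng.normal(size=(n_samples, n_features))

# 5x5 Gram matrix; its rank cannot exceed the number of samples
gram = X.T @ X
rank = np.linalg.matrix_rank(gram)

# Adding a small ridge term to the diagonal restores full rank
ridge = gram + 1e-3 * np.eye(n_features)
ridge_rank = np.linalg.matrix_rank(ridge)

print("rank of Gram matrix:", rank, "of", n_features)
print("rank after ridge term:", ridge_rank)
```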

We could:

julian-fong commented 3 months ago

Do you have a particular preference? I'm not too familiar with survival models, so recommendations would be helpful here.

fkiraly commented 3 months ago

Summarizing the earlier discussion today: any survival model without soft dependencies that is numerically stable on small data should do for the purpose of smooth testing, for example ResidualDouble with LinearRegression or similar.