scikit-learn-contrib / MAPIE

A scikit-learn-compatible module to estimate prediction intervals and control risks based on conformal predictions.
https://mapie.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

MapieQuantileRegressor - predict method causing MemoryError #410

Closed gabrieltardochi closed 4 months ago

gabrieltardochi commented 5 months ago

Describe the bug
I was following the tutorial for conformalized quantile regression (CQR) and got a MemoryError because a very large array is created in the .predict() method.

To Reproduce
I can't share my private dataset here, but I simply followed the tutorial.

import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
from mapie.regression import MapieQuantileRegressor
from sklearn.model_selection import (
    train_test_split
)

random_state = 26
rng = np.random.default_rng(random_state)

X_cols = [
    "day_of_week",
    "captured_lag1",
    "captured_lag2",
    "captured_lag3",
    "captured_lag4",
    "captured_lag5",
    "captured_lag6",
    "captured_lag7",
    "captured_lag8",
]
y_col = "target"

# train_data is the private DataFrame (not shared here)
X = train_data[X_cols]
y = train_data[y_col]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size=0.8,
    random_state=random_state,
    shuffle=False,
)
X_train, X_calib, y_train, y_calib = train_test_split(
    X_train,
    y_train,
    train_size=0.8,
    random_state=random_state,
    shuffle=False,
)

estimator = LGBMRegressor(
    objective="quantile", alpha=0.5, random_state=random_state
)
estimator.fit(X_train, y_train)

mqr = MapieQuantileRegressor(
    estimator, method="quantile", cv="split", alpha=0.02
)
mqr.fit(
    X_train,
    y_train,
    X_calib=X_calib,
    y_calib=y_calib,
    random_state=random_state,
)
y_pred, y_pis = mqr.predict(X_test)

Expected behavior
I expected to get the predicted values and the lower/upper bounds of the prediction intervals.

Screenshots
Error:

{
    "name": "MemoryError",
    "message": "Unable to allocate 149. GiB for an array with shape (400000, 400000) and data type bool",
    "stack": "---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[18], line 58
     48 mqr = MapieQuantileRegressor(
     49     estimator, method=\"quantile\", cv=\"split\", alpha=0.25
     50 )
     51 mqr.fit(
     52     X_train,
     53     y_train,
   (...)
     56     random_state=random_state,
     57 )
---> 58 y_pred, y_pis = mqr.predict(X_test)

File ~/GitHub/cqr-project/.venv/lib/python3.8/site-packages/mapie/regression/quantile_regression.py:710, in MapieQuantileRegressor.predict(self, X, ensemble, alpha, optimize_beta, allow_infinite_bounds, symmetry)
    708 y_pred_low = y_preds[0][:, np.newaxis] - quantile[0]
    709 y_pred_up = y_preds[1][:, np.newaxis] + quantile[1]
--> 710 check_lower_upper_bounds(y_preds, y_pred_low, y_pred_up)
    711 return y_preds[2], np.stack([y_pred_low, y_pred_up], axis=1)

File ~/GitHub/cqr-project/.venv/lib/python3.8/site-packages/mapie/utils.py:591, in check_lower_upper_bounds(y_preds, y_pred_low, y_pred_up)
    579 if (y_preds.ndim != 1) and any_init_inversion:
    580     warnings.warn(
    581         \"WARNING: The predictions of the quantile regression \"
    582         + \"have issues.\\nThe upper quantile predictions are lower\\n\"
    583         + \"than the lower quantile predictions\\n\"
    584         + \"at some points.\"
    585     )
    587 any_final_inversion = np.any(
    588     np.logical_or(
    589         np.logical_or(
    590             y_pred_low > y_pred_up,
--> 591             init_pred < y_pred_low,
    592         ),
    593         init_pred > y_pred_up,
    594     )
    595 )
    597 if any_final_inversion:
    598     warnings.warn(
    599         \"WARNING: The predictions have issues.\\n\"
    600         + \"The upper predictions are lower than\"
    601         + \"the lower predictions at some points.\"
    602     )

MemoryError: Unable to allocate 149. GiB for an array with shape (400000, 400000) and data type bool"
}

Additional context
My dataframe contains only integer and float columns. Since there are NaN values, I tried filling them with a constant, but that did not help. I also tried different values of alpha.
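The (400000, 400000) shape in the error message is what NumPy broadcasting produces when a 1-D array of length n is compared with an (n, 1) column vector. The traceback shows that y_pred_low and y_pred_up are built with `[:, np.newaxis]`, so they are column vectors; that `init_pred` is 1-D is an assumption here, inferred from the error and not confirmed in the thread. A minimal sketch of the effect with a small n, plus one possible way to keep the comparison elementwise (not necessarily the fix applied in the library):

```python
import numpy as np

n = 1_000  # small stand-in for the 400000 rows in the report

# Assumed shapes: the point prediction is 1-D, the bounds are column vectors.
init_pred = np.zeros(n)             # shape (n,)
y_pred_low = np.full((n, 1), -1.0)  # shape (n, 1)

# (n,) compared with (n, 1) broadcasts to (n, n): one bool per pair of rows.
pairwise = init_pred < y_pred_low
print(pairwise.shape)  # (1000, 1000)

# Flattening the bound first keeps the comparison elementwise, shape (n,).
elementwise = init_pred < y_pred_low.ravel()
print(elementwise.shape)  # (1000,)

# Size in GiB of a (400000, 400000) bool array, as in the error message.
reported_gib = 400_000 * 400_000 * np.dtype(bool).itemsize / 2**30
print(round(reported_gib, 1))  # 149.0
```

The last figure matches the "149. GiB" in the MemoryError, which is consistent with the broadcasting explanation.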

LacombeLouis commented 5 months ago

Hey @gabrieltardochi, I understand the problem. I think this is mostly a code efficiency issue. Would you like to try fixing it? Otherwise, we will get on this very soon, as it is indeed a problem! Thank you for pointing this out!
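For readers on an affected version, one possible stopgap is to call predict on slices of the test set, so that any array that grows quadratically with the number of rows stays bounded by the batch size. The sketch below is not part of MAPIE: `predict_in_batches` is a hypothetical helper, and it assumes that the predictions for one row do not depend on the other rows passed in the same call. `fake_predict` is a stand-in that only mimics a `(y_pred, y_pis)` return value.

```python
import numpy as np


def predict_in_batches(predict, X, batch_size=10_000):
    """Call predict on consecutive row slices of X and stitch the results."""
    y_pred_parts, y_pis_parts = [], []
    for start in range(0, len(X), batch_size):
        y_pred, y_pis = predict(X[start:start + batch_size])
        y_pred_parts.append(y_pred)
        y_pis_parts.append(y_pis)
    return np.concatenate(y_pred_parts), np.concatenate(y_pis_parts)


def fake_predict(X):
    # Stand-in for a fitted model: point prediction plus a +/- 1 interval,
    # returned as arrays of shape (n,) and (n, 2, 1).
    y = X.sum(axis=1)
    return y, np.stack([y - 1.0, y + 1.0], axis=1)[:, :, np.newaxis]


X_demo = np.arange(50, dtype=float).reshape(25, 2)
y_pred, y_pis = predict_in_batches(fake_predict, X_demo, batch_size=10)
print(y_pred.shape, y_pis.shape)  # (25,) (25, 2, 1)
```

With the data from the report, this would be called as `predict_in_batches(mqr.predict, X_test.to_numpy())`, or with `.iloc` slicing inside the helper if the DataFrame should be kept. Each call can still emit the warnings seen in the traceback, once per batch.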

gabrieltardochi commented 5 months ago

I'm sorry I couldn't answer earlier. Looks like you are already on it! Thank you.

LacombeLouis commented 5 months ago

Hey @gabrieltardochi, I have made a PR. It would be great if you could check whether that PR fixes your issue (with your own data)! Looking forward to your feedback!

gabrieltardochi commented 5 months ago

Hi @LacombeLouis, I can confirm that it fixes my issue! Thanks :raised_hands: