scikit-learn-contrib / forest-confidence-interval

Confidence intervals for scikit-learn forest algorithms
http://contrib.scikit-learn.org/forest-confidence-interval/
MIT License

Zero confidence intervals #72

Open swarnendubiswas opened 6 years ago

swarnendubiswas commented 6 years ago

Hi,

I am trying to generate confidence intervals for the below sample data. I am using RandomForestRegressor with bootstrapping enabled.

X_train shape (270, 7) [[ 12. 20. 1. ... 300. 1. 0.] [ 12. 20. 1. ... 300. 1. 0.] [ 12. 20. 1. ... 300. 1. 0.] ... [ 12. 30. 10. ... 300. 1. 0.] [ 12. 30. 10. ... 300. 1. 0.] [ 12. 30. 10. ... 300. 1. 0.]]

Test data shape (36,7) [[ 12. 10. 1. 1. 300. 1. 0.] [ 12. 10. 5. 1. 300. 1. 0.] [ 12. 10. 10. 1. 300. 1. 0.] ... [ 12. 10. 1. 4. 300. 1. 0.] [ 12. 20. 1. 4. 300. 1. 0.] [ 12. 30. 1. 4. 300. 1. 0.]]

I generate ci data as

ci_data = fci.random_forest_error(model, x_train, x_test,  calibrate=True)

However, ci_data contains values that are all effectively zero:

[1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34]

Do you have any pointers as to what could be going wrong here? Thanks.

arokem commented 6 years ago

I'm not sure. Might be related to the feature that has the same value for every observation?
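For what it's worth, a minimal sketch of dropping zero-variance columns before fitting, using scikit-learn's VarianceThreshold (the variable names here are just illustrative):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Remove columns that take the same value for every observation
selector = VarianceThreshold(threshold=0.0)
x_train_reduced = selector.fit_transform(x_train)
x_test_reduced = selector.transform(x_test)
print("kept columns:", np.where(selector.get_support())[0])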


swarnendubiswas commented 6 years ago

Thanks for the tip. I had already tried that, but it does not help. I trimmed my data set so that it now contains only the features that vary, plus the label, but I still get the zero CIs.

Without normalization (which should not strictly be necessary for random forests), I get the following runtime warning:

/usr/local/lib/python3.6/dist-packages/forestci/calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
  g_eta_main = g_eta_raw / sum(g_eta_raw)

This happens because all entries in mask are zero:

    def neg_loglik(eta):
        mask = np.ones_like(xvals)
        mask[np.where(xvals <= 0)[0]] = 0  # if every xval is <= 0, mask ends up all zeros
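A toy sketch of the failure mode in plain NumPy (not the library code): once every xval is non-positive, the mask zeroes out g_eta_raw, and the normalization then divides zero by zero:

import numpy as np

xvals = np.array([-1.0, -0.5, 0.0])      # all non-positive
mask = np.ones_like(xvals)
mask[np.where(xvals <= 0)[0]] = 0        # mask is now all zeros
g_eta_raw = np.exp(xvals) * mask         # all zeros
g_eta_main = g_eta_raw / sum(g_eta_raw)  # 0/0 -> RuntimeWarning and NaNs
print(g_eta_main)                        # [nan nan nan]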

I have attached the data csv if you are interested.

data.zip

arokem commented 6 years ago

Could you also send along the code you ran?

swarnendubiswas commented 6 years ago

I have gotten it to work by using a StandardScaler() or MinMaxScaler(). Otherwise, I get the following warnings:

/lib/python3.6/site-packages/forestci/calibration.py:86: RuntimeWarning: overflow encountered in exp
  g_eta_raw = np.exp(np.dot(XX, eta)) * mask
/lib/python3.6/site-packages/forestci/calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
  g_eta_main = g_eta_raw / sum(g_eta_raw)
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan]

This is the code; the data file is attached (data.csv.zip):

import csv
import numpy as np
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import sklearn.model_selection as xval
import forestci as fci

# Read data from the csv file
mpg_X = []
mpg_Y = []
with open("./data.csv", "r") as f:
    reader = csv.reader(f)
    for line in reader:
        line = [float(x) for x in line]
        mpg_X.append(line[1:4])  # keep only the features that vary
        mpg_Y.append(line[-1])   # last column is the label

mpg_X = np.array(mpg_X)
mpg_Y = np.array(mpg_Y)

# xscaler = MinMaxScaler()
# yscaler = MinMaxScaler()

# n_mpg_x = xscaler.fit_transform(mpg_X)
# n_mpg_y = yscaler.fit_transform(mpg_Y.reshape(-1, 1))
n_mpg_x = mpg_X
n_mpg_y = mpg_Y

# split mpg data into training and test set
mpg_X_train, mpg_X_test, mpg_y_train, mpg_y_test = xval.train_test_split(n_mpg_x, n_mpg_y, test_size=0.1,
                                                                         random_state=42)

mpg_forest = RandomForestRegressor(n_estimators=200, random_state=42)
mpg_forest.fit(mpg_X_train, mpg_y_train.ravel())
mpg_y_hat = mpg_forest.predict(mpg_X_test)

# Calculate the variance:
# inbag = fci.calc_inbag(mpg_X_train.shape[0], mpg_forest)
mpg_V_IJ_unbiased = fci.random_forest_error(mpg_forest, mpg_X_train, mpg_X_test,   # inbag=inbag,
                                            calibrate=True)
print(mpg_V_IJ_unbiased)

plt.scatter(mpg_y_test, mpg_y_hat)
min_x = min(min(mpg_y_test), min(mpg_y_hat))
max_x = max(max(mpg_y_test), max(mpg_y_hat))

plt.plot([min_x, max_x], [min_x, max_x], 'k--')
plt.xlabel('Reported label')
plt.ylabel('Predicted label')
plt.show()

plt.errorbar(mpg_y_test, mpg_y_hat, yerr=np.sqrt(mpg_V_IJ_unbiased), fmt='o')
plt.plot([min_x, max_x], [min_x, max_x], 'k--')
plt.xlabel('Reported label')
plt.ylabel('Predicted label')
plt.show()
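For completeness, a minimal sketch of the scaler variant described above (just enabling the commented-out lines; note that the returned variance is then on the scaled target's scale, see further down the thread for mapping it back):

xscaler = MinMaxScaler()
yscaler = MinMaxScaler()
n_mpg_x = xscaler.fit_transform(mpg_X)
n_mpg_y = yscaler.fit_transform(mpg_Y.reshape(-1, 1)).ravel()
# ...then split, fit, and call fci.random_forest_error exactly as above.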
swarnendubiswas commented 6 years ago

@arokem Did you have a chance to try out the code? Were you able to reproduce the error I faced?

stl-christywilloughby commented 5 years ago

Thanks for posting this. I was also getting nans, and was able to work backwards by starting with your example and filling in my own data.

swarnendubiswas commented 5 years ago

@stl-christywilloughby You're welcome. Have you been able to fix the NaNs? If so, were you able to identify any patterns in the data or usage that cause them?

charlesxjyang commented 5 years ago

I also get the same error:

/lib/python3.6/site-packages/forestci/calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
  g_eta_main = g_eta_raw / sum(g_eta_raw)

I've been running this code on my own dataset, and the warning appears whether or not I use StandardScaler. Is there any update on the potential cause?

jayarehart commented 4 years ago

I am seeing the same warnings with my own dataset. Not all of my features trigger them: when I subsample my dataset, the code works or fails depending on the exact subsample chosen:

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/forestci/calibration.py:86: RuntimeWarning: overflow encountered in exp
  g_eta_raw = np.exp(np.dot(XX, eta)) * mask

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/forestci/calibration.py:101: RuntimeWarning: overflow encountered in exp
  g_eta_raw = np.exp(np.dot(XX, eta_hat)) * mask

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/forestci/calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
  g_eta_main = g_eta_raw / sum(g_eta_raw)

haijunli0629 commented 3 years ago

@swarnendubiswas

I encountered the same issue. After trying your code with MinMaxScaler(), the problem is solved, but the confidence intervals are on the MinMaxScaler's scale. How can we restore these confidence intervals to the original units?

Thanks.


swarnendubiswas commented 3 years ago

@smile4lee Sorry I do not get your question.

haijunli0629 commented 3 years ago


@swarnendubiswas Sorry for the confusion. I mean issues similar to those mentioned in #83: we need the variance on the same scale as the original (unscaled) data, so how can we transform the variances to correspond to the original data?
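For what it's worth, a sketch of the back-transform, assuming the target was scaled with one of scikit-learn's linear scalers (for any linear transform, Var(a*Y + b) = a^2 * Var(Y)):

import numpy as np

# yscaler is the fitted scaler used on the target.
# MinMaxScaler applies y_scaled = a * y + b with a = yscaler.scale_[0];
# StandardScaler applies y_scaled = (y - mean) / s with s = yscaler.scale_[0].
a = yscaler.scale_[0]                  # MinMaxScaler case
V_original = V_IJ_unbiased / a ** 2    # variance back in original units
sd_original = np.sqrt(V_original)      # standard deviation in original units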

DariusRoman commented 3 years ago

According to the documentation, forestci.random_forest_error performs calibration by default. Set calibrate=False and you will not get the NaNs. As for the calibration method itself, you will have to go through the code in detail.

I personally do the calibration after obtaining the standard deviation from forestci.random_forest_error.

Hope this helps.
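A sketch of that workaround; since uncalibrated variance estimates can come out negative, they are clipped here before taking a square root (variable names follow the code earlier in the thread):

import numpy as np
import forestci as fci

V = fci.random_forest_error(mpg_forest, mpg_X_train, mpg_X_test,
                            calibrate=False)
sd = np.sqrt(np.clip(V, 0, None))  # guard against negative estimates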

Niccolo-Ajroldi commented 2 years ago

Is there any update regarding this issue? Avoiding calibration doesn't seem to be a solution, since the estimated variance can then be negative.

Thank you in advance!

itsamejoshab commented 1 year ago

This thread helped me realize that the issue was with calibration, so I turned it off. But as @Niccolo-Ajroldi said, this is not a solution but a workaround. I will do the calibration differently for now.

el-hult commented 2 weeks ago

There were some unfortunate details in the calibration routine that made it fail in some cases. This is partially addressed by my proposed PR #114.

However, if you have too few trees in the forest, or certain examples in your data have a large irreducible variance, the calibration approach will still fail. Increasing the number of trees, or handling outliers, seems to be the way to go; see the sketch below.
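For example, a hedged sketch that simply grows a much larger forest before asking for calibrated intervals (reusing the variable names from the code earlier in the thread):

from sklearn.ensemble import RandomForestRegressor
import forestci as fci

# More trees stabilize the variance estimates that calibration relies on.
forest = RandomForestRegressor(n_estimators=2000, random_state=42)
forest.fit(mpg_X_train, mpg_y_train)
V = fci.random_forest_error(forest, mpg_X_train, mpg_X_test, calibrate=True)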