scikit-learn-contrib / forest-confidence-interval

Confidence intervals for scikit-learn forest algorithms
http://contrib.scikit-learn.org/forest-confidence-interval/
MIT License

Overflow errors #88

Closed. tawe141 closed this issue 4 months ago.

tawe141 commented 4 years ago

When using random_forest_error() with a dataset whose features range between 0 and 1 (dtype float64), I get a bunch of overflow warnings like these:

/Users/erictaw/forest-confidence-interval/forestci/calibration.py:86: RuntimeWarning: overflow encountered in exp
  g_eta_raw = np.exp(np.dot(XX, eta)) * mask
/Users/erictaw/forest-confidence-interval/forestci/calibration.py:101: RuntimeWarning: overflow encountered in exp
  g_eta_raw = np.exp(np.dot(XX, eta_hat)) * mask
/Users/erictaw/forest-confidence-interval/forestci/calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
  g_eta_main = g_eta_raw / sum(g_eta_raw)

Turning off calibration eliminates these warnings, of course. Is this something I should be worried about?
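For reproducibility, here is roughly how I'm calling it, on synthetic data with the same shape of problem (names like rf are just my own; depending on your forestci version the second argument may need to be X_train itself or its shape, so check the docs for your installed release):

```python
import numpy as np
import forestci as fci
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for my dataset: float64 features in [0, 1].
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# calibrate=True (the default) emits the overflow warnings above;
# calibrate=False silences them.
V_IJ = fci.random_forest_error(rf, X_train, X_test, calibrate=False)
print(V_IJ[:5])  # per-test-point variance estimates
```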

PrenilS commented 4 years ago

I've been getting this error as well and have the same question.

haijunli0629 commented 3 years ago

I have the same problem, and the errors are gone after turning off calibration. Have you found any other solutions?

haijunli0629 commented 3 years ago

@tawe141

With calibration turned off, the V_IJ_unbias array contains negative values, as mentioned in #25. With calibration on, the output is all NaN instead. Do you have any solution to this?
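To show what I mean, reusing rf, X_train, and X_test from the sketch earlier in the thread (the names here are just for illustration, and the clip at zero is a crude workaround I tried, not a statistical fix):

```python
import numpy as np
import forestci as fci

V_IJ = fci.random_forest_error(rf, X_train, X_test, calibrate=False)
print("negative entries:", int((V_IJ < 0).sum()))
print("NaN entries:", int(np.isnan(V_IJ).sum()))

# Crude workaround: clamp negative variance estimates to zero before
# taking square roots for error bars. This hides the symptom only.
err = np.sqrt(np.clip(V_IJ, 0.0, None))
```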

Thanks.

sylphrena0 commented 2 years ago

I am still experiencing this issue.

el-hult commented 4 months ago

The method in this library estimates V, the infinitesimal jackknife variance. It is valid if you have a lot of data (large n) and a lot of trees in the forest (large B); strictly speaking, the estimates are only exact in the limit as n and B go to infinity.

With a finite number of trees you get a bias. The library applies a bias correction to try to remove it. That correction is itself only valid for large enough n and B; if n or B is too small, the bias correction can produce negative variance estimates for V.
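For reference, the Monte Carlo bias correction from Wager, Hastie & Efron (2014), as I read it, subtracts an estimate of the sampling noise across the B trees (here t_b^*(x) is tree b's prediction at x and \bar{t}^*(x) is the forest average):

```latex
\hat{V}^{B}_{IJ\text{-}U}(x)
  = \hat{V}^{B}_{IJ}(x)
  - \frac{n}{B^{2}} \sum_{b=1}^{B} \left( t_b^{*}(x) - \bar{t}^{*}(x) \right)^{2}
```

When B is small relative to n, the subtracted term can exceed the raw estimate, which is exactly how the negative values discussed in #25 arise.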

The calibration routine tries to fix this problem. It uses an empirical Bayes hierarchical model to adjust the variance estimates V. If the distribution of your uncalibrated V values does not match the parametric modelling assumptions, calibration will not help, and if n or B is too small, the empirical distribution of V likely does not follow the parametric model.
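On the numerical side, the overflow warnings themselves come from exponentiating large dot products before normalizing. Since g_eta_main is normalized afterwards, shifting every exponent down by the maximum (the standard log-sum-exp trick) would give the same distribution without overflow. A self-contained illustration with synthetic numbers, not the library's actual code:

```python
import numpy as np

logits = np.array([1000.0, 995.0, 2.0])  # exp(1000) overflows float64

naive = np.exp(logits)                   # RuntimeWarning: overflow -> [inf, inf, 7.4]
naive_norm = naive / naive.sum()         # inf/inf -> nan, as in the traceback above

stable = np.exp(logits - logits.max())   # largest exponent becomes 0: no overflow
stable_norm = stable / stable.sum()      # same normalized result, finite throughout

print(naive_norm)    # [nan nan  0.]
print(stable_norm)   # [~0.9933 ~0.0067  0.]
```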

In conclusion: Collect more data and increase the size of the random forest.