scikit-learn-contrib / forest-confidence-interval

Confidence intervals for scikit-learn forest algorithms
http://contrib.scikit-learn.org/forest-confidence-interval/
MIT License

negative V_IJ_unbiased #25

Closed: ondrejiayc closed this issue 7 years ago

ondrejiayc commented 8 years ago

Hi,

First of all, great work; this is a really useful tool! I have a couple of questions based on issues I've encountered while playing with the package. Apologies if these reveal a misunderstanding on my part rather than an actual issue with the code.

1) When running the confidence interval calculation on a forest I trained, I encounter negative values of the unbiased variance (V_IJ_unbiased). Moreover, the more trees my forest has, the more of these negative values appear. Could there be some kind of bias overcorrection?

2) The _bias_correction function in the module calculates an n_var quantity, which it then applies to the bias correction vector. However, no such expression appears in Eq. (7) of Wager et al. (2014), according to which the bias correction should be n_train_samples * boot_var / n_trees (using the variable names from the package code). Where does n_var come from? (See the sketch after this list for my reading of the formula.)

3) I don't see any parameter regulating the number of bootstrap draws. Even though O(n) draws should be enough to take care of the Monte Carlo noise, it should still be possible to control this somehow. If I change the n_samples parameter, it clashes with the pred matrix, whose size is fixed to the number of trees in the forest. How can the number of draws be regulated?

4) In fact, if I'm reading the paper right, the idea is to look at how the predictions of the individual trees change across different bootstrap samples of the original data. That doesn't seem to be what the package is doing: it uses the predictions of a single forest on a set of test data rather than the predictions of multiple forests on a single new sample. Where is my understanding wrong?
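To make questions 2) and 4) concrete, here is my reading of Eqs. (5) and (7) of Wager et al. (2014) as a small NumPy sketch. The function and variable names (`ij_variance`, `inbag`, `pred`) are mine, not the package's; this is only what I would expect the code to compute:

```python
import numpy as np

def ij_variance(inbag, pred, unbiased=True):
    # inbag: (n_train, n_trees) array; inbag[i, b] counts how often training
    #        point i appears in the bootstrap sample used to grow tree b.
    # pred:  (n_trees, n_test) array; pred[b, x] is tree b's prediction at x.
    n_train, n_trees = inbag.shape
    pred_centered = pred - pred.mean(axis=0)                   # t_b(x) - t_bar(x)
    inbag_centered = inbag - inbag.mean(axis=1, keepdims=True)

    # Covariance over trees between N_i and t_b(x), for every pair of
    # (training point i, test point x): shape (n_train, n_test).
    cov = inbag_centered @ pred_centered / n_trees
    v_ij = (cov ** 2).sum(axis=0)                              # Eq. (5): V_IJ(x)
    if not unbiased:
        return v_ij

    # Monte Carlo bias correction, Eq. (7):
    #   V_IJ_unbiased = V_IJ - n_train * boot_var / n_trees
    boot_var = (pred_centered ** 2).mean(axis=0)               # Var_b[t_b(x)]
    return v_ij - n_train * boot_var / n_trees
```

If this reading is right, the correction term is just the sampling variance of the tree predictions scaled by n_train / n_trees, with no n_var anywhere.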

Thanks and again, let me know if what I'm asking is off-topic for here.

Ondrej

tarikd commented 8 years ago

👍

kpolimis commented 8 years ago

@ondrejiayc @tarikd Thank you for your interest in our package and the questions you’ve raised. We are working to clarify our code and resolve some issues.

kytexie commented 7 years ago

Same here. It's either a coding issue, or the method itself does not necessarily return a positive "corrected" value. Please let us know if it gets solved! Thanks!

RetepAdam commented 7 years ago

Any updates yet on the negative values? For the most part the package has been great; this is the lone hang-up I've had.

jeffyangchen commented 7 years ago

If you compare this implementation of Wager's random forest confidence interval to the original R code, there is an additional calibration step that is missing from the Python code. The calibration is supposed to fix the problem of negative variance estimates; see the sketch below the link.

https://github.com/swager/randomForestCI/blob/master/R/infinitesimalJackknife.R
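For what it's worth, the calibration in the R code is an empirical-Bayes shrinkage of the raw variance estimates that accounts for their Monte Carlo sampling noise. The snippet below is not the actual calibrateEB routine (which fits a prior to the estimates); it's only a crude James-Stein-style sketch of the shrinkage idea, with illustrative names:

```python
import numpy as np

def calibrate_sketch(v_raw, mc_noise_var):
    # v_raw: raw V_IJ_unbiased estimates per test point (can be negative).
    # mc_noise_var: estimated Monte Carlo sampling variance of each estimate.
    prior_mean = v_raw.mean()
    # Estimated spread of the *true* variances, after removing MC noise.
    signal_var = max(v_raw.var() - mc_noise_var.mean(), 0.0)
    # Shrink noisy estimates toward the common mean; weights lie in [0, 1].
    shrink = signal_var / (signal_var + mc_noise_var)
    v_cal = prior_mean + shrink * (v_raw - prior_mean)
    # Floor at zero for illustration only; the real EB calibration gets
    # nonnegativity from the fitted prior rather than from clipping.
    return np.maximum(v_cal, 0.0)
```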

arokem commented 7 years ago

Implemented in #49. We're waiting to finish things for the JOSS review before merging that one and cutting a release.

arokem commented 7 years ago

I believe this should be resolved with the new release. Closing.
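For anyone landing here later, here is a quick way to check the fix against the released package. This assumes the random_forest_error signature and its calibrate flag as documented at the time of writing; adjust for your installed version:

```python
import numpy as np
import forestci as fci
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Uncalibrated estimates can still dip below zero on small forests;
# the calibrated ones should not.
v_raw = fci.random_forest_error(forest, X_train, X_test, calibrate=False)
v_cal = fci.random_forest_error(forest, X_train, X_test, calibrate=True)
print("negative variances, uncalibrated:", int(np.sum(v_raw < 0)))
print("negative variances, calibrated:  ", int(np.sum(v_cal < 0)))
```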