ondrejiayc closed this issue 7 years ago
@ondrejiayc @tarikd Thank you for your interest in our package and the questions you’ve raised. We are working to clarify our code and resolve some issues.
Same here. Either it's a coding issue, or the method itself does not guarantee a positive "corrected" value. Please let us know when it is solved! Thanks!
Any updates yet on the negative values? The package has been great otherwise; that's the only hang-up I've had.
If you compare this implementation of Wager's random forest confidence interval to the original R code, there is an additional calibration step that is missing in the Python code. The calibration is supposed to fix the problem of negative variance estimates:
https://github.com/swager/randomForestCI/blob/master/R/infinitesimalJackknife.R
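As I read it, the calibration there is an empirical-Bayes step: the raw variance estimates are treated as noisy observations of the true variances, a nonnegative prior is fitted to them, and each estimate is replaced by its posterior mean, which cannot be negative. Here is a rough sketch of that idea, not a port of the R code; `raw_var` and `mc_noise_sd` are illustrative names:

```python
import numpy as np

def calibrate_eb_sketch(raw_var, mc_noise_sd, n_grid=200, n_iter=100):
    """Empirical-Bayes calibration sketch: fit a discrete nonnegative prior
    to the raw variance estimates by EM, assuming Gaussian Monte Carlo
    noise of known scale, and return posterior-mean variances (>= 0)."""
    raw_var = np.asarray(raw_var, dtype=float)
    # Candidate values for the true (nonnegative) variances
    grid = np.linspace(0.0, max(raw_var.max(), mc_noise_sd), n_grid)
    # Gaussian likelihood of each raw estimate under each grid value
    lik = np.exp(-0.5 * ((raw_var[:, None] - grid[None, :]) / mc_noise_sd) ** 2) + 1e-300
    weights = np.full(n_grid, 1.0 / n_grid)   # start from a flat prior
    for _ in range(n_iter):                   # EM updates of the prior weights
        post = lik * weights
        post /= post.sum(axis=1, keepdims=True)
        weights = post.mean(axis=0)
    post = lik * weights
    post /= post.sum(axis=1, keepdims=True)
    return post @ grid                        # posterior means are always >= 0
```

The linked R implementation is more elaborate, but the net effect is the same: negative raw estimates get pulled up to plausible nonnegative values.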
Implemented in #49. We're waiting to finish things for the JOSS review before merging that one and cutting a release.
I believe this should be resolved with the new release. Closing.
Hi,
first of all, great work; this is a very useful tool! I have a couple of questions based on issues I've encountered while playing with the package. Apologies if these reveal a misunderstanding on my part rather than an actual issue with the code.
1) When running the confidence interval calculation on a forest I trained, I encounter negative values of the unbiased variances. Additionally, the more trees my forest has, the more of these negative values appear. Could there be some kind of bias overcorrection?

2) The `_bias_correction` function in the module calculates an `n_var` parameter, which it then applies to the bias correction vector. However, no such expression appears in Eqn. (7) of Wager et al. (2014), according to which the bias correction should be `n_train_samples * boot_var / n_trees` (using the variable names from the package code). Where does `n_var` come from?

3) I don't see any parameter regulating the number of bootstrap draws. Even though O(n) draws should be enough to take care of the Monte Carlo noise, it should still be possible to control this somehow. If I change the `n_samples` parameter, it clashes with the `pred` matrix, which is fixed to the number of trees in the forest. How can the number of draws be regulated?

4) In fact, if I'm reading the paper right, the idea is to look at how the predictions of the individual trees change when different bootstrap samples of the original data are used. That doesn't seem to be what the package is doing: it uses predictions from a single forest on a set of test data instead of predictions from multiple forests on a single new sample. Where is my understanding wrong? To make my reading concrete, I've added a small sketch of the estimator after this list.
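Here is a minimal sketch of how I understand the bias-corrected infinitesimal-jackknife estimate of Eqn. (7). The array names `pred` and `inbag` are my own, not the package's internals:

```python
import numpy as np

def ij_variance_sketch(pred, inbag):
    """Bias-corrected IJ variance, as I read Eqn. (7) of Wager et al. (2014).

    pred  : (n_trees, n_test) per-tree predictions on the test points
    inbag : (n_train, n_trees) bootstrap in-bag counts per tree
    """
    n_trees = pred.shape[0]
    n_train = inbag.shape[0]
    pred_centered = pred - pred.mean(axis=0)            # t*_b minus the mean prediction
    inbag_centered = inbag - inbag.mean(axis=1, keepdims=True)
    # Covariance of in-bag counts with predictions, one row per training point
    cov = inbag_centered @ pred_centered / n_trees      # (n_train, n_test)
    raw_ij = (cov ** 2).sum(axis=0)                     # raw IJ estimate
    # Monte Carlo bias correction, i.e. n_train_samples * boot_var / n_trees
    boot_var = (pred_centered ** 2).mean(axis=0)        # variance across trees
    return raw_ij - n_train * boot_var / n_trees
```

With a finite number of trees that subtraction can overshoot, which would be consistent with the negative values I describe in question 1.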
Thanks, and again, let me know if what I'm asking is off-topic here.
Ondrej