*Closed by @simontindemans 5 months ago*
As an aside, should `_compute_oob_predictions(self, X, y)` not also take a `sample_weights` argument? It seems to me that the OOB score should be weighted according to the sample weights.
Yup, I think you might be right, thanks for the report.
Ideally we would generate the OOB samples at the same place where we're generating the bagging samples (i.e. in `_parallel_build_trees`) to avoid reproducing the bug in the future. But that might take up a lot of memory... I guess we'll have to sample `n_estimators` int seeds for that.
Would you like to try and provide a fix @simontindemans ?
Thanks for checking. At a glance, I think there are three possible approaches to fix this:

1. Store the `_indices` for each tree. Simple and bomb-proof, but at the cost of memory.
2. Call `random_state.get_state()` before sampling the indices and store the state for each tree. Then, before regenerating the indices, create a new RandomState object and initialise it with `set_state(old_state)`. That does still use the 'legacy' NumPy random number generation, but is the least intrusive way to fix this.
3. Compute the OOB predictions inside `_parallel_build_trees`. However, in order to avoid storing the outputs for all trees, it would be best to accumulate all results in-place. That requires a little more care with joblib, but it can be done using a lock and the same pattern as in `_accumulate_prediction` (already used for `predict`). This approach has the side-benefit of parallelising the OOB score calculation (it is currently sequential). One concern about this approach is the interaction with the 'warm start' feature: it will require storing the accumulated predictions for all samples for the eventuality of a warm start, and it will be necessary to reject OOB scores on a warm start if they were not calculated previously (or only compute them for the added trees, which can lead to dangerous bugs).

It is also possible to hybridise options 2 and 3, storing the state and using a parallel accumulation in place of the current function. I don't have time this week, but might find the time to look into this in the weekend, if no-one else picks it up.
I'd go with 2, but instead of storing the state, just store an int seed (for each tree); it's cheaper and that's what we do in other places of the code:
https://github.com/scikit-learn/scikit-learn/blob/42e90e9ba28fb37c2c9bd3e8aed1ac2387f1d5d5/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py#L216-L221
Basically, create a private `_seeds_for_subsampling` list with length `n_estimators`, which would be created at the first `fit` call and extended at each subsequent call with `warm_start`.

Then use those seeds when creating the in-bag and out-of-bag samples. This doesn't change anything about parallelization, I believe.
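A numpy-only sketch of this scheme (the name `seeds_for_subsampling` mirrors the suggested `_seeds_for_subsampling`; everything else is illustrative, not scikit-learn code):

```python
import numpy as np

n_estimators, n_samples = 5, 20
rng = np.random.RandomState(42)

# One 32-bit seed per tree, drawn once at the first fit call; with
# warm_start the list would be extended, not re-drawn.
seeds_for_subsampling = [
    int(rng.randint(np.iinfo(np.uint32).max, dtype='u8'))
    for _ in range(n_estimators)
]

def in_bag_indices(seed, n_samples):
    # Rebuilding a RandomState from the stored seed makes the
    # bootstrap draw reproducible.
    return np.random.RandomState(seed).randint(0, n_samples, n_samples)

def oob_mask(seed, n_samples):
    # The out-of-bag samples are exactly the complement of the in-bag draw.
    mask = np.ones(n_samples, dtype=bool)
    mask[in_bag_indices(seed, n_samples)] = False
    return mask

# The indices regenerate deterministically from each stored seed.
for seed in seeds_for_subsampling:
    assert np.array_equal(in_bag_indices(seed, n_samples),
                          in_bag_indices(seed, n_samples))
```

Because each tree's indices depend only on its own seed, the in-bag and out-of-bag draws can happen in any order and in parallel.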
Ok, that makes sense and should be straightforward to implement. I will do this in the coming week and submit a pull request.
Regarding this snippet, is there any specific reason to use the uint32 maximum in combination with a `u8` (64-bit) data type?
I think there's one... but I don't remember ><. I think there was an issue with big seeds on 32bits architectures, something like that?
Ok, I guess that's possible. I will adhere to this format for the sake of consistency; not much can go wrong by filling the top 4 bytes with zeros.
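For reference, the pattern from the linked snippet can be sketched as follows (assuming only that `rng` is a legacy `RandomState`): the drawn value stays below the uint32 maximum, so it is always a valid `RandomState` seed, while being stored as a 64-bit unsigned integer whose top 4 bytes are zero.

```python
import numpy as np

rng = np.random.RandomState(0)
# Draw a seed in [0, 2**32 - 1) but store it with a 64-bit unsigned dtype.
seed = rng.randint(np.iinfo(np.uint32).max, dtype='u8')

assert 0 <= int(seed) < 2**32
# The value round-trips into a new RandomState without error.
np.random.RandomState(int(seed))
```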
I have just implemented a fix in https://github.com/simontindemans/scikit-learn/tree/rf_oob_indices_fix. Unfortunately, it fails on a number of tests (12 in `sklearn/ensemble/tests/test_forest.py`) and I don't have time at the moment to investigate in detail. It seems a number of them require specific numerical results that change when the order of random seeds has been modified, but perhaps there are bigger issues as well.
@simontindemans would you mind submitting a PR so we can see the failures on the CI and potentially start reviewing?
Done. One of the issues is that warm starting (i.e. constructing the forest in two phases) does not result in the same index selections - although I'm not quite sure why.
I am beginning to wonder whether my initial suspicion was wrong. The name `random_state` is used to refer both to integer seeds and to `np.random.RandomState` objects. The original `_generate_unsampled_indices` function uses `tree.random_state`, which is set by the `_make_estimator` function on the basis of the RandomState in `fit(..)`. I had assumed this to be a RandomState object.

Delving deeper into `_make_estimator`, it calls `_set_random_states`, which sets all `random_state` attributes in the tree to a constant integer. So perhaps this is not a bug after all, just a really clever (or too clever) way of doing things.
It still doesn't fully explain why my proposed fix does not work, though.
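A minimal numpy-only illustration of the two meanings of `random_state` that caused the confusion:

```python
import numpy as np

# A shared RandomState instance advances its stream on every draw, so
# "regenerating" the indices with it yields different values:
shared = np.random.RandomState(0)
a = shared.randint(0, 100, 10)
b = shared.randint(0, 100, 10)
assert not np.array_equal(a, b)

# An int seed (what _set_random_states ultimately stores on each tree)
# rebuilds a fresh generator each time, so the draws are reproducible:
c = np.random.RandomState(7).randint(0, 100, 10)
d = np.random.RandomState(7).randint(0, 100, 10)
assert np.array_equal(c, d)
```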
I understand now why the warm start tests fail. For an ensemble with n trees, the `fit` function of the existing code draws exactly n random numbers to initialise the seeds of the trees themselves. When `fit` is called again with the `warm_start` argument, n dummy random numbers are drawn to reach the same state, before initialising the seeds of the new trees. This results in a forest created in one go having the exact same structure as one built with `warm_start=True`.

In my code modification, I draw n extra random numbers for the bootstrapping seeds. Having drawn them in blocks of n, there is no easy way to create a similar warm start procedure: that would require the random numbers for tree i to be drawn from the random stream at interleaved positions, i.e. first draw all random numbers for one tree, then the next, and so on.
For a future release, it would be good to switch over to numpy SeedSequence or something similar, for greater clarity and robustness. But I suspect it actually works ok now -- as long as you don't mess with the code too much.
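For what it's worth, a sketch of how `np.random.SeedSequence` could handle warm starts (this is not what scikit-learn currently does): spawned children are indexed by a spawn key, so tree i receives the same entropy whether the forest is built in one go or in batches.

```python
import numpy as np

root = np.random.SeedSequence(12345)

# First fit: 3 trees; warm start: 2 more. Spawning continues at the next
# child index, so batching does not change any tree's seed material.
first_batch = root.spawn(3)
warm_batch = root.spawn(2)  # children with spawn keys (3,) and (4,)

# Building all 5 trees in one go yields the same child sequences.
all_at_once = np.random.SeedSequence(12345).spawn(5)
for child_a, child_b in zip(first_batch + warm_batch, all_at_once):
    assert (child_a.generate_state(4).tolist()
            == child_b.generate_state(4).tolist())
```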
I also looked up the 32-bit thing: apparently, `RandomState(seed)` only takes 32-bit seeds.
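This limit is easy to verify: the legacy constructor accepts integer seeds up to `2**32 - 1` and rejects anything larger.

```python
import numpy as np

np.random.RandomState(2**32 - 1)  # largest valid int seed, accepted
try:
    np.random.RandomState(2**32)  # one past the limit
    raised = False
except ValueError:
    raised = True
assert raised
```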
@NicolasHug , if you agree that I was mistaken in creating this issue, because integer seeds are in fact used to initialise each tree, I will close the issue and drop the PR.
> the fit function of the existing code draws exactly n random numbers to initialise the seeds of each tree itself. When fit is called again with the warm_start argument, n dummy random numbers are drawn to reach the same state, before instantiating the random numbers of the new trees. This results in a forest created in one go having the exact same structure as one that has used `warm_start=True`.
I think this makes no sense, and the check is even incorrect. I opened #19648 to remove that behavior. If you merge #19648 (which hopefully will be accepted) into your PR branch, does that make things go smoother? I'd rather have the implementation in your PR for the subsampling, rather than the current one, which seems to give both of us a hard time.
I have merged your fix. That resolves most errors. There are three errors left in the `test_forest` test suite. All three are numerical deviations that are just outside of tolerances (0.107 instead of <= 0.100). That could well be a statistical fluke. I have done a check to see whether my indices are recomputed correctly, with a temporary `np.testing.assert_equal`, which seems to be ok.
@NicolasHug, I can confirm that the original code, although in a slightly obfuscated manner, does work correctly. See also my update in the top post. You can remove the bug label from this post, although the proposed modifications (and discussion in #19648) may still be useful to improve maintainability of the code.
Removing the milestone, pinging @NicolasHug since you tagged this for 1.0
Since this seems to be no more a bug (https://github.com/scikit-learn/scikit-learn/issues/19632#issuecomment-793595206), closing this and the corresponding PR.
**Update 9 March 2021**

After the discussion below and further testing, I came to the conclusion that this is not a bug after all. The indices are computed correctly, because `_make_estimator` always generates a new random seed that is stored in the `random_state` variable of each tree -- even if a `np.random.RandomState()` object was passed to `_make_estimator`. Apologies for the false alarm. I got confused by the interchangeable use of `random_state` as a seed or a random generator (but, interestingly, not as the actual state of the random generator).

**Original report**
I have been looking into the OOB score calculation for random forests, and I am concerned that the indices considered to be "out-of-bag" are not actually the same as those held out for training. A few years ago, PR https://github.com/scikit-learn/scikit-learn/issues/4783 was implemented, which replaced the explicit storage of indices by a regeneration procedure.
Specifically, this resulted in the following code fragment in `_compute_oob_predictions(self, X, y)` in `sklearn/ensemble/_forest.py`:

If `estimator.random_state` is a `np.RandomState` object instead of a constant seed, then this appears to generate new random indices instead of the original set of indices. I noticed that this potential shortcoming was also observed by @amueller in the discussion about https://github.com/scikit-learn/scikit-learn/issues/4783, but no follow-up was included there:

Originally posted by @amueller in https://github.com/scikit-learn/scikit-learn/issues/4783#issuecomment-236036188
I think this is more generally a concern, because even if a constant seed was passed to the RandomForest initializer, then `BaseForest.fit(..)` implements

`random_state = check_random_state(self.random_state)`

which makes `random_state` a RandomState() object, followed by (a few lines below):

`trees = [self._make_estimator(append=False, random_state=random_state) for i in range(n_more_estimators)]`

All trees appear to share the same RandomState() instance, which is then used to generate indices. Calling `_generate_sample_indices(...)` again as part of `_generate_unsampled_indices(..)` would seem to always generate different indices. Am I missing something, or can this procedure never work correctly?