
OOB calculation in Random Forests seems to use incorrect indices #19632

Closed. simontindemans closed this issue 5 months ago.

simontindemans commented 3 years ago

Update (9 March 2021)

After the discussion below and further testing, I came to the conclusion that this is not a bug after all. The indices are computed correctly, because _make_estimator always generates a new integer seed that is stored in the random_state attribute of each tree, even when a np.random.RandomState() object was passed to _make_estimator.

Apologies for the false alarm. I got confused by the interchangeable use of random_state as a seed or a random generator (but, interestingly, not as the actual state of the random generator).

Original report

I have been looking into the OOB score calculation for random forests, and I am concerned that the indices considered to be "out-of-bag" are not actually the same as the ones that were held out when training each tree. A few years ago, PR https://github.com/scikit-learn/scikit-learn/issues/4783 was merged, which replaced the explicit storage of per-tree indices with a regeneration procedure.

Specifically, this resulted in the following code fragment in _compute_oob_predictions(self, X, y) in sklearn/ensemble/_forest.py:

for estimator in self.estimators_:
    # Re-derive this tree's out-of-bag indices from its stored random_state
    unsampled_indices = _generate_unsampled_indices(
        estimator.random_state, n_samples, n_samples_bootstrap,
    )

    # Predict on the held-out samples and accumulate predictions and counts
    y_pred = self._get_oob_predictions(
        estimator, X[unsampled_indices, :]
    )
    oob_pred[unsampled_indices, ...] += y_pred
    n_oob_pred[unsampled_indices, :] += 1
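
For context, _generate_unsampled_indices works by re-drawing the bootstrap indices from the stored random_state and taking their complement. A rough paraphrase (from memory, not a verbatim copy of the scikit-learn source) looks like this:

import numpy as np
from sklearn.utils import check_random_state

def generate_unsampled_indices_sketch(random_state, n_samples, n_samples_bootstrap):
    # Re-draw the bootstrap indices that were (supposedly) used to fit the tree
    rng = check_random_state(random_state)
    sample_indices = rng.randint(0, n_samples, n_samples_bootstrap)
    # Keep every sample that was never drawn: the out-of-bag set for this tree
    sample_counts = np.bincount(sample_indices, minlength=n_samples)
    return np.arange(n_samples)[sample_counts == 0]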

If estimator.random_state is an np.random.RandomState object instead of a constant seed, then this appears to generate new random indices rather than the original set of indices. I noticed that this potential shortcoming was also observed by @amueller in the discussion about https://github.com/scikit-learn/scikit-learn/issues/4783, but no follow-up was included there:

Silly question: This fails if a np.RandomState was passed to the forest, right? Because then the random_state would have been altered by building the tree? (or I'm tired)

Originally posted by @amueller in https://github.com/scikit-learn/scikit-learn/issues/4783#issuecomment-236036188

I think this is a concern more generally, because even if a constant seed was passed to the RandomForest initializer, BaseForest.fit(..) executes

random_state = check_random_state(self.random_state)

which turns random_state into a RandomState() object, followed a few lines below by

trees = [self._make_estimator(append=False, random_state=random_state)
         for i in range(n_more_estimators)]

All trees therefore appear to share the same RandomState() instance, which is then used to generate the bootstrap indices. Calling _generate_sample_indices(...) again as part of _generate_unsampled_indices(..) would then seem to always generate different indices.

Am I missing something, or can this procedure never work correctly?
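
To make the worry concrete, here is a tiny illustration (not scikit-learn code) of what would happen if the same RandomState object were consumed once while building a tree and again when recomputing the OOB indices:

import numpy as np

rng = np.random.RandomState(42)
in_bag_at_fit_time = rng.randint(0, 100, 100)   # draw made while building the tree
in_bag_recomputed = rng.randint(0, 100, 100)    # later draw for the OOB calculation
# The shared generator has advanced in between, so the two draws do not match
print(np.array_equal(in_bag_at_fit_time, in_bag_recomputed))  # False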

simontindemans commented 3 years ago

As an aside, should _compute_oob_predictions(self, X, y) not also take a sample_weights argument? It seems to me that the OOB score should be weighted according to the sample weights.

NicolasHug commented 3 years ago

Yup, I think you might be right, thanks for the report.

Ideally we would generate the oob samples at the same place where we're generating the bagging samples (i.e. in _parallel_build_trees) to avoid reproducing the bug in the future. But that might take up a lot of memory... I guess we'll have to sample n_estimators int seeds for that.

Would you like to try and provide a fix @simontindemans ?

simontindemans commented 3 years ago

Thanks for checking. At a glance, I think there are three possible approaches to fix this:

  1. Revert to storing _indices for each tree. Simple and bomb-proof, but at the cost of memory.
  2. Use random_state.get_state() before sampling the indices and store the state for each tree. Then, before regenerating the indices, create a new RandomState object and initialise it with set_state(old_state). That does still use the 'legacy' NumPy random number generation, but is the least intrusive way to fix this.
  3. As you say, generate the oob samples within _parallel_build_trees. However, in order to avoid storing the outputs for all trees, it would be best to accumulate all results in-place. That requires a little more care with joblib, but it can be done using a lock and the same pattern as in _accumulate_prediction (already used for predict). This approach has the side-benefit of parallelising the OOB score calculation (it is currently sequential). One concern about this approach is the interaction with the 'warm start' feature. It will require storing the accumulated predictions for all samples for the eventuality of a warm-start, and it will be necessary to reject OOB scores on a warm start if they were not calculated previously (or only compute them for the added trees, which can lead to dangerous bugs).

It is also possible to hybridise options 2 and 3, storing the state and using a parallel accumulation in place of the current function. I don't have time this week, but might find time to look into this over the weekend if no one else picks it up.

NicolasHug commented 3 years ago

I'd go with 2, but instead of storing the state, just store an int seed for each tree; it's cheaper and it's what we do elsewhere in the code: https://github.com/scikit-learn/scikit-learn/blob/42e90e9ba28fb37c2c9bd3e8aed1ac2387f1d5d5/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py#L216-L221
Basically, create a private _seeds_for_subsampling list of length n_estimators, which would be created at the first fit call and extended at each subsequent call with warm_start.

Then use those seeds when creating the in-bag and out-of-bag samples. This doesn't change anything about parallelization, I believe.
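
A minimal sketch of that idea (the helper name and placement are only indicative, not a final implementation):

import numpy as np
from sklearn.utils import check_random_state

def _draw_subsampling_seeds(self, n_more_estimators):
    # Hypothetical helper: draw one int seed per new tree, so that the in-bag
    # and out-of-bag indices can later be regenerated from the seed alone
    rng = check_random_state(self.random_state)
    new_seeds = rng.randint(np.iinfo(np.uint32).max,
                            size=n_more_estimators, dtype="u8")
    if not hasattr(self, "_seeds_for_subsampling"):
        self._seeds_for_subsampling = []               # first call to fit
    self._seeds_for_subsampling.extend(new_seeds)      # extended under warm_start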

simontindemans commented 3 years ago

Ok, that makes sense and should be straightforward to implement. I will do this in the coming week and submit a pull request.

simontindemans commented 3 years ago

https://github.com/scikit-learn/scikit-learn/blob/42e90e9ba28fb37c2c9bd3e8aed1ac2387f1d5d5/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py#L216-L221

Regarding this snippet, is there any specific reason to use the uint32 maximum in combination with a u8 (64-bit) data type?

NicolasHug commented 3 years ago

I think there's one... but I don't remember ><. I think there was an issue with big seeds on 32-bit architectures, something like that?

simontindemans commented 3 years ago

Ok, I guess that's possible. I will adhere to this format for the sake of consistency, and not much can go wrong by filling the top 4 bytes with zeros.
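
For reference, the pattern being discussed draws a seed capped at the uint32 maximum while asking for a 64-bit unsigned ('u8') result, roughly along these lines (paraphrased, not a verbatim quote of the linked snippet):

import numpy as np
from sklearn.utils import check_random_state

rng = check_random_state(0)
# The cap keeps the value acceptable to RandomState(seed), which rejects seeds
# >= 2**32; the 'u8' dtype merely stores it as a 64-bit unsigned integer.
seed = rng.randint(np.iinfo(np.uint32).max, dtype="u8")
child_rng = np.random.RandomState(seed)   # re-initialise a generator from the seed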

simontindemans commented 3 years ago

I have just implemented a fix in https://github.com/simontindemans/scikit-learn/tree/rf_oob_indices_fix

Unfortunately, it fails on a number of tests (12 in sklearn/ensemble/tests/test_forest.py) and I don't have time at the moment to investigate in detail. It seems a number of them require specific numerical results that change when the order of random seeds has been modified, but perhaps there are bigger issues as well.

NicolasHug commented 3 years ago

@simontindemans would you mind submitting a PR so we can see the failures on the CI and potentially start reviewing?

simontindemans commented 3 years ago

Done. One of the issues is that warm starting (i.e. constructing the forest in two phases) does not result in the same index selections - although I'm not quite sure why.

simontindemans commented 3 years ago

I am beginning to wonder whether my initial suspicion was wrong. The name random_state is used to refer to both integer seeds and np.random.RandomState objects. The original _generate_unsampled_indices function uses tree.random_state, which is set by the _make_estimator function on the basis of the RandomState in fit(..). I had assumed this to be a RandomState object.

Delving deeper into _make_estimator, I see that it calls _set_random_states, which sets every random_state attribute in the tree to a constant integer. So perhaps this is not a bug after all, just a really clever (or too clever) way of doing things.

It still doesn't fully explain why my proposed fix does not work, though.
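
To illustrate the mechanism above with a small example (not scikit-learn code): once each tree carries its own integer seed, re-seeding a fresh generator reproduces the bootstrap draw exactly:

import numpy as np

tree_seed = 7  # stands in for the per-tree integer seed set by _set_random_states

in_bag_at_fit_time = np.random.RandomState(tree_seed).randint(0, 100, 100)
in_bag_regenerated = np.random.RandomState(tree_seed).randint(0, 100, 100)
# Fresh generators seeded with the same integer produce identical draws
print(np.array_equal(in_bag_at_fit_time, in_bag_regenerated))  # True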

simontindemans commented 3 years ago

I understand now why the warm start tests fail. For an ensemble with n trees, the fit function of the existing code draws exactly n random numbers to initialise the seed of each tree. When fit is called again with the warm_start argument, n dummy random numbers are drawn first to reach the same generator state, before drawing the random numbers for the new trees. This results in a forest created in one go having exactly the same structure as one built with `warm_start=True`.

In my code modification, I draw n extra random numbers to obtain seeds for bootstrapping. Because they are drawn in blocks of n, there is no easy way to reproduce a similar warm start procedure: that would require the random numbers for tree i to be drawn from the random stream at positions k*i, i.e. first drawing all random numbers for one tree, then the next, and so on.

For a future release, it would be good to switch over to numpy SeedSequence or something similar, for greater clarity and robustness. But I suspect it actually works ok now -- as long as you don't mess with the code too much.
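
A minimal sketch of what a SeedSequence-based scheme could look like (purely illustrative, not part of any PR):

import numpy as np

ss = np.random.SeedSequence(12345)
# One child sequence per tree; spawning more children later (e.g. under
# warm_start) does not disturb the seeds already handed out.
tree_seeds = ss.spawn(100)
extra_tree_seeds = ss.spawn(50)   # trees added on a later fit
rngs = [np.random.default_rng(s) for s in tree_seeds + extra_tree_seeds]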

I also looked up the 32-bit thing: apparently, RandomState(seed) only takes 32-bit seeds.

@NicolasHug , if you agree that I was mistaken in creating this issue, because integer seeds are in fact used to initialise each tree, I will close the issue and drop the PR.

NicolasHug commented 3 years ago

the fit function of the existing code draws exactly n random numbers to initialise the seed of each tree. When fit is called again with the warm_start argument, n dummy random numbers are drawn first to reach the same generator state, before drawing the random numbers for the new trees. This results in a forest created in one go having exactly the same structure as one built with `warm_start=True`.

I think this makes no sense, and the check is even incorrect. I opened #19648 to remove that behavior. If you merge #19648 (which hopefully will be accepted) into your PR branch, does that make things go smoother? I'd rather have the subsampling implementation from your PR than the current one, which seems to give both of us a hard time.

simontindemans commented 3 years ago

I have merged your fix. That resolves most errors. There are three errors left in the test_forest test suite. All three are numerical deviations just outside the tolerances (0.107 instead of <= 0.100), which could well be a statistical fluke. I have added a temporary np.testing.assert_equal check to verify that my indices are recomputed correctly, and that seems to be ok.

simontindemans commented 3 years ago

@NicolasHug, I can confirm that the original code does work correctly, although in a slightly obfuscated manner. See also my update in the top post. You can remove the bug label from this issue, although the proposed modifications (and the discussion in #19648) may still be useful to improve the maintainability of the code.

adrinjalali commented 3 years ago

Removing the milestone, pinging @NicolasHug since you tagged this for 1.0

adrinjalali commented 5 months ago

Since this no longer seems to be a bug (https://github.com/scikit-learn/scikit-learn/issues/19632#issuecomment-793595206), closing this issue and the corresponding PR.