scikit-learn / enhancement_proposals

Enhancement proposals for scikit-learn: structured discussions and rationale for large additions and modifications
https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

SLEP011: Fixing randomness handling in estimators and splitters #24

Closed NicolasHug closed 2 years ago

NicolasHug commented 4 years ago

This SLEP aims at fixing the issues related to passing RandomState instances or None to estimators and CV splitters.

The proposed change is to store an actual random state as returned by numpy's get_state() method in the self.random_state attribute.

The net result is that calling fit or split on the same instance will always use the same RNG. The changes are mostly backward compatible.
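For concreteness, here is a rough sketch of the mechanism (the class and attribute handling are illustrative only; whether the state is materialized at init or at first fit is part of the discussion below):

import numpy as np
from sklearn.utils import check_random_state

class Estimator:
    def __init__(self, random_state=None):
        # Materialize whatever was passed (None, an int, or a
        # RandomState instance) into a concrete state tuple, once.
        self.random_state = check_random_state(random_state).get_state()

    def fit(self, X, y=None):
        # Rebuild the RNG from the same stored state on every call, so
        # repeated fits on the same instance use the same randomness.
        rng = np.random.RandomState()
        rng.set_state(self.random_state)
        # ... use rng for all stochastic choices ...
        return self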

This is a follow up to https://github.com/scikit-learn/scikit-learn/issues/14042 and https://github.com/scikit-learn/scikit-learn/pull/15177.

@scikit-learn/core-devs, this probably looks more like a manifesto than a SLEP at the moment, so please share your comments ;)

amueller commented 4 years ago

I didn't do a review but from a short discussion: I think we need to decide on whether we want "None" to result in the same estimator for multiple runs of fit. I think in particular in cross-validation it might be useful to have different random states.

NicolasHug commented 4 years ago

I think we need to decide on whether we want "None" to result in the same estimator for multiple runs of fit.

This SLEP's answer is no. The reason is (as detailed in the SLEP): allowing fit to differ for a given instance is precisely the cause of all these bugs and unexpected behaviours.
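For concreteness, here is the kind of statefulness at stake under the current behavior (a splitter, where it is easiest to see):

import numpy as np
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
cv = KFold(n_splits=2, shuffle=True, random_state=rng)
X = np.arange(10).reshape(5, 2)
# Successive calls to split() on the same instance consume the shared
# RNG, so they yield different splits.
first = [test for _, test in cv.split(X)]
second = [test for _, test in cv.split(X)]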

I think in particular in cross-validation it might be useful to have different random states.

Yet we/users mostly use random_state=0, right?

amueller commented 4 years ago

Yet we/users mostly use random_state=0, right?

I don't think that's true.

jnothman commented 4 years ago

fit can also store the initial random state to make it reproducible, without changing behaviours otherwise
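A minimal sketch of this alternative, with a hypothetical random_state_ fitted attribute:

from sklearn.utils import check_random_state

class Estimator:
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y=None):
        rng = check_random_state(self.random_state)
        # Record the state this fit starts from, purely for
        # reproducibility; fit's behavior is otherwise unchanged.
        self.random_state_ = rng.get_state()
        # ... use rng as before ...
        return self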

NicolasHug commented 4 years ago

Thanks for the comments, Joel.

Wouldn't a more compliant variant of the proposed solution simply copy the RandomState in check_random_state?

Probably, I just didn't want to modify check_random_state.

There are other related ideas: testing with multiple random states; and using fixed seed by default in all cases

Changing the defaults might minimize issues on the users' side. But it only partially addresses the main issues raised in this SLEP. As long as we allow estimators and splitters to be stateful, we are still exposing ourselves to the bugs described here.

fit can also store the initial random state to make it reproducible, without changing behaviours otherwise

This is discussed in the SLEP here, I believe: https://github.com/scikit-learn/enhancement_proposals/pull/24/files#diff-5f21dbd421ae7420240d592cf8e56332R351

GaelVaroquaux commented 4 years ago

Rather than make scattered comments on the PR, let me try to summarize thoughts that arose from a discussion with @amueller, @adrinjalali and @agramfort at NeurIPS. I think it would be great to integrate these notes in the SLEP, partly because the goal of the SLEP is to document the reasons for the behavior.

Problem1 Origin of the SLEP: SuccessiveHalving, but also other suboptimal behaviors such as warm restart.

Problem2 Other problem mentioned: stateful and non-deterministic behavior are annoying (because for advanced usage, stateless and deterministic are usually better).

Two aspects of the considered modifications should be separated:

Question1 Default: should the default behavior be random or deterministic?

Question2 Dealing with an "rng": should rng objects be valid inputs for random_state?

Considerations for defining the API

We have two populations we are targeting: naive users and advanced users. Ideally, things should be easy/natural for naive users, but leave room for advanced patterns.

For question1 To address question 1 above, we need to consider: what is the "naive" user expectation, where naive is defined as not an expert in scikit-learn? In other terms: when teaching (e.g. random forests), what is the best default? Things called "Random" are often expected to be random.

For question2 An advanced usage that is really important is enabling a complete chain of operations with large entropy, yet keeping it reproducible. Seeding the global rng is not a good solution, because in parallel computing it breaks down due to out-of-order execution.

To take a specific example, one might want to apply cross_val_score to an estimator with randomness, such as a random forest, and use cross_val_score to gauge multiple sources of variance in the score, including that due to this randomness.

I do not see a way of answering this need in a general setting without passing a random number generator around. Passing a random generator around is actually a common pattern in numerical computing.
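For instance, a sketch of that pattern (the helper functions are illustrative):

import numpy as np

def shuffle_rows(X, rng):
    # Each step consumes from the same generator, advancing its state.
    return X[rng.permutation(len(X))]

def add_noise(X, rng, scale=0.1):
    return X + scale * rng.standard_normal(X.shape)

# A single seeded generator drives the whole chain: plenty of entropy
# across steps, yet the full pipeline is reproducible from one seed.
rng = np.random.RandomState(42)
X = rng.standard_normal((100, 5))
X = add_noise(shuffle_rows(X, rng), rng)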

Redefining the contract?

Problem1 is a serious one: it arises from statefulness and a lack of determinism that together make some computing patterns unreliable.

Taking a step back, to address it, we might need a mechanism to enforce determinism in an estimator in the context of these computing patterns.

Overriding random_state

The simplest approach would be that, in such a context, we modify the estimator that is passed in, with something like (code not tested):

import warnings

import numpy as np

def freeze_random_state(estimator, msg='here'):
    # Only act on estimators that expose an uncontrolled RandomState.
    if not hasattr(estimator, 'random_state'):
        return estimator
    random_state = estimator.random_state
    if not isinstance(random_state, np.random.RandomState):
        return estimator
    warnings.warn('random_state parameter overridden: estimator %s has an '
                  'uncontrolled random state leading to non-deterministic '
                  'behavior, which leads to inconsistent behavior %s.'
                  % (estimator, msg))
    # Draw a fixed integer seed from the instance, freezing later fits.
    estimator.random_state = random_state.randint(2**32)
    return estimator

We would call this code (or similar) where appropriate, i.e. only where non-deterministic behavior leads to inconsistency.

The drawback is that it leads to warnings and special cases.

Imposing determinism

A more invasive change (hence one that requires more thought) is along the lines of what is proposed in the SLEP: at init (or at first fit), run check_random_state, and if it is a RandomState instance, turn it into a random int.
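A rough sketch of this variant, assuming the conversion happens at init (code not tested either):

import numpy as np
from sklearn.utils import check_random_state

class Estimator:
    def __init__(self, random_state=None):
        rng = check_random_state(random_state)
        if isinstance(random_state, (type(None), np.random.RandomState)):
            # None or a RandomState instance: freeze into a random int.
            self.random_state = rng.randint(2**31)
        else:
            # An int seed is already deterministic; keep it as-is.
            self.random_state = random_state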

A drawback is that it renders our API more complicated (things happen at init). As a result, libraries that implement our API may not do this, leading to inconsistent behavior.

Also, it may violate users' expectations, as multiple fits of the same random forest will return the same thing, though not multiple instantiations / clones.

GaelVaroquaux commented 4 years ago

I just realized that enforcing determinism at first fit has a strong drawback: it means that the behavior of fit is strongly dependent on the past of the instance.

NicolasHug commented 4 years ago

I just realized that enforcing determinism at first fit has a strong drawback: it means that the behavior of fit is strongly dependent on the past of the instance.

That's the main reason it's not the proposed solution, and that's noted in the SLEP already.

the behavior of fit is strongly dependent on the past of the instance.

This precisely applies to the current design and this is exactly what I want to fix.

rth commented 4 years ago

Thanks for this work! I think we should add a section with considerations about the compatibility of the proposed solution with future support for the new RNG API in numpy (NEP 19 and https://github.com/scikit-learn/scikit-learn/issues/16988).

I think storing only the random seed in estimators, as proposed here, would potentially allow switching the RNG backend with some config option. Of course, the goal of this SLEP is orthogonal, but I think any in-depth refactoring of the random state should take into account that, in the long term, focusing too much on np.random.RandomState might not be the best approach. From NEP 19:

All current usages of RandomState will continue to work in perpetuity, though some may be discouraged through documentation
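To illustrate why storing only an integer seed helps here: the same stored seed could drive either backend behind a config option (a sketch; the option itself is hypothetical):

import numpy as np

seed = 12345
legacy_rng = np.random.RandomState(seed)  # legacy RandomState API
new_rng = np.random.default_rng(seed)     # NEP 19 Generator (PCG64)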

cc @grisaitis

NicolasHug commented 4 years ago

@GaelVaroquaux , thank you for your notes in https://github.com/scikit-learn/enhancement_proposals/pull/24#issuecomment-565761353.

I have read them carefully, but unfortunately, I am not quite sure how I can leverage these notes to make this SLEP move forward, or how to integrate them in the SLEP. There are a few things in these notes that I don't completely understand, but also some other things that I don't fully agree with.

Problem1 Origin of the SLEP: SuccessiveHalving

The SH issue isn't the origin of the SLEP; it's only a data point. The origin is scikit-learn/scikit-learn#14042, and the numerous bugs we've had to fix so far. I want to make that clear because otherwise you might misinterpret my intentions here, which go way beyond the SH issue.

I also don't share the framing of the problem that was proposed, with these two questions: 1. whether randomness should be the default, and 2. whether we should allow passing RNGs. While these two questions are definitely worth considering, I believe they are mostly tangential to this SLEP and to the points that I'm trying to make throughout this doc. To me, and I hope this is properly illustrated in this SLEP, the main relevant question is: do we want our estimators (and splitters) to be stateful across calls to fit() (and split())?

If I interpret your comments properly (https://github.com/scikit-learn/enhancement_proposals/pull/24#issuecomment-565762066 and https://github.com/scikit-learn/enhancement_proposals/pull/24#discussion_r357943873), you and I seem to agree that the answer should be no ;). I might be misinterpreting your thoughts here, so I'd appreciate it if you could clarify that for me when you have time.

Thanks!