stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction

Non reproducibility of results even after np.random.seed is set #22

Closed cdagnino closed 4 years ago

cdagnino commented 4 years ago

Hi,

I expected the training and predictions to be the same after setting np.random.seed. Is there another seed I should set to obtain reproducible results?

Below is an example you can run. I'm using the current version from GitHub.

from ngboost.ngboost import NGBoost
from ngboost.learners import default_tree_learner
from ngboost.scores import MLE
from ngboost.distns import Normal

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

np.random.seed(seed=2334)

X, Y = load_boston(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb_natural = NGBoost(Base=default_tree_learner, Dist=Normal, Score=MLE(), natural_gradient=True,
                      verbose=False)

ngb_natural.fit(X_train, Y_train)
Y_preds_nat = ngb_natural.predict(X_test)

ngb_natural2 = NGBoost(Base=default_tree_learner, Dist=Normal, Score=MLE(), natural_gradient=True,
                      verbose=False)

ngb_natural2.fit(X_train, Y_train)
Y_preds_nat2 = ngb_natural2.predict(X_test)

# np.allclose returns False here, so this assert fails
assert np.allclose(Y_preds_nat, Y_preds_nat2)

# this one fails too
assert np.allclose(Y_preds_nat, Y_preds_nat2, rtol=1e-3)

# this one passes: the predictions only agree to ~1% relative tolerance
assert np.allclose(Y_preds_nat, Y_preds_nat2, rtol=1e-2)

avati commented 4 years ago

You need to re-set the random seed to the same value just before calling fit() the second time.

As written, you only get reproducibility across runs of the entire script.
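
For example, reusing the variables from your snippet, a minimal sketch of the workaround:

np.random.seed(2334)  # re-seed immediately before the first fit
ngb_a = NGBoost(Base=default_tree_learner, Dist=Normal, Score=MLE(),
                natural_gradient=True, verbose=False)
ngb_a.fit(X_train, Y_train)

np.random.seed(2334)  # same seed again, right before the second fit
ngb_b = NGBoost(Base=default_tree_learner, Dist=Normal, Score=MLE(),
                natural_gradient=True, verbose=False)
ngb_b.fit(X_train, Y_train)

# both models were trained from identical generator states
assert np.allclose(ngb_a.predict(X_test), ngb_b.predict(X_test))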

cdagnino commented 4 years ago

Oh, I see! That's a bit unexpected to me. Would you consider putting that somewhere in the docs? I don't know where to put something like that. If there's no reasonable place for it, I'll just close this issue.

avati commented 4 years ago

This is a behavior of randomized algorithms in general, and has nothing to do with ngboost, boosting, or even machine learning. If you want reproducible runs of something that invokes the random number generator, you need to re-initialize it with the same seed just before calling the algorithm. You can verify this behavior with GBM from sklearn as well. I'm not sure it's necessary to mention this in our documentation.

Perhaps we can have some example code that shows this pattern, rather than documentation?
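
For illustration, the general re-seeding pattern in plain numpy (independent of any library):

import numpy as np

np.random.seed(0)
a = np.random.rand(3)
b = np.random.rand(3)   # the generator state has advanced, so b differs from a

np.random.seed(0)       # re-seed just before the second "run"
c = np.random.rand(3)

assert not np.allclose(a, b)
assert np.allclose(a, c)  # identical draws after re-seeding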

cdagnino commented 4 years ago

I think GBM from sklearn IS reproducible, in the sense that setting a global np.random.seed is enough to get the same result from repeatedly fitting a model (see the code example below).

I think the big difference is that random_state is an optional parameter of the GBM estimator class. They then use a check_random_state function (see these lines) to handle it: if random_state is None, it falls back to the global RandomState set by np.random.seed. At least that's my understanding of it!
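
For reference, sklearn's check_random_state behaves roughly like this:

import numpy as np
from sklearn.utils import check_random_state

rng_global = check_random_state(None)  # the global RandomState singleton used by np.random
rng_seeded = check_random_state(42)    # a fresh RandomState seeded with 42
rng_passed = check_random_state(rng_seeded)
assert rng_passed is rng_seeded        # an existing RandomState passes through unchanged

# Because check_random_state(None) hands back the *global* generator,
# an estimator with random_state=None still draws from whatever state
# np.random.seed(...) established, which is why the GBM example below
# is reproducible under a global seed.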

import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

np.random.seed(seed=2334)

X, y = make_hastie_10_2()
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

clf1 = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1).fit(X_train, y_train)
y_pred1 = clf1.predict_proba(X_test)

clf2 = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1).fit(X_train, y_train) 
y_pred2 = clf2.predict_proba(X_test)

# This passes: the two fits produce identical predictions
assert (y_pred1 == y_pred2).all()

avati commented 4 years ago

I see that I was wrong about sklearn GBM. The function you linked seems to be crossing levels of abstraction to provide a behavior that I don't completely understand yet. Looks like GBM re-seeds itself with the global seed for every run?

alejandroschuler commented 4 years ago

To me, that's unexpected behavior from sklearn GBM! Are other models capable of doing that in sklearn, or is it just GBM? If it's just GBM, I think this is more appropriately raised as an issue with sklearn; otherwise we can figure out how to make ngboost match the convention (which, imo, is weird).

cdagnino commented 4 years ago

@avati I think most algorithms in sklearn allow the user to pass a random seed to the classification or regression object itself (besides the GBM example above, check the Random Forest link here). If the random state is None, it uses numpy's global one, so if a global seed has been set it will be used; if nothing is set, the results won't be reproducible.

@alejandroschuler I have no idea what the convention is, that's a good question! I do find the sklearn approach more natural, but I agree with you that whatever is "conventional" should be the deciding factor here.

In any case, I'm not arguing for any particular solution here, just giving you a heads up on a situation I found surprising.


Another data point: h2o allows the user to pass a random seed directly to the algorithm. See here.

I'm not sure if it has a "global" seed, but passing the same seed will ensure the same results.

avati commented 4 years ago

I think I understand. It looks like the algorithms might be attempting to minimize side effects on the global state of the program (using the global random state would alter it across runs of the algorithm).

I now think it would be a good idea to follow sklearn's approach to handling random state. It hands down the state to the base learner as well (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_gb.py#L1236).
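
A rough sketch of that pattern (the names here are hypothetical, not ngboost's actual internals):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import check_random_state

class EstimatorSketch:
    """Hypothetical estimator illustrating the sklearn random_state pattern."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, Y):
        rng = check_random_state(self.random_state)
        # use the estimator's own generator for subsampling...
        idx = rng.choice(len(Y), size=max(1, len(Y) // 2), replace=False)
        # ...and hand a derived seed down to each base learner,
        # mirroring what sklearn's GBM does in the linked code
        tree = DecisionTreeRegressor(random_state=rng.randint(np.iinfo(np.int32).max))
        tree.fit(X[idx], Y[idx])
        return self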

@cdagnino - would it be of interest to you to help implement this change?

cdagnino commented 4 years ago

@avati I can make a first attempt at least! I see that NGBoost subclasses sklearn's BaseEstimator. I'm thinking of bringing in that check_random_state(seed) function from sklearn and seeing if I can make it work with NGBoost.

How does that sound? Is there any build/test procedure I should be aware of before submitting a pull request?

cdagnino commented 4 years ago

Hi @avati , I started working on the pull request over here: https://github.com/cdagnino/ngboost/tree/global_seed

The way I see it, I need to pass this random_state variable to every function within fit (at least) that has a random component. I can guess some of those, but it would be great to have a quick chat so you can walk me through the code (you can reach me at this username at gmail; Slack, Zoom, gitter, etc. would all work). Once I have an idea of where the randomness creeps in, it should be relatively easy to make sure all of it follows the random_state.

avati commented 4 years ago

Hi @cdagnino

There should be only two places that use randomness:

- the subsampling of minibatches (when minibatch_frac < 1), and
- the fitting of the base learners.

Everything else should be perfectly deterministic.

Could you proceed with this understanding and give it a shot? If not, we can sync up on one of those channels to discuss further.

Thanks for giving it a shot! Truly appreciate it!

cdagnino commented 4 years ago

Hey @avati, sure, I'll give it a shot. When you talk about subsampling batches, are you referring to the .sample method?

Does the line_search carry any randomness?

cdagnino commented 4 years ago

Ok! I think I got it working for the predictions:

# Random seed 23, first fit
ngb = NGBRegressor(random_state=23).fit(X_train, Y_train)
Y_preds23v1 = ngb.predict(X_test)

# Random seed 45
ngb = NGBRegressor(random_state=45).fit(X_train, Y_train)
Y_preds45 = ngb.predict(X_test)

# Random seed 23 again
ngb = NGBRegressor(random_state=23).fit(X_train, Y_train)
Y_preds23v2 = ngb.predict(X_test)

np.allclose(Y_preds45, Y_preds23v1)   # False, as expected
np.allclose(Y_preds23v2, Y_preds23v1) # True, as expected :)

I also made it work with minibatch_frac different from 1 (NGBRegressor(random_state=23, minibatch_frac=0.3)). The trick was to use random_state.choice instead of np.random.choice.
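
Illustratively, the change amounts to something like this (the function name is hypothetical, not the exact ngboost internals):

import numpy as np
from sklearn.utils import check_random_state

def sample_minibatch(X, Y, minibatch_frac, random_state):
    rng = check_random_state(random_state)
    n = max(1, int(minibatch_frac * len(Y)))
    # draw indices from the estimator's own RandomState
    # instead of the global np.random
    idx = rng.choice(len(Y), size=n, replace=False)
    return X[idx], Y[idx]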

I've only tested .predict(). Is there a test I should run for the distributions, or something else? Let's think about a few tests that, if passed, would imply the pull request is ready.
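
For instance, a reproducibility test along these lines (a sketch, assuming the same train/test split as above):

def test_predictions_reproducible_given_random_state():
    preds = []
    for _ in range(2):
        model = NGBRegressor(random_state=23, minibatch_frac=0.3)
        model.fit(X_train, Y_train)
        preds.append(model.predict(X_test))
    assert np.allclose(preds[0], preds[1])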

cdagnino commented 4 years ago

Hi @avati . Let me know what you think and I'll send a pull request

alejandroschuler commented 4 years ago

hey @cdagnino thanks for handling this and sorry for the delay! To answer your previous questions: 1) yes, the .sample method implements the batching, and 2) the line search is indeed deterministic.

It seems like the results you're showing here confirm that your solution works. If you send the PR I'm happy to have a look and approve it if it all looks good; I don't expect the internals are terribly complicated.

I will definitely also be implementing some unit tests in the next week, so if you have any suggestions feel free to post!

cdagnino commented 4 years ago

Hi @alejandroschuler, I've sent the pull request.

alejandroschuler commented 4 years ago

Thank you!