Closed cdagnino closed 4 years ago
You need to re-set the random seed to the same value just before calling fit() the second time.
In the current way you only get reproducibility across runs of the entire script.
Oh, I see! That's a bit unexpected to me. Would you consider putting that somewhere in the docs? I don't know where to put something like that. If there's no reasonable place for it, I'll just close this issue.
This is a behavior of randomized algorithms in general, and has nothing to do with ngboost, boosting, or even machine learning. If you want reproducible runs of something that invokes the random number generator, then you need to re-initialize the generator to the same seed just before calling the algorithm. You can verify this behavior with GBM from sklearn as well. I'm not sure that it is necessary to mention this in our documentation.
Perhaps we can have some example code that shows this pattern, rather than documentation?
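As a rough sketch of that re-seeding pattern (using a trivial stand-in for any routine that draws from numpy's global RNG, not actual ngboost code):

```python
import numpy as np

def randomized_routine():
    # Stand-in for any fit() that draws from numpy's global RNG,
    # e.g. subsampling rows before fitting.
    return np.random.rand(5)

np.random.seed(42)            # seed once...
first = randomized_routine()

np.random.seed(42)            # ...and re-seed just before the second call
second = randomized_routine()

# Identical, because the generator was reset to the same state.
assert np.array_equal(first, second)
```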
I think GBM from sklearn IS reproducible, in the sense that setting a global np.random.seed is enough to get the same result from repeatedly fitting a model (see the code example below). I think the big difference is that random_state is an optional parameter within the GBM object class. They then use a check_random_state function (see these lines) to handle things. If random_state has the value None, it just uses the "global" state set by np.random.seed. At least that's my understanding of it!
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

np.random.seed(seed=2334)

X, y = make_hastie_10_2()
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

clf1 = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                  max_depth=1).fit(X_train, y_train)
y_pred1 = clf1.predict_proba(X_test)

clf2 = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                  max_depth=1).fit(X_train, y_train)
y_pred2 = clf2.predict_proba(X_test)

# Passes: both fits produce identical predictions
assert (y_pred1 == y_pred2).all()
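For what it's worth, check_random_state's dispatch can be verified directly; this is my reading of its behavior, sketched as a quick check:

```python
import numpy as np
from sklearn.utils import check_random_state

# None falls back to numpy's global RandomState singleton, which is
# why a global np.random.seed(...) call still controls the estimator.
assert check_random_state(None) is np.random.mtrand._rand

# An integer produces a fresh, self-contained RandomState.
rng_a = check_random_state(7)
rng_b = check_random_state(7)
assert np.array_equal(rng_a.rand(3), rng_b.rand(3))

# An existing RandomState instance is passed through unchanged.
my_rng = np.random.RandomState(0)
assert check_random_state(my_rng) is my_rng
```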
I see that I was wrong about sklearn GBM. The function you linked seems to be crossing levels of abstraction to provide a behavior that I don't completely understand yet. Looks like GBM re-seeds itself with the global seed for every run?
To me, that's unexpected behavior from sklearn GBM! Are other models capable of doing that in sklearn, or is it just GBM? If so, I think this is more appropriately raised as an issue with sklearn, otherwise we can figure out how to make ngboost match the convention (which, imo, is weird).
@avati I think most algorithms in sklearn allow the user to pass a random seed to the classification or regression object itself (besides the GBM example above, check the Random Forest link here). If random_state is None, then it will use numpy's global random state. If that has been set globally, then it will use that one. If it's not set, then it will pick a random seed and not be reproducible.
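To illustrate the cases just described (RandomForestClassifier is used here only as a convenient example of the sklearn convention):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Case 1: explicit random_state -> reproducible regardless of global state.
p1 = RandomForestClassifier(n_estimators=10, random_state=5).fit(X, y).predict_proba(X)
p2 = RandomForestClassifier(n_estimators=10, random_state=5).fit(X, y).predict_proba(X)
assert np.array_equal(p1, p2)

# Case 2: random_state=None with the global seed set just before each fit
# -> reproducible, because None falls back to numpy's global RandomState.
np.random.seed(11)
p3 = RandomForestClassifier(n_estimators=10).fit(X, y).predict_proba(X)
np.random.seed(11)
p4 = RandomForestClassifier(n_estimators=10).fit(X, y).predict_proba(X)
assert np.array_equal(p3, p4)

# Case 3: no seed at all -> results generally differ from run to run.
```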
@alejandroschuler I have no idea what the convention is, that's a good question! I do find the sklearn approach more natural, but I agree with you that what is "conventional" should be the deciding factor here. In any case, I'm not arguing for any particular solution here, just giving you a heads-up on a situation I found surprising.
Another data point: h2o allows the user to pass a random seed directly to the algorithm (see here). I'm not sure if it has a "global" seed, but passing the same seed will ensure the same results.
I think I understand. It looks like the algorithms might be attempting to minimize side effects on the global state of the program (using the global random state would alter it across runs of the algorithm).
I now think it would be a good idea to follow sklearn's approach to handling random state. It hands down the state to the base learner as well (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_gb.py#L1236).
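A minimal sketch of that hand-down pattern, assuming a simple residual-boosting loop (the function and names are purely illustrative, not sklearn's actual code):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import check_random_state

def fit_boosting_stages(X, y, n_stages=3, random_state=None):
    """Hypothetical sketch: one master RNG is shared by every base learner."""
    rng = check_random_state(random_state)
    learners = []
    resid = y.astype(float)
    for _ in range(n_stages):
        # Hand the shared RNG down to the base learner, in the spirit of
        # sklearn's gradient boosting, so the whole ensemble is reproducible.
        tree = DecisionTreeRegressor(max_depth=1, random_state=rng)
        tree.fit(X, resid)
        resid = resid - tree.predict(X)
        learners.append(tree)
    return learners

X = np.random.RandomState(0).rand(50, 3)
y = X[:, 0] * 2.0

m1 = fit_boosting_stages(X, y, random_state=42)
m2 = fit_boosting_stages(X, y, random_state=42)
assert all(np.array_equal(a.predict(X), b.predict(X)) for a, b in zip(m1, m2))
```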
@cdagnino - would it be of interest to you to help implement this change?
@avati I can make a first try at least!
I see that NGBoost subclasses BaseEstimator from sklearn. I'm thinking of bringing in that check_random_state(seed) function from sklearn and seeing if I can make it work with NGBoost.
How does that sound? Is there any build/test procedure I should be aware of before submitting a pull request?
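For concreteness, the wiring I have in mind looks roughly like this (ToyRegressor and its internals are purely illustrative, not ngboost code):

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.utils import check_random_state

class ToyRegressor(BaseEstimator):
    """Hypothetical sketch of threading random_state through fit()."""

    def __init__(self, random_state=None):
        # Store the parameter untouched, per sklearn convention;
        # normalization happens inside fit().
        self.random_state = random_state

    def fit(self, X, y):
        rng = check_random_state(self.random_state)
        # Every random draw goes through rng, never np.random directly.
        idx = rng.choice(len(X), size=len(X) // 2, replace=False)
        self.coef_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        return self

    def predict(self, X):
        return X @ self.coef_

X = np.arange(20, dtype=float).reshape(10, 2)
y = X @ np.array([1.0, -2.0])

p1 = ToyRegressor(random_state=3).fit(X, y).predict(X)
p2 = ToyRegressor(random_state=3).fit(X, y).predict(X)
assert np.allclose(p1, p2)
```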
Hi @avati , I started working on the pull request over here: https://github.com/cdagnino/ngboost/tree/global_seed
The way I see it, I need to pass this random_state variable to every function within fit (at least) that has a random component. I can guess some of those, but it would be great to have a quick chat so you can walk me through the code (you can reach me at this username at gmail). Slack, Zoom, gitter, etc. would be welcome.
Once I have an idea of where the randomness creeps in, it should be relatively easy to make sure that all of it flows through the random_state.
Hi @cdagnino
There should be only two places that use randomness:
subsampling batches
base learner fit() internals
Everything else should be perfectly deterministic.
Could you proceed with this understanding and give it a shot? If not, we can sync up on one of those channels to discuss further.
Thanks for giving it a shot! Truly appreciate it!
Hey @avati
Sure, I'll give it a shot.
When you talk about subsampling batches, are you referring to the .sample method? Does the line_search carry any randomness?
Ok! I think I got it working for the predictions:

# Random seed 23, first run
ngb = NGBRegressor(random_state=23).fit(X_train, Y_train)
Y_preds23v1 = ngb.predict(X_test)

# Random seed 45
ngb = NGBRegressor(random_state=45).fit(X_train, Y_train)
Y_preds45 = ngb.predict(X_test)

# Try again with random seed 23
ngb = NGBRegressor(random_state=23).fit(X_train, Y_train)
Y_preds23v2 = ngb.predict(X_test)

np.allclose(Y_preds45, Y_preds23v1)    # False, as expected
np.allclose(Y_preds23v2, Y_preds23v1)  # True, as expected :)
I also made it work with minibatch_frac different from 1 (e.g. NGBRegressor(random_state=23, minibatch_frac=0.3)). I figured the trick was to use random_state.choice instead of np.random.choice.
I've only tested .predict(). Is there a test I should do for the distributions or something else? Let's think about a few tests that, if passed, would imply the pull request is ready.
Hi @avati . Let me know what you think and I'll send a pull request
hey @cdagnino thanks for handling this and sorry for the delay! To answer your previous questions: 1) yes, the sample method implements the batching and 2) the line search is indeed deterministic.
It seems like the results you're showing here confirm your solution works ok. If you send the PR I'm happy to have a look and approve it if it all looks good. I don't presume that the internals are terribly complicated.
I will definitely also be implementing some unit tests in the next week, so if you have any suggestions feel free to post!
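A couple of candidate checks, sketched generically; RandomForestRegressor stands in for NGBRegressor here only so the snippet is self-contained, and the helper names are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor  # stand-in for NGBRegressor

def same_seed_same_predictions(model_cls, X_tr, y_tr, X_te, seed=23):
    """Identical seeds should yield identical predictions."""
    p1 = model_cls(random_state=seed).fit(X_tr, y_tr).predict(X_te)
    p2 = model_cls(random_state=seed).fit(X_tr, y_tr).predict(X_te)
    return np.allclose(p1, p2)

def different_seed_different_predictions(model_cls, X_tr, y_tr, X_te):
    """Sanity check that the seed actually reaches the randomness."""
    p1 = model_cls(random_state=23).fit(X_tr, y_tr).predict(X_te)
    p2 = model_cls(random_state=45).fit(X_tr, y_tr).predict(X_te)
    return not np.allclose(p1, p2)

rng = np.random.RandomState(0)
X, y = rng.rand(80, 3), rng.rand(80)
X_tr, X_te, y_tr = X[:60], X[60:], y[:60]

assert same_seed_same_predictions(RandomForestRegressor, X_tr, y_tr, X_te)
assert different_seed_different_predictions(RandomForestRegressor, X_tr, y_tr, X_te)
```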
Hi @alejandroschuler I've sent the pull request
Thank you!
Hi,
I expected the training and predictions to be the same after setting np.random.seed. Is there another seed I should set to obtain reproducible results? Below I have an example you can run. I'm using the current version from GitHub.