stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0

Is it possible to use sample weights, or any plans to add a sample_weight parameter? #50

Closed kmedved closed 4 years ago

kmedved commented 4 years ago

This is an amazing project and I have high hopes for using ngboost in my work. I don't currently see any sample_weight functionality. Are there any plans to add this? (I apologize, as I lack the technical expertise to do it myself).

alejandroschuler commented 4 years ago

Great suggestion. I just implemented this in https://github.com/stanfordmlgroup/ngboost/commit/f108d672e62633ca5cea90720c794953b77044b5. Feel free to try it out and let me know if it works for you! You should be able to use it like:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from ngboost import NGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
weight = np.random.random(Y_train.shape)  # random weights, just for testing

ngb = NGBClassifier()
ngb.fit(X_train, Y_train, sample_weight=weight)

You can even use weights for the validation set, and in combination with early stopping:

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)

weight = np.random.random(Y_train.shape)
val_weight = np.random.random(Y_test.shape)

ngb = NGBClassifier()
# early stopping halts fitting once the validation loss has failed to improve
# on its running minimum for K consecutive iterations (K=10 here)
ngb.fit(X_train, Y_train, X_val=X_test, Y_val=Y_test,
        sample_weight=weight, val_sample_weight=val_weight,
        early_stopping_rounds=10)

These are classification examples but it should work the same for regression.

alejandroschuler commented 4 years ago

I should note that the model initialization does not use the sample weights, since scipy.stats distributions unfortunately don't accept sample weights in their fit() methods. The workaround would be to override those methods with our own weighted implementations, but I'm not sure it's worth the effort. Initialization is somewhat arbitrary anyway.
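For what it's worth, such an override would be little code for simple distributions. Here's a numpy-only sketch of a weighted maximum-likelihood fit for a Normal (the function name is illustrative, not part of ngboost or scipy):

```python
import numpy as np

def weighted_normal_fit(y, sample_weight=None):
    """Weighted MLE for a Normal distribution: weighted mean and
    (biased, MLE-style) weighted standard deviation.
    Illustrative sketch -- not part of ngboost or scipy."""
    y = np.asarray(y, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones_like(y)
    w = np.asarray(sample_weight, dtype=float)
    mu = np.average(y, weights=w)
    var = np.average((y - mu) ** 2, weights=w)
    return mu, np.sqrt(var)

# a point with weight 2 counts as if it appeared twice
mu, sigma = weighted_normal_fit([1.0, 2.0, 3.0], sample_weight=[1.0, 1.0, 2.0])
```

Swapping something like this in for `scipy.stats.norm.fit` at initialization would make the starting parameters respect the weights too.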

kmedved commented 4 years ago

Skipping the weights at initialization is fine, I think. From early testing, this seems to work great. I will do further testing in the next week to confirm the results.

Thanks!