interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License

Validation or OOB score #497

Open DerWeh opened 4 months ago

DerWeh commented 4 months ago

Sklearn's HistGradientBoostingClassifier provides a validation_score_ if a validation fraction is used; likewise, the RandomForestClassifier (optionally) provides an oob_score_ if the corresponding argument is set. These values are very useful for getting a cheap accuracy estimate (compared to cross-validation) for the actual model, rather than for models trained on the cross-validation splits.

EBM also uses bagging and a validation split, so similar metrics should be readily available. Would it be possible to expose a corresponding option in the ExplainableBoostingClassifier?
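
For reference, a minimal sketch of the sklearn behaviour being described (parameter choices here are arbitrary):

```python
# sklearn's built-in cheap estimates: per-iteration scores on the held-out
# validation fraction, and an out-of-bag accuracy estimate from bagging.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=2000, random_state=0)

hgb = HistGradientBoostingClassifier(early_stopping=True, validation_fraction=0.1,
                                     random_state=0).fit(X, y)
print(hgb.validation_score_[-1])  # score on the held-out validation data (higher is better)

rf = RandomForestClassifier(oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)              # accuracy estimated on out-of-bag samples
```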

paulbkoch commented 4 months ago

Thanks @DerWeh -- It's a great suggestion and I think this is something we should add, however it's a bit trickier to implement in our setting than might appear at first glance. The difference between EBMs and HistGradientBoostingClassifier / RandomForestClassifier is that the EBM boosting process internally uses an approximate calculation of log loss for performance reasons. The approximation isn't particularly close to the actual log loss, but it works in our setting because we only care about relative changes in the value rather than the actual value. We have the ability to turn off the approximation from python, but since we use early stopping, we don't know ahead of time which boosting iteration will end up being the best. At the point where we know that information, we've already overwritten the information in the C++ framework that we would need to re-run the log loss calculation on the earlier boosting iteration that ended up being the best one.

The best way to implement this would therefore be to have the python code extract the resulting "best iteration" model, and then call back into a new C++ API that would do the exact log loss calculation using the information we would need, which still resides in python.

I'll leave this item in our backlog for future implementation and/or as a PR.

However, if you really want to do this today, it's possible to perform this calculation yourself. The first thing you would need to do is define the outer bags yourself so that you know exactly what the validation sets were. You can do this through the "bags" parameter of the fit function:

https://interpret.ml/docs/python/api/ExplainableBoostingClassifier.html#interpret.glassbox.ExplainableBoostingClassifier.fit
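
For example, a minimal sketch of defining the outer bags explicitly, assuming the documented bags convention of positive entries for training rows, negative entries for validation rows, and 0 for unused rows (the 15% validation fraction is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from interpret.glassbox import ExplainableBoostingClassifier

X, y = make_classification(n_samples=2000, random_state=0)

n_outer_bags = 8
rng = np.random.default_rng(0)

# One row per outer bag: mark ~15% of samples as that bag's validation set (-1),
# the rest as training (+1).
bags = np.ones((n_outer_bags, len(y)), dtype=np.int8)
for bag in bags:
    val_idx = rng.choice(len(y), size=int(0.15 * len(y)), replace=False)
    bag[val_idx] = -1

ebm = ExplainableBoostingClassifier(outer_bags=n_outer_bags)
ebm.fit(X, y, bags=bags)
```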

After the model is constructed, you then need to know what the individual bagged models were. These are available through the "bagged_scores_" and "bagged_intercept_" attributes. If you did the following assignment:

ebm0 = ebm.copy()
ebm0.term_scores_ = [scores[0] for scores in ebm.bagged_scores_]
ebm0.intercept_ = ebm.bagged_intercept_[0]

Then ebm0 would predict identically to the 0th outer bagged model. Put this assignment in a loop and you can have access to all the outer bagged models to evaluate on the validation sets.
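
Continuing the bags sketch above, such a loop could look roughly like this (a hedged sketch, not an official API: attribute names are taken from the assignment above, and log loss is just one possible choice of metric):

```python
from sklearn.metrics import log_loss

val_losses = []
for b in range(n_outer_bags):
    # Rebuild the b-th outer bagged model from the bagged attributes.
    ebm_b = ebm.copy()
    ebm_b.term_scores_ = [scores[b] for scores in ebm.bagged_scores_]
    ebm_b.intercept_ = ebm.bagged_intercept_[b]

    # Score it on the rows that this outer bag held out for validation.
    val_rows = bags[b] < 0
    val_losses.append(log_loss(y[val_rows], ebm_b.predict_proba(X[val_rows])))

print(np.mean(val_losses))  # cheap validation-style log loss, averaged over the outer bags
```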

DerWeh commented 4 months ago

Thanks for the valuable insight. I think it would be possible to add this (kind of) workaround in the Python interface.

During the fit, the bags are known. At the end of the fitting routine, we could conditionally do the prediction on the out-of-bag samples and calculate an OOB score. I assume the overhead would be quite negligible, as prediction with EBMs is fast while fitting itself is slow.

If you agree, I would be willing to try creating a PR. The implementation seems rather straightforward, though the project structure is a bit unwieldy. The question is how to name the argument (oob_score like in the RFC?) and what to use as a scoring function.

Another question would be whether we should simply store the bags after fitting, to allow for such post-processing by the user. I am not sure how large the memory overhead would be (I have no Python at hand at the moment to test it).

paulbkoch commented 4 months ago

Hi @DerWeh -- I agree that the overhead of calculating the OOB score would be negligible. For log loss on classification it would be reasonable to make the calculation in python; however, for alternative loss functions, the only place we currently calculate metrics is in C++. For "tweedie_deviance" as an example, we calculate it here:

https://github.com/interpretml/interpret/blob/4d996d56c4ce955b0cdae42ff888a002bbe18aee/shared/libebm/compute/objectives/TweedieDevianceRegressionObjective.hpp#L87

And in the more general case, if someone wants to write their own more optimized metric for a custom loss, it would go here: https://github.com/interpretml/interpret/blob/4d996d56c4ce955b0cdae42ff888a002bbe18aee/shared/libebm/compute/objectives/ExampleRegressionObjective.hpp#L81

There is a pure python way to handle metrics on alternative losses rather than writing a new C++ API. It might even be more desirable, in fact, since it would avoid adding complexity in C++. If you create a new booster object and ask it to apply an empty update, it will at that time calculate the metric on the OOB data for whatever loss function was used.

The python code to do this would share a lot in common with the "boost" function here: https://github.com/interpretml/interpret/blob/develop/python/interpret-core/interpret/glassbox/_ebm/_boost.py

Except that it wouldn't call the C++ function "generate_term_update", and it would instead call set_term_update followed by apply_term_update similar to here: https://github.com/interpretml/interpret/blob/4d996d56c4ce955b0cdae42ff888a002bbe18aee/python/interpret-core/interpret/glassbox/_ebm/_boost.py#L125-L127

Then you'll get back the metric. But this metric will by default still be the approximate log loss. You can get the real log loss by creating the booster with this flag: https://github.com/interpretml/interpret/blob/4d996d56c4ce955b0cdae42ff888a002bbe18aee/python/interpret-core/interpret/utils/_native.py#L26

Which gets passed in here: https://github.com/interpretml/interpret/blob/4d996d56c4ce955b0cdae42ff888a002bbe18aee/python/interpret-core/interpret/glassbox/_ebm/_boost.py#L46

Sorry for the unwieldiness of this section. EBMs are extremely computationally intensive, so we have some complicated aspects that exist for performance reasons.

I'd probably name any such attribute something like "oob_metric". It would be nice to be able to use the same naming as other ML packages, but we use "score" all over the place for the additive values, so it's likely to be less confusing with an alternative name like "metric".

In terms of preserving the bags, I'd probably opt for keeping them out of the model since it's already possible for the caller to achieve the desired effect using the existing bags parameter and probably 2-3 lines of additional code. Keeping the models small/compact and somewhat intelligible in JSON is a plus, so I tend to steer towards preserving simplicity in these cases. I could be convinced otherwise though if there was high demand for preserving them.