koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License

[FEATURE] Implement predict proba in ZeroInflatedRegressor() #585

Closed edgBR closed 4 months ago

edgBR commented 1 year ago

Please explain clearly what you'd like to see added.

In any risk estimation task, risk is usually estimated as the probability of an event happening times the expected value (impact) of that event. For instance, one can estimate the probability of a natural disaster (a tornado) happening and the economic damage that tornado would do, e.g. 1M$. These kinds of problems are usually solved via two-part models: https://www.sciencedirect.com/science/article/pii/S2212094718301385. A predicted property damage of 100000$ with a tornado probability of 0.96 is not the same as a predicted property damage of 401000$ with a probability of 0.1.
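In other words (with these made-up numbers), the expected loss would be 0.96 x 100000$ = 96000$ in the first case versus 0.1 x 401000$ = 40100$ in the second, even though the raw damage prediction is four times larger.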

Currently, the prediction in the ZeroInflatedRegressor is made via:

np.where(self.classifier_.predict(X))[0]

I propose that we implement a predict_proba() that exposes the classifier's probabilities, plus an argument to the predict() function to choose whether we want absolute predictions or risk predictions (probability * expected impact).
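Roughly, what I have in mind for the risk flavour (just a sketch; predict_proba on the wrapper and the switch in predict() are the proposal, not existing API):

proba_non_zero = self.classifier_.predict_proba(X)[:, 1]  # probability of a non-zero outcome
expected_impact = self.regressor_.predict(X)  # predicted magnitude of the outcome
risk = proba_non_zero * expected_impact  # probability * expected impact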

This would also make the ZeroInflatedRegressor compatible with the new CalibratedClassifierCV.

I guess the alternative would be to pass an already fitted calibrated classifier to the ZeroInflatedRegressor.

koaning commented 1 year ago

A few quick concerns.

  1. Not every classifier also implements a predict_proba method. Most of them do, but I remember it's not a 100% guarantee. This is not a reason not to move in this direction, but we should try to catch it.
  2. Are you sure that adding an argument to predict() won't break pipelines? I recall sklearn working on that, but it does feel like a "new" feature. Can't we pass this as a parameter to ZeroInflatedRegressor?
edgBR commented 1 year ago

Hi @koaning

Thanks for answering so quickly.

  1. Regarding the first point, you are right. Not all classifiers predict probabilities. But we can use:
from sklearn.utils import all_estimators

estimators = all_estimators()

# print every scikit-learn estimator that exposes predict_proba
for name, class_ in estimators:
    if hasattr(class_, 'predict_proba'):
        print(name)

To see which ones have predict_proba() and which don't, or we can just raise a warning if the attribute cannot be found on self.classifier (see the sketch after this list).

  2. You might be right; as you pointed out, passing it as a parameter when we initialize the ZeroInflatedRegressor might be a better idea.
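For point 1, the warning could be as simple as this (just a sketch, the helper name is made up):

import warnings

def warn_if_no_predict_proba(classifier):
    # warn when the supplied classifier cannot produce probabilities
    if not hasattr(classifier, "predict_proba"):
        warnings.warn(
            f"{classifier.__class__.__name__} does not implement predict_proba, "
            "so risk predictions will not be available."
        )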

BR E

FBruzzesi commented 1 year ago

Adding my two cents on the discussion:

Not every classifier also implements a predict_proba method.

This can be overcome by checking the availability of the .predict_proba() method with sklearn.utils.metaestimators.available_if.
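For example, roughly like this (a sketch on top of the class, with fit omitted; _classifier_has is just an illustrative helper):

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.metaestimators import available_if

def _classifier_has(attr):
    # the check receives the meta-estimator instance and looks at its (fitted) classifier
    def check(self):
        model = getattr(self, "classifier_", self.classifier)
        return hasattr(model, attr)
    return check

class ZeroInflatedRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, classifier, regressor):
        self.classifier = classifier
        self.regressor = regressor

    # predict_proba only exists when the underlying classifier exposes it
    @available_if(_classifier_has("predict_proba"))
    def predict_proba(self, X):
        return self.classifier_.predict_proba(X)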

Can't we pass this as a parameter to ZeroInflatedRegressor?

Should this "process" be implemented as predict_proba or another (better named) method of the class instead of in predict itself?

risk predictions (probability * expected impact).

The "probabilities" out of predict_proba are not always actual probabilities, and they should be calibrated.

edgBR commented 1 year ago

Hi @FBruzzesi,

Thanks for pointing out available_if().

Regarding the calibration: the whole point of adding the predict_proba() method is to actually be able to use CalibratedClassifierCV() together with ZeroInflatedRegressor(). In order to calibrate a classifier with scikit-learn, the decision_function() or predict_proba() methods need to be implemented.
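For instance, the kind of combination I have in mind (a sketch with arbitrary estimators; X_train / y_train stand for whatever zero-inflated data you have):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
from sklego.meta import ZeroInflatedRegressor

model = ZeroInflatedRegressor(
    classifier=CalibratedClassifierCV(ExtraTreesClassifier(), method="isotonic"),
    regressor=ExtraTreesRegressor(),
)
model.fit(X_train, y_train)  # the internal classifier is fitted on the zero / non-zero labels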

BR E

FBruzzesi commented 1 year ago

I probably wasn't clear in my reasoning. Suppose you have a ZeroInflatedRegressor whose classifier implements .predict_proba().

If I interpret this correctly:

I propose that we implement a predict_proba() for the classifier and then an argument to the predict() function to choose if we want absolute predictions or risk predictions (probability * expected impact).

The predict_proba() of ZeroInflatedRegressor would just be an alias for the internal classifier's method (we can use the available_if decorator to make sure that it exists).

What I am not fully on board with is returning (probability * expected impact) without calibrating first, as this may lead to unreasonable "probabilities".

Similarly, the other risk I see is having the regressor predict (expected impact) on out-of-sample data (namely the subset predicted to be 0).

edgBR commented 1 year ago

Hi @FBruzzesi,

As you mentioned, we cannot do probability x expected impact if the classifier is not calibrated. So let's split this in two:

  1. predict_proba() is implemented so that it can be combined with a calibrated classifier. In this case we can have a calibrated classifier within the ZeroInflatedRegressor.
  2. The user is in charge of calculating the risk themselves, as right now there is, to my knowledge, no metadata indicating whether a classifier is calibrated or not. If there were a way to check that the classifier is calibrated, we could return the risk prediction; if the classifier is not calibrated, we raise an error (see the sketch below).
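The only check I can think of right now is a naive type check, something like this (just a sketch; it only catches an explicit CalibratedClassifierCV wrapper, not a classifier that happens to be well calibrated):

from sklearn.calibration import CalibratedClassifierCV

def check_calibrated_or_raise(classifier):
    # naive heuristic: only an explicit CalibratedClassifierCV wrapper counts as "calibrated"
    if not isinstance(classifier, CalibratedClassifierCV):
        raise ValueError(
            "Risk predictions require a calibrated classifier; "
            "wrap your classifier in CalibratedClassifierCV first."
        )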

Regarding:

Similarly, the other risk I see is to have a regressor predict (expected impact) on out of sample data (namely the subset predicted to be 0).

Would you mind explaining in more detail?

FBruzzesi commented 1 year ago

Sure! Let me try to write some pseudocode for what the idea could be:

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.calibration import CalibratedClassifierCV
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


class ZeroInflatedRegressor(BaseEstimator, RegressorMixin):

    def __init__(self, classifier, regressor, predict_risk=False):
        self.classifier = classifier
        self.regressor = regressor
        self.predict_risk = predict_risk

    def fit(self, X, y):
        # the usual checks
        X, y = check_X_y(X, y)

        if self.predict_risk:  # (and the classifier allows predict_proba)
            # train a calibrated classifier on the zero / non-zero labels
            self.classifier_ = CalibratedClassifierCV(clone(self.classifier)).fit(X, y != 0)
        else:
            # train the plain classifier on the zero / non-zero labels
            self.classifier_ = clone(self.classifier).fit(X, y != 0)

        # train the regressor only on the samples predicted to be non-zero
        non_zero_indices = np.where(self.classifier_.predict(X) == 1)[0]
        self.regressor_ = clone(self.regressor).fit(X[non_zero_indices], y[non_zero_indices])
        return self

    def predict(self, X):
        # the usual checks
        check_is_fitted(self)
        X = check_array(X)

        if self.predict_risk:
            # proba of being non-zero * expected impact
            return self.classifier_.predict_proba(X)[:, 1] * self.regressor_.predict(X)
        else:
            # current implementation: zero unless the classifier predicts non-zero
            output = np.zeros(X.shape[0])
            non_zero_indices = np.where(self.classifier_.predict(X))[0]
            if non_zero_indices.size > 0:
                output[non_zero_indices] = self.regressor_.predict(X[non_zero_indices])
            return output

However, notice how the regressor is fitted only on those samples that are predicted to be non-zero. As the documentation itself states:

The regressor is only trained on examples where the target is non-zero, which makes it easier for it to focus.

Yet I imagine that expected_impact = regressor.predict(X), which also includes samples that are predicted to be zero.

Now this is the core point of the issue: being able to have a prediction for such values. I just want to point out that they are out of sample, meaning not seen during training.

I hope this makes sense, please let me know your thoughts šŸ˜Š

edgBR commented 1 year ago

Hi @FBruzzesi, I think I get you now.

However, what is the exact "problem" with this? I understand that expected_impact = regressor.predict(X) and that here we will predict for all indexes (no matter whether X_n is below the classifier's threshold), but the final prediction will still be weighted by the classifier's predicted probability, which, for an X_n whose target is truly 0, should itself be close to 0.
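For instance (with made-up numbers), if the regressor predicts an impact of 50000 for one of these samples but the classifier assigns it a probability of only 0.02, the risk prediction becomes 0.02 x 50000 = 1000, i.e. close to 0 as expected.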

If I am missing something fundamental please let me know.

BR E

FBruzzesi commented 1 year ago

Oooh you are right! If the model is good then these expected_impact values should be multiplied by tiny probabilities and get close to zero!

For the interface, as @koaning suggested, passing an argument to the ZeroInflatedRegressor should be the preferred way to go in order to stay compatible with scikit-learn Pipeline and cross-validation classes.

Let us know if you'd like to pick it up šŸ˜Š

FBruzzesi commented 5 months ago

Hey everyone, I am coming back to this and thinking about how to name the method.

I don't think predict_proba would be a good choice, as scikit-learn would expect something completely different, and at the end of the day the ZeroInflated is a regressor, not a classifier.

Options I have in mind could be predict_risk or score_samples.

@koaning thoughts on this?

edgBR commented 5 months ago

Hi @FBruzzesi, thanks for picking this discussion back up.

I went silent because our team opened a PR to scikit-learn to add a better metric for calibrating classifiers, but even though all tests passed and we added all the documentation, we haven't gotten any feedback in 4 months :(

I'll leave it here in case someone finds it useful: https://github.com/scikit-learn/scikit-learn/pull/27997

We wanted to have this merged so we could combine it with this potential feature in scikit-lego.

Regarding your method naming, to me predict_risk is the more intuitive one, but of course that might be because of my industry bias (finance).

koaning commented 5 months ago

I am leaning towards score_samples because that already exists in scikit-learn land. But I am worried about folks misinterpreting it... I guess that's a documentation problem though.