A few quick concerns:

1. Not every classifier also implements a predict_proba method. Most of them do, but I remember that it's not a 100% guarantee. This is not a reason not to move in this direction, but we should try and catch this.
2. Are we sure that adding an argument to predict() won't break pipelines? I recall sklearn working on that, but it does feel like a "new" feature. Can't we pass this as a parameter to ZeroInflatedRegressor instead?

Hi @koaning,

Thanks for answering in such a short time.
We can use:

```python
from sklearn.utils import all_estimators

# List every scikit-learn estimator that implements predict_proba.
estimators = all_estimators()
for name, class_ in estimators:
    if hasattr(class_, "predict_proba"):
        print(name)
```

to get which ones do or do not have predict_proba(). Alternatively, we can just raise a warning if we cannot find the attribute on self.classifier.
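The warning variant could be as simple as this (a hypothetical helper; the name and message are mine):

```python
import warnings


def _check_predict_proba(classifier):
    # Hypothetical guard: warn instead of failing hard when the
    # classifier cannot produce probabilities.
    if not hasattr(classifier, "predict_proba"):
        warnings.warn(
            "Classifier does not implement predict_proba(); "
            "risk predictions will not be available."
        )
```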
BR E
Adding my two cents on the discussion:

> Not every classifier also implements a predict_proba method.

This can be overcome by checking the availability of the .predict_proba() method with sklearn.utils.metaestimators.available_if.

> Can't we pass this as a parameter to ZeroInflatedRegressor?

Should this "process" be implemented as predict_proba, or as another (better named) method of the class, instead of in predict itself?

> risk predictions (probability * expected impact)

The "probabilities" out of predict_proba are not always actual probabilities, and they should be calibrated.
Hi @FBruzzesi,

Thanks for pointing out available_if().

Regarding the calibration, the whole point of adding the predict_proba() method is to actually use ZeroInflatedRegressor() with CalibratedClassifierCV(). In order to calibrate a classifier with scikit-learn, the decision_function() or predict_proba() methods need to be implemented.
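For illustration, here is one possible composition (a minimal sketch of the "pass a calibrated classifier" variant; the specific estimators are just placeholders):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
from sklego.meta import ZeroInflatedRegressor

# Calibrate the classifier's probabilities before the meta-estimator uses them.
calibrated_clf = CalibratedClassifierCV(
    ExtraTreesClassifier(random_state=0), method="isotonic", cv=3
)

zir = ZeroInflatedRegressor(
    classifier=calibrated_clf,
    regressor=ExtraTreesRegressor(random_state=0),
)
# zir.fit(X, y) then fits the calibrated classifier internally.
```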
BR E
I probably wasn't clear in my reasoning. Suppose you have a ZeroInflatedRegressor that has a classifier which implements .predict_proba(). If I interpret this correctly:

> I propose that we implement a predict_proba() for the classifier and then an argument to the predict() function to choose whether we want absolute predictions or risk predictions (probability * expected impact).

The predict_proba() of ZeroInflatedRegressor is just an alias for the internal classifier's method (we can use the available_if decorator to make sure that it exists).

What I am not fully on board with is returning (probability * expected impact) without calibrating first, as this may lead to unreasonable "probabilities".

Similarly, the other risk I see is having the regressor predict (expected impact) on out-of-sample data (namely the subset predicted to be 0).
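A minimal sketch of the aliasing idea (the helper name and class skeleton are mine, loosely following sklearn's usual pattern for meta-estimators):

```python
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.metaestimators import available_if


def _classifier_has(attr):
    # Check used by available_if: expose the method on the
    # meta-estimator only when the inner classifier supports it.
    def check(self):
        return hasattr(self.classifier, attr)
    return check


class ZeroInflatedRegressor(BaseEstimator, RegressorMixin):
    # fit/predict omitted; only the aliasing is sketched here.
    def __init__(self, classifier, regressor):
        self.classifier = classifier
        self.regressor = regressor

    @available_if(_classifier_has("predict_proba"))
    def predict_proba(self, X):
        # Pure delegation to the fitted internal classifier.
        return self.classifier_.predict_proba(X)
```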
Hi @FBruzzesi,

As you mentioned, we cannot do probability x expected impact if the classifier is not calibrated. So let's split this in two.

Regarding:

> Similarly, the other risk I see is having the regressor predict (expected impact) on out-of-sample data (namely the subset predicted to be 0).

Would you mind explaining in more detail?
Sure! Let me try to write some pseudo code for what the idea could be:
```python
class ZeroInflatedRegressor(...):

    def __init__(self, classifier, regressor, predict_risk=False):
        ...

    def fit(self, X, y):
        # the usual checks
        ...
        if self.predict_risk:  # (and the classifier allows predict_proba)
            # train calibrated classifier
            ...
        else:
            # train classifier
            ...
        non_zero_indices = np.where(self.classifier_.predict(X) == 1)[0]
        # train regressor on X[non_zero_indices], y[non_zero_indices]
        ...

    def predict(self, X):
        # the usual checks
        ...
        if self.predict_risk:
            # proba * expected_impact
            ...
        else:
            # current implementation
            ...
```
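For concreteness, the proba * expected_impact branch might boil down to something like this (a hypothetical helper, assuming a binary classifier whose column 1 is the non-zero class):

```python
def _predict_risk(classifier, regressor, X):
    # Hypothetical sketch of the predict_risk=True branch:
    # P(non-zero) times the regressor's expected impact, element-wise.
    proba_non_zero = classifier.predict_proba(X)[:, 1]
    expected_impact = regressor.predict(X)
    return proba_non_zero * expected_impact
```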
However, notice how the regressor is fitted only on those values that are predicted to be non-zero. As the documentation states as well:

> The regressor is only trained on examples where the target is non-zero, which makes it easier for it to focus.

Yet I imagine that expected_impact = regressor.predict(X), which also includes values that are predicted to be zero.

Now this is the core point of the issue: being able to have a prediction for such values. I just want to point out that they are out-of-sample, meaning not seen during training.

I hope this makes sense, please let me know your thoughts!
Hi @FBruzzesi, I think I got you.

However, what is the exact "problem" with this? I understand that expected_impact = regressor.predict(X) and that we will predict for all of the indices (no matter whether X_n is below the classifier's threshold), but the final prediction will still be weighted by the classifier's probability prediction, which for an X_n that should be 0 will technically be close to 0.

If I am missing something fundamental, please let me know.
BR E
Oooh, you are right! If the model is good, then these expected_impact values should be multiplied by tiny probabilities and get close to zero!

For the interface, as @koaning suggested, to be compatible with scikit-learn pipelines and cross-validation classes, passing an argument to the ZeroInflatedRegressor should be the preferred way to go.

Let us know if you'd like to pick it up!
Hey everyone, I am coming back to this and thinking about how to name the method.

I don't think predict_proba would be a good choice, as scikit-learn would expect something completely different, and at the end of the day the ZeroInflatedRegressor is a regressor, not a classifier.

Options I have in mind could be:

- score_samples
- predict_risk
- expected_impact

@koaning thoughts on this?
Hi @FBruzzesi, thanks for picking this discussion back up.

I went silent because our team opened a PR to scikit-learn to have a better metric to calibrate classifiers, but even though all tests passed and we added all the documentation, we haven't gotten any feedback in 4 months :(

I leave it here in case someone finds it useful: https://github.com/scikit-learn/scikit-learn/pull/27997

We wanted to have this merged to be able to combine it with this potential feature in scikit-lego.

Regarding your method naming, to me predict_risk is the most intuitive one. But of course that might be because of my industry bias (finance).
I am leaning towards score_samples because that already exists in scikit-learn land. But I am worried about folks misinterpreting it ... I guess that's a documentation problem though.
Please explain clearly what you'd like to see added.
In any risk estimation task, the risk is usually calculated as the probability of the situation happening x the expected value of the situation happening. For instance, one can calculate the probability of a natural disaster (a tornado) happening and the economic damage that this tornado would cause (e.g. $1M). Usually these kinds of problems are solved via 2-part models: https://www.sciencedirect.com/science/article/pii/S2212094718301385. A predicted property damage of $100,000 with a tornado probability of 0.96 is not the same as a predicted property damage of $401,000 with a probability of 0.1.
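A quick back-of-the-envelope check with the numbers above:

```python
# risk = P(event) * expected impact
risk_a = 0.96 * 100_000  # 96,000: likely tornado, moderate damage
risk_b = 0.10 * 401_000  # 40,100: unlikely tornado, large damage

# Scenario B predicts four times the damage, yet carries less than
# half the risk of scenario A.
```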
Currently, when predicting with the ZeroInflatedRegressor, the prediction is made via:

```python
np.where(self.classifier_.predict(X))[0]
```
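In other words, roughly (a hypothetical paraphrase of the behaviour, not the library's exact source):

```python
import numpy as np


def zir_predict(classifier, regressor, X):
    # Paraphrase of the current behaviour: only rows the classifier
    # flags as non-zero get a regressor prediction; the rest stay 0.
    output = np.zeros(X.shape[0])
    non_zero_indices = np.where(classifier.predict(X))[0]
    if non_zero_indices.size > 0:
        output[non_zero_indices] = regressor.predict(X[non_zero_indices])
    return output
```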
I propose that we implement a predict_proba() for the classifier and then an argument to the predict() function to choose whether we want absolute predictions or risk predictions (probability * expected impact).

This would actually make the ZeroInflatedRegressor compatible with the new CalibratedClassifierCV.

I guess the alternative would be to pass an already-fitted calibrated classifier to the ZeroInflatedRegressor.