predict_proba - Githubissues

imoscovitz / wittgenstein

Ruleset covering algorithms for transparent machine learning

MIT License

95 stars 26 forks source link

predict_proba #2

Closed flamby closed 1 year ago

flamby commented 5 years ago

Hi,

First of all, thanks for this lib. I'm currently evaluating it for very imbalanced data set (continuous).

Is predict_proba possible w/ such ruleset based predictions? i.e. in order to leverage a threshold other than 0.5 to improve precision for instance

imoscovitz commented 5 years ago

Hi flamby,

You're very welcome -- thanks for using it!

Good idea -- predict_proba would be a useful feature. It should be possible -- sklearn implements a DecisionTree predict_proba as "the fraction of samples of the same class in a leaf." https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict_proba

I'm working through and prototype-coding what the statistically-sound equivalent would be for a Ruleset. Will try to give you an update on how it's going or a package update sometime next week.

Otherwise, how is it working out for you so far?

Thanks, -Ilan

flamby commented 5 years ago

Hi @imoscovitz,

Good to hear that it is achievable ;-)

I'm currently grid-searching with IREP the best n_discretize_bins & prune_size combo, using celery tasks to mitigate the current monothreading of the algorithm. RIPPER is too slow for my current hardware setup to be used. But on a data subset, I too had better result with it, around 3 or 4 points more in precision compared to IREP. I'm also evaluating skope-rules

Thanks and keep the good work!

imoscovitz commented 5 years ago

Hi flamby,

The core of predict_proba is finished. I was worried it would be both time and memory-intensive, but I found a simple, fast solution and also double-checked its validity with a mathematician. At this point I'm mostly testing and deciding how to handle edge cases -- update should be out this week!

That's interesting -- are you noticing a performance difference with different grid parameters? (And are you using sklearn's GridSearchCV, hopefully with no hitches?) I may add a more sophisticated discretization algorithm if the binning seems to be important for performance.
RIPPER's optimization stage (the k parameter) is baroque and computationally painful. There are some speedups I can try, but I'm also tempted to devise a parameter that allows you to approximate that step in order to bring its Big-O back in-line with IREP's. (You might be able to capture some of the performance boost without as much slowdown by setting k=0.)
How do you like celery and skope?

Thanks! Ilan

flamby commented 5 years ago

Hi @imoscovitz

Once your predict_proba implementation is committed, I should be able to give it a try within one week.

My custom grid search (using celery async tasks, and not GridSearchCV) is still running ;-) From partial results, different n_discretize_bins do matter a lot, but not pruning. At least, w/ my dataset. I'll try ripper w/ k=0 later this week and keep you updated.

Regarding skope, I've not better results for now, but I didn't play with all its parameters/options yet.

I'll give rulefit a try too, after having read its author's book on ML interpretability (very good book BTW on leanpub)

flamby commented 5 years ago

Hi @imoscovitz

I gave ripper w/ k=0 a try, but it's still one or two order of magnitude slower, and strangely, I did not have better results than IREP. But since I had the chance to test it w/ only a data subset, we can't say for sure. Any plan to multiprocess those algorithms? I guess it's not that easy.

imoscovitz commented 5 years ago

Hi @flamby,

I'll open up a dedicated issue for speed optimizations, since it's worth improving and there are several possible strategies.

predict_proba has been committed. (Headsup: if you have a very simple model, it will predict fewer distinct proba values. Also, all negative predictions will have the same proba values (unless you train a separate model to identify negatives)).
I also created a method called _refit_proba. The idea is to allow you to refit the model's probabilities on a second, fresh training set, if you wish, to potentially improve the accuracy of its probas/confidence. It runs fine, but I need to examine more whether refitting makes sense.

Curious to hear what you think about them. Are you able to use the version I committed to github? That way if you have feedback on how it's working out for you, I'd be able to include it in the version update.

Thanks again! Ilan

flamby commented 5 years ago

Hi @imoscovitz

Thanks for the commit.

I haven't tested yet the predict_proba. I will this week. I tested your last commit for the new args max_rules and the 2 others, but it did not improve performance, but I need to grid search them w/ a bigger search space.

Regarding RIPPER w/ k=0 or k=2 or more, I always have worst results than IREP.

I've found a features set (after having done lots of features selection using various methods) that suits well IREP, and will use it for predict_proba, hopping to improve precision ; i've very good recall and quite good precision. Since I've this restricted features set, I'm able to do IREP tests quicker, so I should be able to explore more the search space of all your new parameters.

Regarding _refit_proba, could you give an example usage w/ a synthetic dataset of your choice? I don't quite see how to adapt it to my use case.

Thanks

imoscovitz commented 5 years ago

Hi @flamby,

Great, thank you! Looking forward to hearing what you think of the proba capabilities I added.

A few thoughts that could be helpful for the problems you're working on:

Very glad to hear you're getting good recall and precision. One tip -- while controlling max isn't specifically intended to improve performance, because a ruleset is on the top-level a set of 'ORs', a larger ruleset (in terms of max_rules), and with all things like random_state being equal, should always improve recall, and a smaller one is likely to improve precision and training speed.
That's very interesting that RIPPER sometimes give you worse results than IREP -- RIPPER is a much more sophisticated algorithm -- it's core is IREP with all kinds of performance optimizations layered into it. I'll have to think about why specifically k=0 and k>2 behave that way. Some of the metrics it uses internally are different (and in one case more similar to C4.5 trees), which maybe aren't as suited to your domain. Thanks for letting me know. I wonder, are there any noticeable structural differences such as number of conds per rule or number of rules in the ruleset that RIPPER generates that might be influencing that? Or is it just that it's picking different conditions? (To check, the code is comparing each one's len(clf.ruleset) as well as their clf.ruleset.countconds() / len(clf.ruleset.rules). (FYI, I don't particularly think this will help much, but clf.RIPPER(dl_allowance=64) is an information-based parameter which a smaller number may result in choosing an earlier minimum for stopping new rule generation, rather than blindly choosing max_rules. I'm not super confident in it because I suspect it would be very sensitive to data noise level.)
If grid-searching RIPPER causes you speed issues, you could test RIPPER on a subset of your dataset with clf.RIPPER(verbosity= at least 1) to check which stage is causing you the most pain, speed-wise. I have a speed optimization for RIPPER's model optimization stage I'm considering implementing, which I think could improve that stage's big-O by a factor of the number of rules. I can test this on my own too, but just in case you want to tell me if that's the part that's hurting you the most I can try to prioritize that optimization.
Are you asking to see an example of _refit_proba usage? Here's a summary of the code I used in a test-run:

df = pd.read_csv('datasets/adult.csv') # aka census: https://archive.ics.uci.edu/ml/datasets/adult 
... # train_test_split, etc.

dsize = int(len(train)//.66) # train model on first-third of training set
clf = lw.IREP()
clf.fit(train.head(dsize), class_feat='income', pos_class='>50K', random_state=0)

(array([[0.1474701 , 0.8525299 ],
        [0.1474701 , 0.8525299 ],
        [0.1474701 , 0.8525299 ],
        ...,
        [0.1474701 , 0.8525299 ],
        [0.1474701 , 0.8525299 ],
        [0.74596432, 0.25403568]]),)

clf._refit_proba(test.tail(dsize), min_samples=None, require_min_samples=False) # train probabilities on final third of training set
clf.predict_proba(test)

(array([[0.15483516, 0.84516484],
        [0.15483516, 0.84516484],
        [0.15483516, 0.84516484],
        ...,
        [0.15483516, 0.84516484],
        [0.15483516, 0.84516484],
        [0.72429907, 0.27570093]]),)

refit is an unusual idea, but so far I believe it could make sense to do given enough data -- one analogy might be how random forest makes sure to choose its confidence in each tree (which is like an estimate of model accuracy--i.e. probas) with out of the bag samples so there's no information leak.

Finally, if you don't mind, are there any feature selection techniques that you've found especially helpful? Also, I'd be curious to know what domain and/or dataset length/dimensionality you're working with so I can understand how it's working better.

Thanks so much! I hope that's helpful (and not too many questions :) -Ilan

flamby commented 5 years ago

Hi @imoscovitz

Very glad to hear you're getting good recall and precision. One tip -- while controlling max isn't specifically intended to improve performance, because a ruleset is on the top-level a set of 'ORs', a larger ruleset (in terms of max_rules), and with all things like random_state being equal, should always improve recall, and a smaller one is likely to improve precision and training speed.

Makes sense. I plan to use those max args when I'll have to deal w/ interpretability.

That's very interesting that RIPPER sometimes give you worse results than IREP -- RIPPER is a much more sophisticated algorithm -- it's core is IREP with all kinds of performance optimizations layered into it.

Too early to conclude. I might have opposite results w/ other datasets.

I'll have to think about why specifically k=0 and k>2 behave that way. Some of the metrics it uses internally are different (and in one case more similar to C4.5 trees), which maybe aren't as suited to your domain. Thanks for letting me know. I wonder, are there any noticeable structural differences such as number of conds per rule or number of rules in the ruleset that RIPPER generates that might be influencing that? Or is it just that it's picking different conditions? (To check, the code is comparing each one's len(clf.ruleset) as well as their clf.ruleset.countconds() / len(clf.ruleset.rules).

TBH, I did not dig into the generated rulesets as I should have to understand what's going on under the hood, as I'm busy w/ other things. So let's skip this ripper's discrepancy for now.

(FYI, I don't particularly think this will help much, but clf.RIPPER(dl_allowance=64) is an information-based parameter which a smaller number may result in choosing an earlier minimum for stopping new rule generation, rather than blindly choosing max_rules. I'm not super confident in it because I suspect it would be very sensitive to data noise level.)

I'll test this arg as I'm pretty sure lots of noise in the data I have ;-)

If grid-searching RIPPER causes you speed issues, you could test RIPPER on a subset of your dataset with clf.RIPPER(verbosity= at least 1) to check which stage is causing you the most pain, speed-wise.

I will. Thanks.

I have a speed optimization for RIPPER's model optimization stage I'm considering implementing, which I think could improve that stage's big-O by a factor of the number of rules. I can test this on my own too, but just in case you want to tell me if that's the part that's hurting you the most I can try to prioritize that optimization.

Are you asking to see an example of _refit_proba usage? Here's a summary of the code I used in a test-run:

Yes. That's example is perfect.

Finally, if you don't mind, are there any feature selection techniques that you've found especially helpful? Also, I'd be curious to know what domain and/or dataset length/dimensionality you're working with so I can understand how it's working better.

It's for some financial datasets, so trying to predict random walk to roughly summarize ;-) Lately, I'm using SequentialFeatureSelector from mlextend lib, and also StabilitySelection (w/ less success). I tried a lot of features selection methods (model-based, model-agnostic, etc.), and these two are well suited w/ my datasets.

What I do like a lot w/ IREP, is the fact that, when I add lots of features w/ features engineering, I never have less accuracy (precision/recall) than before. Most other algorithms that I used (tree-based, RNN, etc.) do not like more features, particularly when those features are highly correlated. So maybe after having observed the generated rulesets, I'll have my features selection for free ;-)

Thanks so much! I hope that's helpful (and not too many questions :)

Thanks to you!

imoscovitz commented 5 years ago

It's for some financial datasets, so trying to predict random walk to roughly summarize ;-)

LOL What I do like a lot w/ IREP, is the fact that, when I add lots of features w/ features engineering, I never have less accuracy (precision/recall) than before. Most other algorithms that I used (tree-based, RNN, etc.) do not like more features, particularly when those features are highly correlated.

That's really fascinating and awesome: Trees/RF are supposed to be pretty good with correlated feats, but perhaps with complex/noisy datasets, instability/overfitting can cause feature issues that IREP manages to avoid.

I may implement some feat-selection capabilities down-the-road (0.1.8 or 0.1.9?), for speed and NLP, which can easily have tens of thousands of feats. Yes. That's example is perfect.

Great! Excited to hear how predict_proba and _refit_proba work out for you.

Thanks! Ilan

flamby commented 5 years ago

Hi @imoscovitz,

That's really fascinating and awesome: Trees/RF are supposed to be pretty good with correlated feats, but perhaps with complex/noisy datasets, instability/overfitting can cause feature issues that IREP manages to avoid.

That's my assumption too. Since overfitting is the real devil in ML w/ financial data, RIPPER might overfit like we see for boosting algorithms, contrary to bagging. But I can reproduce that only if starting with lots of features ; otherwise RIPPER is a little better in precision, and much better in recall. So, one could use IREP to do features selection, and then run RIPPER w/ selected features, to save some computing power.

I may implement some feat-selection capabilities down-the-road (0.1.8 or 0.1.9?), for speed and NLP, which can easily have tens of thousands of feats.

Right now, I rely on the following snippet to retrieve the features used in the generated rules : features = list(set([cond.feature for rule in classifier.ruleset_.rules for cond in rule.conds])) Ranking them will be useful for sure. Starting from more than 2k features, and ending up w/ only 16, w/ a good precision and very high recall is impressive. Add to this that the very simple interpretability helps gaining domain knowledge for free.

Great! Excited to hear how predict_proba and _refit_proba work out for you.

I had time to play w/ predict_proba. See by yourself, with IREP, Precision and Recall Scores as a function of the decision threshold for one dataset with 6 different features set (data viz code from this tds article)

I also compared RIPPER vs IREP on a dataset w/ only 19 features :

RIPPER used only one feature, thus created one rule with one condition (left in the diagram below)
IREP used 3 features, created 5 rules with one or two conditions (right)
The naïve conclusion for me is that RIPPER is doing a better job at generalizing.

How do you understand this diagram regarding the decision threshold discontinuity?

Update regarding _refit_proba_, I cannot reproduce. Note: I used IREP. Whatever I do, I always get the very same probabilities than before the refit. I tried to refit the classifier w/ a different dataset as well (mimicking transfer learning so to speak), but the resulting classifier object is still the same (compared a copy of classifier before, and after the refit), and the generated probabilities too. BTW, is there any reason you return a tuple containing the probabilities array? sklearn api does not do that iirc.

Could this _refit_proba be sensitive to the data?

Whatever the reason (my implementation, my data, etc.), the idea is compelling. Since I've good confidence in the generalization of this algorithm, I'll keep for now training a classifier per dataset.

Keep the good work! And I'll be happy to test the perf improvement.

imoscovitz commented 5 years ago

Oh no! I had responded to this comment but must have forgotten to hit comment and didn't see your update :/ Apologies!

The discontinuity in your RIPPER graph could have to do with the fact that it's a small model -- small models will predict very few distinct probabilities because they treat examples covered by the same combination of rules in the same way. (The number of possible probas is going to be at most the same as the number of possible rule combinations in your model, (+1 for negative, which will always get the same proba unless you train a model that can predict negatives))
The classifier object shouldn't change when you recalibrate because it doesn't retrain the model, but the probabilities should change.
Refit works for me -- I tend to get slightly different probabilities. One possibility: did refit generate a warning? It's a little tricky, so I tried to explain in the docstring, but if there aren't enough samples to recalibrate each rule, it will not recalibrate any of them so they're not out of whack with each other. To get around this, you can set the minimum number of examples refit will accept with the param min_samples, or tell it to go through with refitting the ones with enough samples and ignore those that don't, using require_min_samples. Is your refitting getting tripped up by that? It should generate a massive warning.
Just published the updates to pypi, which you can use to pip upgrade wittgenstein. Renamed _refit_proba as recalibrate_proba, and added more to the proba docstrings.
Let me know how well it works out for you -- if there are issues, happy to bugfix.

Thanks! Ilan

flamby commented 5 years ago

Oh no! I had responded to this comment but must have forgotten to hit comment and didn't see your update :/

No problem. You could have been busy too ;-)

The discontinuity in your RIPPER graph could have to do with the fact that it's a small model -- small models will predict very few distinct probabilities because they treat examples covered by the same combination of rules in the same way. (The number of possible probas is going to be at most the same as the number of possible rule combinations in your model, (+1 for negative, which will always get the same proba unless you train a model that can predict negatives))

I see. Indeed, I have very simple generated model, which I like a lot since it's good sign for not overfitting.

Something that puzzle me is that, if I use predict_proba w/ a 0.5 decision threshold, I don't get the same result (precision, recall in my case) than w/ predict. But I guess it's due to the very nature of the way you have to compute those probabilities w/ rule-based algorithms, right?

So I rely on the following snippet to guess the best threshold (minus 1e-5) using sklearn precision_recall_curve

def get_best_score(y_test, probas):
    precisions, recalls, thresholds = precision_recall_curve(y_test, probas[:, 1])
    best_precision, best_recall, best_threshold, best_index = 0, 0, 0, None
    for i, _p in enumerate(precisions):
        if recalls[i]:
            if (not best_precision and _p) or _p > best_precision:
                best_precision = _p
                best_recall = recalls[i]
                best_threshold = thresholds[i]
                best_index = i

    return {"precision": best_precision, "recall": best_recall,
            "threshold": best_threshold, "index": best_index}

The classifier object shouldn't change when you recalibrate because it doesn't retrain the model, but the probabilities should change.

I see. Thanks for the clarification.

Refit works for me -- I tend to get slightly different probabilities. One possibility: did refit generate a warning? It's a little tricky, so I tried to explain in the docstring, but if there aren't enough samples to recalibrate each rule, it will not recalibrate any of them so they're not out of whack with each other. To get around this, you can set the minimum number of examples refit will accept with the param min_samples, or tell it to go through with refitting the ones with enough samples and ignore those that don't, using require_min_samples. Is your refitting getting tripped up by that? It should generate a massive warning.

It does not generate warnings. So I'll give it a try again. I must have implemented it badly.

Just published the updates to pypi, which you can use to pip upgrade wittgenstein. Renamed _refit_proba as recalibrate_proba, and added more to the proba docstrings.

Thanks.

Any progress on the optimization front?

imoscovitz commented 5 years ago

Hi flamby,

No problem. You could have been busy too ;-)

That's also true -- I've been applying to jobs ;-)

Something that puzzle me is that, if I use predict_proba w/ a 0.5 decision threshold, I don't get the same result (precision, recall in my case) than w/ predict.

That's definitely possible for recalibrated models (the docstrings for recalibration talk about it a little bit because I thought it would be a little unexpected for people), but I hadn't considered it could happen for original ones on IREP. RIPPER could be another story. Is this happening after you recalibrated probas, or are you predicting off the original; and is it with IREP or RIPPER?

You can also access what's going on by running [(r, r.class_freqs) for r in clf.ruleset_.rules] (and clf.ruleset_.uncovered_class_freqs for the neg probability) -- those are the numbers that the model stores for generating probabilities after training and after recalibrating.

I rely on the following snippet to guess the best threshold (minus 1e-5) using sklearn precision_recall_curve

Are you evaluating with area under the precision recall curve, eyeballing what graph looks best to you, or making a domain-specific choice?

It does not generate warnings. So I'll give it a try again.

Hmm, yeah let me know. If you're recalibrating with a different training set from the one you used to train your model, either the probabilities should almost certainly change, or it should throw warnings if it can't find enough samples to work with. If neither happened (and you haven't turned off warnings), then there's a problem. Either way, it sounds like it could be made clearer.

Any progress on the optimization front?

Yes! :) I'll make a quick post to the optimization issue.

flamby commented 5 years ago

Hi @imoscovitz

That's definitely possible for recalibrated models (the docstrings for recalibration talk about it a little bit because I thought it would be a little unexpected for people), but I hadn't considered it could happen for original ones on IREP. RIPPER could be another story. Is this happening after you recalibrated probas, or are you predicting off the original; and is it with IREP or RIPPER?

It is IREP, and with or without recalibration.

You can also access what's going on by running [(r, r.class_freqs) for r in clf.ruleset_.rules] (and clf.ruleset_.uncovered_class_freqs for the neg probability) -- those are the numbers that the model stores for generating probabilities after training and after recalibrating.

Interesting. So this is where the magic happens.

Are you evaluating with area under the precision recall curve, eyeballing what graph looks best to you, or making a domain-specific choice?

For domain-specific choice. Basically, precision is über important in my case, so the decision threshold helps improving it, without degrading recall as much as with other algorithm (DecisionTree, RandomForest or worst : boosting algorithms).

Hmm, yeah let me know. If you're recalibrating with a different training set from the one you used to train your model, either the probabilities should almost certainly change, or it should throw warnings if it can't find enough samples to work with. If neither happened (and you haven't turned off warnings), then there's a problem. Either way, it sounds like it could be made clearer.

I think I fixed the issue since I now have slightly different probabilities. However predict result is always the same and I have lots of NaN in the probabilities after recalibration :

np.argwhere(np.isnan(probas))
>>>array([], shape=(0, 2), dtype=int64)

np.argwhere(np.isnan(probas_with_refit))
>>>array([[ 177,    0],
>>>       [ 177,    1],
[...]

I guess it's not the expected behavior, right? ;-)

imoscovitz commented 5 years ago

Interesting. So this is where the magic happens.... However predict result is always the same and I have lots of NaN in the probabilities after recalibration... I guess it's not the expected behavior, right? ;-)

Haha, hopefully not :) What's your class balance? Are you only getting NaNs after recalibration, or before too? And what parameters are you using for recalibrate?
A little background on predict/predict_proba: The Rule.class_freqs and proba are things I added in order to make predict_proba calculations; they occur after model training and aren't specifically used during training. Predict result should stay the same as before recalibration -- predict is based on the underlying, trained boolean model, and recalibration doesn't affect the model, only post-hoc probabilities/confidences it tacks on to the different rules/rule combos after training. Using predict proba is a little like each rule is a tree in a random forest, and each rule's probas are the confidence it has in each tree, which it uses to predict proba at the end. When you recalibrate on fresh training data, it's like remeasuring its confidence in each rule/tree with out of the bag data.

flamby commented 5 years ago

Haha, hopefully not :) What's your class balance? Are you only getting NaNs after recalibration, or before too? And what parameters are you using for recalibrate?

class balance :

train["Target"].value_counts(normalize=True)
0    0.541023
1    0.458977

I'm getting NaNs only after recalibration, and using following parameters min_samples=None, require_min_samples=False, discretize=True

But I think I tried the default ones, and discretize=False as well. Anyway, I'll give it another try.

A little background on predict/predict_proba: The Rule.class_freqs and proba are things I added in order to make predict_proba calculations; they occur after model training and aren't specifically used during training. Predict result should stay the same as before recalibration -- predict is based on the underlying, trained boolean model, and recalibration doesn't affect the model, only post-hoc probabilities/confidences it tacks on to the different rules/rule combos after training. Using predict proba is a little like each rule is a tree in a random forest, and each rule's probas are the confidence it has in each tree, which it uses to predict proba at the end. When you recalibrate on fresh training data, it's like remeasuring its confidence in each rule/tree with out of the bag data.

Thanks for the clarification. It's obvious now with this explanation.

flamby commented 5 years ago

Hi @imoscovitz

I've been able to nail down that the NaNs are generated only when using IREP and not RIPPER, on a very small subset of the IREP models I generated (by thousands).

I can reproduce it at will on some of these IREP models, even for predict_proba regular inference (i.e. without recalibration).

My first guess is that it could be related to the fact that no rows in the inference data match the IREP rules, leading to NaN in the probabilities, whereas we should expect simply zeros as probabilities for the positive class, and that I've just been lucky that my RIPPER models - being more complex probably - match the rules.

If that's so, it's just a little bug when no rules match the data.

Do you think the way you compute probabilities could lead to a such behaviour?

Thanks!

imoscovitz commented 5 years ago

Thanks, @flamby,

I was able to reproduce what you said. The nan's seem to be appearing when the training set has no positive examples during training, or, for recalibrate, when there are no positive examples for a particular rule. The fact that you're running lots of classifiers on an imbalanced dataset is fantastic -- it's a great edge case that revealed the two problems.

I have some ideas for how we might want to handle these cases, but would love to hear what you think makes sense.

Training on a single class doesn't really make sense for this particular algorithm (training is entropy-based... and there's no entropy when there is only one class :)
However, we probably don't want to just throw an error and crash someone who is 10,000 models into their ensemble or some long process.
Here's what I'd propose:

1) For training sets with only pos examples: .fit(): representation of the ruleset: [True] .predict(): everything is pos .predict_proba(): 1.0 for everything And throw a warning for each of these functions

2) For training sets with only neg examples: .fit(): empty set of rules [] .predict(): everything is neg .predict_proba(): all 0.0 And throw a warning for each of these functions

3) For proba recalibration (and initial calibration)

The min_samples and require_min_samples are supposed to be guardrails against this type of thing, but we do need to decide what to do if people remove the guardrails and things break down and for initial calibration.
Initial calibration: if the trainset is all pos, calibrate to 1.0 if the trainset is all neg, calibrate to 0.0
Recalibration: even if someone wants to set minimum number of samples to 0, require at least 1 sample for updating the rule in question

Would love to hear what you think.

Thanks, Ilan

flamby commented 5 years ago

Hi @imoscovitz

The training sets I use have always the two classes as I check that prior to fitting. I was referring to a kind of bug like this one in sklearn or this one, where during a split, only samples with one class are used (for tree-based algo).

BTW I would not have only a few NaN in the predict_proba result array like I do if it were due to a training set with only one class, don't you think?

I was able to reproduce what you said. The nan's seem to be appearing when the training set has no positive examples during training, or, for recalibrate, when there are no positive examples for a particular rule. The fact that you're running lots of classifiers on an imbalanced dataset is fantastic -- it's a great edge case that revealed the two problems.

I have some ideas for how we might want to handle these cases, but would love to hear what you think makes sense.

Training on a single class doesn't really make sense for this particular algorithm (training is entropy-based... and there's no entropy when there is only one class :)

However, we probably don't want to just throw an error and crash someone who is 10,000 models into their ensemble or some long process.

Anyway, I agree, for people who don't check their training set classes, it's better to not generate NaNs.

Here's what I'd propose:

For training sets with only pos examples: .fit(): representation of the ruleset: [True] .predict(): everything is pos .predict_proba(): 1.0 for everything And throw a warning for each of these functions

That makes sense.

For training sets with only neg examples: .fit(): empty set of rules [] .predict(): everything is neg .predict_proba(): all 0.0 And throw a warning for each of these functions

Seems good to me too.

For proba recalibration (and initial calibration)

The min_samples and require_min_samples are supposed to be guardrails against this type of thing, but we do need to decide what to do if people remove the guardrails and things break down and for initial calibration.

Initial calibration: if the trainset is all pos, calibrate to 1.0 if the trainset is all neg, calibrate to 0.0

Recalibration: even if someone wants to set minimum number of samples to 0, require at least 1 sample for updating the rule in question

Seems good too.

Another topic, but quite related to the way IREP or RIPPER are working : sklearn just implemented a new boosting algorithm that is worth looking : HistGradientBoostingClassifier. It does binning automatically too. I tested it and it generates better precision (which is quite logical for a boosting algo) than IREP or RIPPER but a much much lower recall. So I tried ensembling it w/ IREP but faced the error I raised in this issue due to the added column after fit is invoked.

Thanks!

imoscovitz commented 5 years ago

The training sets I use have always the two classes as I check that prior to fitting. I was referring to a kind of bug like this one in sklearn or this one, where during a split, only samples with one class are used (for tree-based algo).

Ah, yes, that's what I meant -- when a split (in our case the original train_test_split) results in a training set that only has one class.

BTW I would not have only a few NaN in the predict_proba result array like I do if it were due to a training set with only one class, don't you think?

I think if the entire training set only has negatives, it's giving NaNs for everything. I think it's when the recalibrationset has no examples for a particular rule (and the guardrails are off) that only some NaNs get generated. Is that what you're finding too?

sklearn just implemented a new boosting algorithm that is worth looking : HistGradientBoostingClassifier. It does binning automatically too

That's interesting -- I'll take a closer look. The binning part that I wrote is rudimentary and greedy, though on a few cases I looked at, it seems to do a better job of generating even bins than pandas does. It calculates the average size, begin creating a bin with that size, then adds members until you hit a new unique value and begin the next bin. (Also, sometimes this results in a smaller number of bins than the parameter specifies). A search or divide-and-conquer search might yield more even bins. (There are also entropy-based bin algos, but I think we'll leave those off for now.) Reworking the binner is something I keep thinking about doing, but it feels like much less of a priority than other things. Do you have any feelings about the binning stage?

Thanks!

flamby commented 5 years ago

Hi @imoscovitz

The training sets I use have always the two classes as I check that prior to fitting. I was referring to a kind of bug like this one in sklearn or this one, where during a split, only samples with one class are used (for tree-based algo).

Ah, yes, that's what I meant -- when a split (in our case the original train_test_split) results in a training set that only has one class.

Sorry. I was not crystal clear. both X_train and X_test contain the two classes. I was refering to a fact that the model's rules could not apply to data. Is that possible? or if no rules match the data, then it's end up with only zeros probabilities? And this rules that don't match reminded me some cases with tree-based algorithms for which during a split by one tree, that tree inherit only of a subset that contains only one class. Does it make sense now?

BTW I would not have only a few NaN in the predict_proba result array like I do if it were due to a training set with only one class, don't you think?

That's interesting -- I'll take a closer look. The binning part that I wrote is rudimentary and greedy, though on a few cases I looked at, it seems to do a better job of generating even bins than pandas does. It calculates the average size, begin creating a bin with that size, then adds members until you hit a new unique value and begin the next bin. (Also, sometimes this results in a smaller number of bins than the parameter specifies). A search or divide-and-conquer search might yield more even bins. (There are also entropy-based bin algos, but I think we'll leave those off for now.) Reworking the binner is something I keep thinking about doing, but it feels like much less of a priority than other things. Do you have any feelings about the binning stage?

I do think that part of the good generalization is in fact due to the fact that your binning mechanism is somehow elastic (aka up to max_bins) and not a fixed one. I've always seen this kind of binning w/ other algos. That must mean something ;-)

imoscovitz commented 5 years ago

Sorry. I was not crystal clear. both X_train and X_test contain the two classes. I was refering to a fact that the model's rules could not apply to data. Is that possible? or if no rules match the data, then it's end up with only zeros probabilities? And this rules that don't match reminded me some cases with tree-based algorithms for which during a split by one tree, that tree inherit only of a subset that contains only one class. Does it make sense now?

I think so. Are you asking if a model won't be able to learn values in X_test that aren't present in X_train? And if so, and there are no X_test values that are included in the model, it will always predict negative?

I do think that part of the good generalization is in fact due to the fact that your binning mechanism is somehow elastic (aka up to max_bins) and not a fixed one. I've always seen this kind of binning w/ other algos. That must mean something ;-)

Interesting. That's worth thinking about and exploring. I think it's worth looking at the binning more closely since 1) it's kindof strange -- but hopefully like you said good! since fewer unique values should theoretically lead to fewer bins --, and 2) more-importantly slow. For large numeric datasets binning actually can take most of the training time, which (probably for a future update) we could profile and figure out where to speed up the implementation.

flamby commented 5 years ago

I think so. Are you asking if a model won't be able to learn values in X_test that aren't present in X_train? And if so, and there are no X_test values that are included in the model, it will always predict negative?

No. I mean that somehow, the generated rules do not apply to the test set, perhaps out-of-scope bins.

I do think that part of the good generalization is in fact due to the fact that your binning mechanism is somehow elastic (aka up to max_bins) and not a fixed one. I've always seen this kind of binning w/ other algos. That must mean something ;-)

Interesting. That's worth thinking about and exploring. I think it's worth looking at the binning more closely since 1) it's kindof strange -- but hopefully like you said good! since fewer unique values should theoretically lead to fewer bins --, and 2) more-importantly slow. For large numeric datasets binning actually can take most of the training time, which (probably for a future update) we could profile and figure out where to speed up the implementation.

As for me, performance for now is good enough. If someday you can achieve multi-threading with complementary libs to pandas like Ray or Modin, then it will be good enough for huge datasets as well.

flamby commented 5 years ago

Hi @imoscovitz

I've seen some changes in your last commit that could fix the NaN issue I get sometimes. I'll give it a try in the coming days.

Thanks

imoscovitz commented 5 years ago

I've seen some changes in your last commit that could fix the NaN issue I get sometimes. I'll give it a try in the coming days.

Yup! NaNs should be taken care of.
Also took care of that related bug having to do with out-of-bin-range.
I'll reverse proba class order like you suggested.

Working on fixing up recalibrate so that it can take flexible and mixed input data formats, just like .fit now does.

flamby commented 5 years ago

I've seen some changes in your last commit that could fix the NaN issue I get sometimes. I'll give it a try in the coming days.

Yup! NaNs should be taken care of.

Also took care of that related bug having to do with out-of-bin-range.

I'll reverse proba class order like you suggested.

Working on fixing up recalibrate so that it can take flexible and mixed input data formats, just like .fit now does.

Great!

I'll look at the next commits carefuly, to rectify the proba class order in my code as well ;-)

imoscovitz commented 5 years ago

FYI, the latest commit reverses the proba order :)

flamby commented 5 years ago

@imoscovitz I think you can close this issue. I'll reopen it if I reproduce the NaN issue.