imoscovitz / wittgenstein

Ruleset covering algorithms for transparent machine learning
MIT License

Inference methods needing only the rules' features rather than all the features input during training #6

Closed flamby closed 5 years ago

flamby commented 5 years ago

Hi @imoscovitz

Right now, I'm training IREP or RIPPER on up to 2 or 3k features. In the end, the generated rules tend to use only 30 features max. In my case, that amount of feature engineering leads, once dropna is invoked, to up to 10% of rows being removed, sometimes more when engineering very long moving averages.

This is perfectly fine for training - no other choice ;-) - but not very practical for inference: _if I generate a dataframe with only the rules' features, the predict and predict_proba methods aren't happy about it, since the model seems to have hard-coded the desired columns to be the ones from training._

| Step | # of input features needed | Comment |
| --- | --- | --- |
| fit | 1000 | ~30 features end up used in the rules |
| predict (current behavior) | 1000 | a dataframe with all 1000 features must be generated, even though only ~30 are used by the inference method |
| predict (expected behavior) | ~30 | more compact data for inference and fewer dropped rows in my case ;-) |

Do you think it's achievable? That would mean the columns attribute would keep listing the fitted features, but another attribute would list the columns used by the rules, and the latter would be used by the inference methods. Keeping the fitted features in the model is probably a good idea for reproducibility.

Also, being able to purge (so to speak) the unused hard-coded features from a fitted classifier would be great as well.

In fact, being able to edit the rules afterwards would be great too. In my case, I do some domain-knowledge analysis of the generated rules and would sometimes like to remove one or two rules, but I did not find how to do it. Silly me, it's as easy as classifier.ruleset_.rules.pop(<rule_index_to_delete>).
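For instance, something like this (a minimal sketch, assuming a fitted classifier as above):

print(classifier.ruleset_.rules)   # Rule objects, in the order they were learned
classifier.ruleset_.rules.pop(2)   # e.g. drop the third rule after reviewing it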

Does it make sense to you?

Thanks!

imoscovitz commented 5 years ago

Hi @flamby,

Hmm, I might not be understanding the issue.

1) By fitted features, are you referring to what happens to features during the binning stage, to the group of features the model uses in the final ruleset, or to the fact that you might have renamed some columns in your dataframe?

2)

Silly me, it's as easy as classifier.ruleset_.rules.pop()

  • I'm wondering if it would be better behavior to give users a remove-by-index function for the classifier object, so that they don't need to understand the internal workings? Or do you think this is obvious enough that a docstring would be better than adding a simple function? (A rough sketch of what I have in mind is below this list.)
  • This also makes me wonder if there should be an ability to easily add a rule manually...

3) Are you also asking whether it would be possible to output a list of which features are used, so that you can just drop the unneeded ones?
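Roughly what I have in mind for the remove-by-index idea (just a sketch of a possible convenience method, nothing that exists in the package yet):

def remove_rule(clf, idx):
    # sketch: drop and return the rule at position idx of a fitted classifier's ruleset
    return clf.ruleset_.rules.pop(idx)

# e.g. remove_rule(irep, 2) would drop the third rule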

Thanks so much! Ilan

flamby commented 5 years ago

Hi @imoscovitz

  1. By fitted features, are you referring to what happens to features during the binning stage, to the group of features the model uses in the final ruleset, or to the fact that you might have renamed some columns in your dataframe?

To the features the model uses in the final ruleset.

Silly me, it's as easy as classifier.ruleset_.rules.pop()

  • I'm wondering if it would be better behavior to give users a remove-by-index function for the classifier object so that they don't need to understand the internal workings? Or do you think this is obvious enough that some docstring would be better than adding a simple function?

Sure. A convenient method would be a good idea.

  • This also makes me wonder if there should be an ability to easily manually add a rule...

I was just trying to figure out how to do that ;-)

  3. Are you also asking whether it would be possible to output a list of which features are used, so that you can just drop the unneeded ones?

IIUC, I already have a snippet (see below) that does that, but including it, or something better, would help people who use these algorithms for feature selection.

def rules_features(classifier):
    # unique features referenced by the conditions of the fitted ruleset
    features = [cond.feature for rule in classifier.ruleset_.rules for cond in rule.conds]
    return list(set(features))

You could add feature importances, like what we get when we do feature selection with RandomForest, by implementing feature_importances_ to be sklearn compatible.
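Even a simple frequency count over the fitted ruleset could be a starting point (rough sketch, reusing the same attributes as my snippet above):

from collections import Counter

# how many conditions of the fitted ruleset use each feature (a crude notion of importance)
feature_counts = Counter(cond.feature
                         for rule in classifier.ruleset_.rules
                         for cond in rule.conds)
print(feature_counts.most_common())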

BTW, you should publicize this feature selection capability, as these two algorithms are such good methods for it, at least in my case with highly correlated features.

Thanks!

imoscovitz commented 5 years ago

To the features the model uses in the final ruleset.

Gotcha. Can you explain again the problem you're facing during .predict and what you would like the package to be able to do?

I was just trying to figure out how to do that ;-) (add a rule)

Cool. In the meantime, until we add this capability, try something like this:

from wittgenstein.base import Cond, Rule, Ruleset
...
new_cond1 = Cond('feature', 'value')
new_cond2 = Cond('feature', 'value')
...
new_rule = Rule([new_cond1, new_cond2...])
irep.ruleset_.rules.append(new_rule)

If it's a continuous feature though, you'll need to use binned values. The fitted binner would be in irep.bin_transformer_.

You could add feature importances, like what we get when we do feature selection with RandomForest, by implementing feature_importances_ to be sklearn compatible.

Interesting idea. So, because it's an if/and/or thing, rather than a weights-based model, none of the features used by the model are "more important" than the others. "Importance" could mean:

  • Which features appear more often in the ruleset (you might get mostly 1 values)
  • Which features were the most helpful in improving performance on the training set. That's fairly similar to what feature importance usually means, though typically it means importance for prediction sets, not just the training set. What do you think?

BTW, you should publicize this feature selection capability, as these two algorithms are such good methods for it, at least in my case with highly correlated features.

This is a great idea. For the next update, it should be included in the readme. A method/attribute called something like irep.selected_features that makes an easy-to-read list of selected features, similar to the code you wrote above, would be useful.
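Something along these lines, perhaps (a sketch only; selected_features is not part of the package yet):

def selected_features(clf):
    # sketch: unique features referenced by the fitted ruleset, in the order they first appear
    seen = []
    for rule in clf.ruleset_.rules:
        for cond in rule.conds:
            if cond.feature not in seen:
                seen.append(cond.feature)
    return seen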

Thanks!

flamby commented 5 years ago

Hi @imoscovitz

To the features the model uses in the final ruleset.

Gotcha. Can you explain again the problem you're facing during .predict and what you would like the package to be able to do?

Right now, if I train a model with 200 features, then when I want to predict I must provide a dataframe with those 200 features, even though the rules use only 10 of them. If I retrain a model with those 10 features (the ones chosen by the previous model), I don't get the same rules, and in fact the performance of the new model is worse than the first one. Hence my recurring refrain: IREP is a very good feature selection technique; it somehow needs noise to generate good rules ;-)

So I want to keep the first model, but I don't want inference with it to require a dataframe with all 200 features; a dataframe with only the 10 features used in the rules should be enough.

So, as pseudo-code, the change could look like this:

# cols contains the 200 training features
X_train, X_test, y_train, y_test = train_test_split(df[cols], df["Target"],
                                                     train_size=.8, shuffle=False)
classifier = lw.IREP(n_discretize_bins=8, random_state=0)
classifier.fit(X_train, y_train)
# after calling the method below, the model no longer requires the 200 training features as a prerequisite
classifier.remove_unused_features()
rules_cols = classifier.rules_columns_  # <-- similar to my snippet retrieving features from the rules
# rules_columns_ contains the 10 features
classifier.predict(X_test[rules_cols])
# later, once the model is validated, inference can be done from a simple dict
# with the 10 features instead of 200, for every prediction
new_prediction_inputs = {"features1": ..., "features10": ...}
predict_df = pd.DataFrame([new_prediction_inputs])
probas = classifier.predict_proba(predict_df)

Hope it helps to get the idea across ;-)
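And roughly what I imagine remove_unused_features() doing internally (purely hypothetical; the attribute names are made up just to clarify the intent):

def remove_unused_features(clf):
    # hypothetical: shrink the hard-coded list of training columns down to the features the rules actually use
    used = {cond.feature for rule in clf.ruleset_.rules for cond in rule.conds}
    clf.rules_columns_ = [col for col in clf.trainset_features_ if col in used]  # trainset_features_ is hypothetical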

I was just trying to figure out how to do that ;-) (add a rule)

Cool. In the meantime, until we add this capability, try something like this:

from wittgenstein.base import Cond, Rule, Ruleset
...
new_cond1 = Cond('feature', 'value')
new_cond2 = Cond('feature', 'value')
...
new_rule = Rule([new_cond1, new_cond2...])
irep.ruleset_.rules.append(new_rule)

If it's a continuous feature though, you'll need to use binned values. The fitted binner would be in irep.bin_transformer_.

Thanks. It's very simple in fact.

You could add feature importances, like what we get when we do feature selection with RandomForest, by implementing feature_importances_ to be sklearn compatible.

Interesting idea. So, because it's an if/and/or thing, rather than a weights-based model, none of the features used by the model are "more important" than the others. "Importance" could mean:

  • Which features appear more often in the ruleset (you might get mostly 1 values)
  • Which features were the most helpful in improving performance on the training set. That's fairly similar to what feature importance usually means, though typically it means importance for prediction sets, not just the training set. What do you think?

Thanks for the clarification. I guess we'll need to test both methods to see which one is the better predictor, perhaps by comparing their results to LIME or Shapley results.

BTW, you should publicize this feature selection capability, as these two algorithms are such good methods for it, at least in my case with highly correlated features.

This is a great idea. For the next update, it should be included in the readme. A method/attribute called something like irep.selected_features that makes an easy-to-read list of selected features, similar to the code you wrote above, would be useful.

It seems the sklearn API provides a feature_importances_ attribute, which returns an array of each feature's importance in determining the splits (for RandomForest).
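For reference, this is the standard sklearn usage I mean (plain sklearn, nothing wittgenstein-specific):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)
# one impurity-based importance score per training column; the scores sum to 1
importances = dict(zip(X_train.columns, rf.feature_importances_))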

flamby commented 5 years ago

Hi @imoscovitz

I gave selected_features_ a try, and it works like a charm, even simpler than my pseudo-code. Congrats!
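For anyone finding this later, my inference flow now looks roughly like this (assuming the behavior I just described, which is what I tested):

rules_cols = classifier.selected_features_          # the ~10 features the rules actually use
predict_df = pd.DataFrame([new_prediction_inputs])  # dict containing only those features, as in my pseudo-code above
probas = classifier.predict_proba(predict_df[rules_cols])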

I think we can close this issue.

imoscovitz commented 5 years ago

Super!

Thanks!!