Perform stability analysis for all classifiers

I've pushed the full analysis across all feature sets into full_stability.ipynb. Here's a quick tl;dr you can refer to (I also tried to write some plain English in the notebooks).

We run k-fold cross-validation n separate times, which gives k*n folds in total.
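For reference, here is a minimal sketch of how the k*n folds can be generated. It assumes scikit-learn's `RepeatedStratifiedKFold`; whether the notebook stratifies, and the values of `k` and `n`, are assumptions on my part, and the data is a toy stand-in.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

k, n = 5, 10  # hypothetical values; the PR does not state k or n

# Toy data standing in for the real feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0, 1] * 50)

# k-fold cross-validation repeated n times, reshuffled on every repeat.
cv = RepeatedStratifiedKFold(n_splits=k, n_repeats=n, random_state=0)
folds = list(cv.split(X, y))

# Each entry is a (train_indices, test_indices) pair; there are k*n of them.
print(len(folds))
```

Every analysis below loops over these (train, test) pairs, so each one sees the same k*n fitted models.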

PR Curve: There are n PR curves, which I have plotted on top of one another in a single plot, as shown below. The plotting function I wrote colors the curves by classifier.

Screenshot from 2020-01-23 22-56-54
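One way to end up with n curves is to pool the out-of-fold predictions within each repeat and compute a single PR curve from them. The sketch below does that under assumptions: the classifier, the toy data, and the values of `k` and `n` are placeholders, and this is not necessarily how the notebook function computes its curves.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

k, n = 5, 3  # hypothetical values

# Toy data: the label depends on the first feature plus some noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

curves = []  # one (recall, precision) pair per repeat
for repeat in range(n):
    # A differently shuffled k-fold split for every repeat.
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=repeat)
    # Out-of-fold probability of the positive class for every sample.
    scores = cross_val_predict(
        LogisticRegression(), X, y, cv=cv, method="predict_proba"
    )[:, 1]
    precision, recall, _ = precision_recall_curve(y, scores)
    curves.append((recall, precision))
```

Each `(recall, precision)` pair can then be drawn on the same axes, with one color per classifier.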

Tree Analysis: Each of the k*n folds gives a tree with its own decision rules. I have displayed the rules for each level separately (left y-labels in the plot below), along with a count of how many folds each rule appeared in (x-axis). Knowing whether a rule "appeared" isn't useful without knowing whether the rule actually helped, so the right y-labels say how many examples were separated out into a pure node by this rule at this level. This number varies per fold, so I am displaying only the mean. NOTE: If there is more than one level 1 rule, this plot will not say anything about interactions! Luckily, our top-level rules tend to be stable.

Screenshot from 2020-01-23 23-01-28
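A rough sketch of the bookkeeping behind this kind of plot, using the `tree_` arrays of a fitted scikit-learn `DecisionTreeClassifier`. The helper `rules_by_level`, the toy binary features, and the reading of "separated out into a pure node" as "training examples sent into a child with zero impurity" are all assumptions for illustration.

```python
from collections import defaultdict

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def rules_by_level(clf, feature_names):
    """Yield (level, rule, n_pure) for every split in a fitted tree.

    n_pure is the number of training examples this split sends into a
    pure child node (hypothetical helper, not from the notebook).
    """
    t = clf.tree_
    stack = [(0, 1)]  # (node id, level); the root split is level 1
    while stack:
        node, level = stack.pop()
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no rule here
            continue
        rule = f"{feature_names[t.feature[node]]} <= {t.threshold[node]:.2f}"
        n_pure = sum(
            int(t.n_node_samples[c]) for c in (left, right) if t.impurity[c] < 1e-9
        )
        yield level, rule, n_pure
        stack.extend([(left, level + 1), (right, level + 1)])


# Toy binary features; the label is f0 AND f1, and f2 is noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3))
y = X[:, 0] & X[:, 1]
names = ["f0", "f1", "f2"]

k, n = 5, 4  # hypothetical values
cv = RepeatedStratifiedKFold(n_splits=k, n_repeats=n, random_state=0)
pure_counts = defaultdict(list)  # (level, rule) -> n_pure for each fold it appeared in
for train_idx, _ in cv.split(X, y):
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    for level, rule, n_pure in rules_by_level(clf, names):
        pure_counts[(level, rule)].append(n_pure)

# Fold count (x-axis) and mean pure-node size (right y-labels) per rule.
for (level, rule), vals in sorted(pure_counts.items()):
    print(f"level {level}: {rule}  folds={len(vals)}  mean_pure={np.mean(vals):.1f}")
```

Grouping by `(level, rule)` is what makes the interaction caveat above visible: a level 2 rule is counted the same way no matter which level 1 rule sits above it.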

LASSO Coefficients: Over each of the k*n folds, we also train a separate LASSO logistic regression. I made a violin plot of the coefficients (aka effect sizes) over all folds. NOTE: I am filtering out features whose effect sizes are essentially all zero, to keep the plot readable.

Screenshot from 2020-01-23 23-04-07
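A minimal sketch of collecting the per-fold coefficients and applying a zero-effect filter. The regularization strength `C`, the `min_frac` cutoff, and the toy data are assumptions; the notebook may define "basically all zero" differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy data: only the first two features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (2.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=300) > 0).astype(int)

k, n = 5, 4  # hypothetical values
cv = RepeatedStratifiedKFold(n_splits=k, n_repeats=n, random_state=0)

coefs = []
for train_idx, _ in cv.split(X, y):
    # L1-penalized (LASSO) logistic regression; C is a placeholder value.
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    model.fit(X[train_idx], y[train_idx])
    coefs.append(model.coef_.ravel())
coefs = np.vstack(coefs)  # shape (k*n, n_features)

# Keep a feature only if its coefficient is non-zero in at least
# min_frac of the folds (assumed filtering rule).
min_frac = 0.1
nonzero_frac = (np.abs(coefs) > 1e-8).mean(axis=0)
keep = nonzero_frac >= min_frac
filtered = coefs[:, keep]
```

Each column of `filtered` is the distribution of one feature's effect size across folds, which is what a violin plot (for example matplotlib's `Axes.violinplot`) would draw.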
