Closed nickbhat closed 4 years ago
I've pushed the full analysis across all feature sets into full_stability.ipynb. Here's a quick tl;dr you can refer to (I also tried writing some plain English in the notebooks).
We are doing k-fold cross validation repeated n separate times. This results in k*n total folds.
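To make the fold count concrete, here is a minimal stdlib-only sketch of repeated k-fold splitting. The function name `repeated_kfold_indices` and the toy sizes are made up for illustration and are not taken from the notebooks. It also does not stratify by class, which the actual analysis may well do (scikit-learn's `RepeatedStratifiedKFold` is one way to get that).

```python
import random

def repeated_kfold_indices(n_samples, k, n_repeats, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for k-fold CV repeated n_repeats times.

    Hypothetical helper for illustration only; no stratification is done.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # fresh shuffle per repeat, so the folds differ between repeats
        for fold in range(k):
            test_idx = indices[fold::k]  # every k-th shuffled sample goes to this fold's test set
            test_set = set(test_idx)
            train_idx = [i for i in indices if i not in test_set]
            yield repeat, fold, train_idx, test_idx

# Toy sizes (assumed): 20 samples, k = 5 folds, n = 3 repeats.
folds = list(repeated_kfold_indices(n_samples=20, k=5, n_repeats=3))
print(len(folds))  # k * n total folds
```

Each of the k*n entries gives one train/test split, so every downstream model (tree, LASSO) is fit once per entry.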
PR Curve
There are n PR curves, which I have plotted on top of one another in a single plot, as shown below. The function I wrote colors the curves by classifier.
Tree Analysis
For each of the k*n folds, we get a tree with its own decision rules. I have displayed the rules for each level separately (left y-labels in the plot below), along with a count of how many folds each rule appeared in (x-axis). Knowing whether a rule "appeared" isn't useful without knowing whether the rule actually helped, so the right y-labels say how many examples were separated out into a pure node by this rule at this level. This number varies per fold, so I am displaying only the mean. NOTE: If there is more than one level 1 rule, this plot will not say anything about interactions! Luckily, our top-level rules tend to be stable.
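The tally behind that plot could be sketched roughly as follows. This assumes the rules have already been extracted from each fold's tree into `(level, rule, n_pure)` tuples. The `fold_rules` data, the rule strings, and the `summarize_rules` helper are all hypothetical, and the notebook's own implementation may differ.

```python
from collections import defaultdict

# Made-up example input: one list per fold, each entry is
# (tree level, rule text, number of examples the rule sent to a pure node).
fold_rules = [
    [(1, "feat_a <= 0.5", 12), (2, "feat_b <= 0.5", 4)],
    [(1, "feat_a <= 0.5", 10), (2, "feat_c <= 0.5", 3)],
    [(1, "feat_a <= 0.5", 14), (2, "feat_b <= 0.5", 6)],
]

def summarize_rules(fold_rules):
    """Count how many folds each (level, rule) appears in and average its pure-node size."""
    pure_by_rule = defaultdict(list)
    for rules in fold_rules:
        per_fold = {}
        for level, rule, n_pure in rules:
            # The same rule can occur in several nodes of one level; sum within the fold.
            per_fold[(level, rule)] = per_fold.get((level, rule), 0) + n_pure
        for key, n_pure in per_fold.items():
            pure_by_rule[key].append(n_pure)
    return {
        key: {"n_folds": len(values), "mean_pure": sum(values) / len(values)}
        for key, values in pure_by_rule.items()
    }

summary = summarize_rules(fold_rules)
for (level, rule), stats in sorted(summary.items()):
    print(level, rule, stats["n_folds"], stats["mean_pure"])
```

Note that the mean is taken only over folds where the rule appeared, so a rare rule with a large pure node can look strong; reading it next to the fold count guards against that.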
LASSO Coefficients
For each of the k*n folds, we also train a separate LASSO logistic regression. I made a violin plot of the coefficients (aka effect sizes) over all folds. NOTE: I am filtering out features whose effect sizes are basically all zero, to keep the plot readable.
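One way the "basically all zero" filter could work is sketched below. The thresholds `tol` and `min_frac`, the helper name `keep_nonzero_features`, and the coefficient values are assumptions made for illustration; the notebook may use a different criterion.

```python
def keep_nonzero_features(coefs_by_feature, tol=1e-6, min_frac=0.1):
    """Keep features whose coefficient is non-zero in at least min_frac of the folds.

    coefs_by_feature maps a feature name to its list of per-fold coefficients.
    tol and min_frac are assumed thresholds, not values from the notebook.
    """
    kept = {}
    for name, coefs in coefs_by_feature.items():
        if not coefs:
            continue  # nothing to plot for a feature with no recorded folds
        frac_nonzero = sum(abs(c) > tol for c in coefs) / len(coefs)
        if frac_nonzero >= min_frac:
            kept[name] = coefs
    return kept

# Made-up coefficients over four folds.
coefs = {
    "feat_a": [0.8, 0.9, 0.7, 0.85],
    "feat_b": [0.0, 0.0, 0.0, 0.0],
    "feat_c": [0.0, -0.3, 0.0, -0.2],
}
kept = keep_nonzero_features(coefs)
print(sorted(kept))
```

The surviving per-feature lists are what would go into the violin plot, one violin per feature.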
Currently writing a notebook to run the stability analysis for siderophore prediction with only pfam features, to keep things concrete. Some things I've noticed as I make it:
The initial analysis is in pfam_stability.ipynb. I will put this all into some functions and run it for all sets of features tomorrow.