nickbhat / bgc_tran

GNU General Public License v3.0

Perform stability analysis for all classifiers #7

Closed nickbhat closed 4 years ago

nickbhat commented 4 years ago

Currently writing a notebook to perform stability analysis for siderophore prediction using only Pfam features, to keep things concrete. Some things I've noticed while building it:

  1. For gram-positives, only 15/193 BGCs (about 8%) are siderophore-producing. For gram-negatives the balance is far better at 24/64 (about 38%).
  2. A single rule captures essentially all of the signal for gram-negative (G-) siderophore prediction.

The initial analysis is in pfam_stability.ipynb. I will wrap it in functions and run it for all feature sets tomorrow.

nickbhat commented 4 years ago

I've pushed the full analysis across all feature sets into full_stability.ipynb. Here's a quick tl;dr you can refer to (I also tried to write plain-English explanations in the notebooks).

We are doing k-fold cross-validation repeated n separate times. This results in k*n total folds.
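To make the k*n fold count concrete, here is a minimal stdlib-only sketch of repeated k-fold splitting. The function name `repeated_kfold` and the values of `k`, `n_repeats`, and `seed` are assumptions for illustration, not taken from the notebooks. It also does not stratify by class, which would likely matter with so few positives; scikit-learn's `RepeatedStratifiedKFold` handles that, though this thread does not say which splitter the notebooks use.

```python
import random

def repeated_kfold(n_samples, k, n_repeats, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for k-fold CV repeated n_repeats times."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle per repeat gives a different partition
        for fold in range(k):
            test_idx = indices[fold::k]  # every k-th sample, so fold sizes differ by at most 1
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield repeat, fold, train_idx, test_idx

# 64 samples mirrors the gram-negative set size; k=5 and n_repeats=10 are made up.
folds = list(repeated_kfold(n_samples=64, k=5, n_repeats=10))
print(len(folds))  # k * n_repeats = 50
```

Each of the resulting folds would get its own tree and its own LASSO fit, which is what the plots below summarize.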

PR Curve: There are n PR curves, plotted on top of one another in a single plot as shown below. The plotting function I wrote colors the curves by classifier.

Screenshot from 2020-01-23 22-56-54

Tree Analysis: Each of the k*n folds yields a tree with its own decision rules. I display all rules for each level separately (left y-labels in the plot below), along with a count of how many folds each rule appeared in (x-axis). Knowing whether a rule "appeared" isn't useful without knowing whether it actually helped, so the right y-labels show how many examples the rule separated into a pure node at that level. This number varies per fold, so I display only the mean. NOTE: If there is more than one level 1 rule, this plot will not say anything about interactions! Luckily, our top-level rules tend to be stable.
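The tally behind this plot can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the function name `summarize_rules`, the `(level, rule, n_pure)` tuple layout, and the toy rules are all hypothetical, and the mean here is taken only over occurrences of a rule rather than over all folds.

```python
from collections import defaultdict

def summarize_rules(fold_rules):
    """fold_rules: one list per fold of (level, rule, n_pure) tuples.

    Returns {(level, rule): (n_folds_with_rule, mean_n_pure)}, where the mean
    is taken over the occurrences of that rule at that level.
    """
    folds_seen = defaultdict(set)
    pure_counts = defaultdict(list)
    for fold_id, rules in enumerate(fold_rules):
        for level, rule, n_pure in rules:
            folds_seen[(level, rule)].add(fold_id)
            pure_counts[(level, rule)].append(n_pure)
    return {
        key: (len(folds_seen[key]), sum(pure_counts[key]) / len(pure_counts[key]))
        for key in folds_seen
    }

# Toy input: three folds with made-up rules and counts (not real results).
toy_folds = [
    [(1, "feature_A <= 0.5", 20), (2, "feature_B <= 0.5", 4)],
    [(1, "feature_A <= 0.5", 22), (2, "feature_C <= 0.5", 3)],
    [(1, "feature_A <= 0.5", 21), (2, "feature_B <= 0.5", 6)],
]
summary = summarize_rules(toy_folds)
for (level, rule), (n_folds, mean_pure) in sorted(summary.items()):
    print(level, rule, n_folds, mean_pure)
```

Because rules are keyed by level alone, a level 2 rule is counted the same way regardless of which level 1 rule sits above it, which is why the plot cannot show interactions.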

Screenshot from 2020-01-23 23-01-28

LASSO Coefficients: Over each of the k*n folds we also train a separate LASSO logistic regression. I made a violin plot of the coefficients (i.e. effect sizes) across all folds. NOTE: I filter out features whose effect sizes are essentially all zero, to keep the plot readable.
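One way the filtering step could look is sketched below. The thread does not state the actual cutoff for "essentially all zero", so `tol` and `min_nonzero_frac` are assumed parameters, and `drop_near_zero` and the toy coefficients are hypothetical.

```python
def drop_near_zero(coefs_by_feature, tol=1e-6, min_nonzero_frac=0.1):
    """Keep features whose coefficient is non-negligible in enough folds.

    coefs_by_feature maps a feature name to its list of per-fold coefficients.
    A feature is kept when at least min_nonzero_frac of its coefficients
    have absolute value above tol.
    """
    kept = {}
    for feature, coefs in coefs_by_feature.items():
        if not coefs:
            continue  # no fitted folds for this feature, nothing to plot
        nonzero = sum(1 for c in coefs if abs(c) > tol)
        if nonzero / len(coefs) >= min_nonzero_frac:
            kept[feature] = coefs
    return kept

# Toy coefficients over four folds (made up, not real results).
toy_coefs = {
    "feature_A": [1.2, 0.9, 1.1, 1.0],
    "feature_B": [0.0, 0.0, 0.0, 0.0],
    "feature_C": [0.0, 0.0, -0.4, 0.0],
}
strict = drop_near_zero(toy_coefs, min_nonzero_frac=0.5)
loose = drop_near_zero(toy_coefs, min_nonzero_frac=0.25)
print(sorted(strict), sorted(loose))
```

A stricter fraction keeps the violin plot compact, at the cost of hiding features that LASSO selects only in a few folds, which may themselves be a sign of instability.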

Screenshot from 2020-01-23 23-04-07