related-sciences / nxontology-ml

Machine learning to classify ontology nodes

Add feature importance notebook #29

Closed · yonromai closed this 11 months ago

yonromai commented 11 months ago

This PR adds a notebook to display various feature importance visualizations for any given experiment.

The new notebook is visible here.

Note: I picked the experiment to visualize (pca64_subsets_gpt4_mae) somewhat randomly, mostly because it has many features.

@dhimmel:

cc @eric-czech, in case you find this interesting / have suggestions

eric-czech commented 11 months ago

This PR adds a notebook to display various feature importance visualizations for any given experiment.

Quick question: Is this for an ordinal/continuous catboost model? I thought it was a multi-class model, so I am surprised to not see importances by label class (i.e. low, medium, high).

yonromai commented 11 months ago

(Rebased on the main branch and force-pushed ^)

dhimmel commented 11 months ago

Here is the feature group viz, which I think is most important for the presentation:

[image: feature group importance visualization]

Information content and depth features can be grouped into a category like "Topology" or "Structural". What is n_fg / topological counts? Those probably go in this category as well. For presentation we should give more readable names, like:

Then perhaps one slide for the actual individual feature importance. Machine names okay here since likely only a supplementary slide. Would be nice to see how each feature corresponds to each class, but might be too intense to make that now.

Merge whenever you're ready

dhimmel commented 11 months ago

Note: I picked the experiment to visualize (pca64_subsets_gpt4_mae) somewhat randomly, mostly because it has many features.

That is fine. I think it is helpful to have a maximal model with the most features to produce the most comprehensive feature importance comparisons.

dhimmel commented 11 months ago

Is the conclusion that the PCA transformed description embeddings are way more important than any other feature group? That is quite incredible. I'd expect GPT prompts to be better given that they were tasked with a specific objective.

yonromai commented 11 months ago

Quick question: Is this for an ordinal/continuous catboost model? I thought it was a multi-class model, so I am surprised to not see importances by label class (i.e. low, medium, high).

@eric-czech: That's a good point (you're correct: it is still a multi-class model and not a regression one, even though trying to exploit the ordinal property of the labels is definitely worth a quick future experiment IMO).

The feature importance scores in the notebook come from CatBoost's default get_feature_importance score calculation method (PredictionValuesChange) which, as far as I understand, measures the impact of variations of each feature on the splits of each node of each decision tree. The doc (linked previously) isn't very exhaustive; I'll need to spend more time to better understand how the feature importance scores are actually calculated.
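For context, here is a minimal runnable sketch of pulling these scores out of CatBoost on a small synthetic stand-in dataset (the actual experiment objects aren't reproduced here); the SHAP call at the end is one way we could get per-class attributions if we ever want them:

```py
import numpy as np
from catboost import CatBoostClassifier, Pool

# Tiny synthetic stand-in for the real feature matrix / ordinal labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 3, size=200)  # three classes, e.g. low / medium / high

model = CatBoostClassifier(iterations=50, loss_function="MultiClass", verbose=False)
model.fit(X, y)

# Default importance type is PredictionValuesChange: a single score per feature
# (not broken down by class) measuring how much the prediction changes, on
# average, when the feature's value changes.
importances = model.get_feature_importance(prettified=True)
print(importances)  # DataFrame of feature names and importance scores

# SHAP values are one option for per-class attributions: for a multi-class
# model they are reported per object, per class, per feature.
shap_values = model.get_feature_importance(Pool(X, y), type="ShapValues")
print(shap_values.shape)
```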


@dhimmel:

Information content and depth features can be grouped into a category like "Topology" or "Structural".

Will do

What is n_fg / topological counts? Those probably go in this category as well.

Yes, they are the topological n_* count features.

For presentation we should give more readable names, like:

Okay

Then perhaps one slide for the actual individual feature importance. Machine names okay here since likely only a supplementary slide.

👍

That is fine. I think it is helpful to have a maximal model with the most features to produce the most comprehensive feature importance comparisons.

Okay, but then we'll be showing feature importance from a different model (pca64_subsets_gpt4_mae) than the model that's actually used to output the labels and features in the repo (pca64_mae).

Is the conclusion that the PCA transformed description embeddings are way more important than any other feature group?

Yes, you can see in the table below (from the notebook) that the median importance of the PCA feature group is ~80%, whereas the next most important feature group (the n_* topological features) accounts for only ~5% of the feature importance.

[image: feature group importance statistics table from the notebook]
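The aggregation behind this kind of table is straightforward; a hedged sketch, assuming feature names follow a prefix convention (the names and numbers below are illustrative, not the exact ones in the notebook):

```py
import pandas as pd

# Illustrative per-feature importances (percentages), keyed by machine feature name.
feature_importance = pd.Series(
    {"pca_00": 40.0, "pca_01": 25.0, "pca_02": 15.0,
     "n_children": 3.0, "n_parents": 2.0, "depth": 1.5, "gpt4_score": 1.0}
)

# Map each feature to a feature group via its name prefix (assumed convention).
def feature_group(name: str) -> str:
    if name.startswith("pca_"):
        return "Description Embeddings (PCA)"
    if name.startswith("n_"):
        return "Topology (n_* counts)"
    return "Other"

# Summary statistics per feature group, sorted by total importance.
grouped = feature_importance.groupby(feature_group).agg(["median", "sum"])
print(grouped.sort_values("sum", ascending=False))
```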

That is quite incredible. I'd expect GPT prompts to be better given that they were tasked with a specific objective.

I agree: to me, the most surprising part was the lack of effectiveness of GPT-4 as a standalone classifier.

If you look at how much additional information is added by "standalone" PCA64 (experiment: pca64_mae) vs. "standalone" GPT-4 (experiment: gpt4_mae):

[image: comparison of the standalone pca64_mae and gpt4_mae experiments]

it makes sense that the PCA64 features take the vast majority of the mass.

yonromai commented 11 months ago

@dhimmel I added the feature grouping that you asked for. Gonna merge when the build is green.

BTW, I'm not a huge fan of the log scale of this plot: [image: feature group importance plot, log scale]

And the non-log version is even worse (which is why I originally only included the log version): [image: feature group importance plot, linear scale]

It might be a case where a good old spreadsheet beats the fancy boxplot:

[image: feature group importance statistics table]

WDYT?

ravwojdyla commented 11 months ago

@yonromai One thing to consider/improve there: you can use the background_gradient pandas styling:

[image: the statistics table styled with background_gradient]

Code

```py
from io import StringIO

import pandas as pd

# `ddd` holds the tab-separated statistics table as a string.
(
    pd.read_csv(StringIO(ddd), sep="\t")
    .set_index("Features")
    .style.background_gradient(cmap="Blues", axis=None, vmin=0, vmax=1)
    .format("{:.4f}")
    .set_properties(**{"font-weight": "bold"}, subset=["Median"])
    .set_caption("Statistics")
)
```
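One note on the snippet: `axis=None` applies the gradient over the whole table rather than normalizing each column separately, which keeps the color scale comparable across all the statistics.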

dhimmel commented 11 months ago

Nice. I actually like the non-log version, as it's an accurate representation of the feature importance. I am guessing you don't like how the description embeddings have most of the effect and the variance is low, so the boxes are not that helpful?

If you have time, it would be great to get the non-log importance figure using the human feature group names. Can remove the axis label "feature" or we can crop that in the slides. Can consider a swarmplot or violin instead of a boxplot. There aren't that many points right? The different plot types only if you think that makes sense.

The log plot or spreadsheet is nice to show ordering / relative importance of the irrelevant feature groups, which is a secondary point. The main one is that Description Embeddings, Topology, and Cross-References are doing almost all the work.

yonromai commented 11 months ago

@yonromai One thing to consider/improve there: you can use the background_gradient pandas styling:

Neat, I hadn't done this before - I'll update! I guess spreadsheets aren't really cool anymore (at the end of the day it's the same data, right?)

Nice. I actually like the non log version, as its an accurate representation of the feature importance.

The log plot or spreadsheet is nice to show ordering / relative importance of the irrelevant feature groups

I assumed with the spreadsheet that people would figure that median(Description Embeddings Importance) = 80% clearly indicates that it is by far the most important feature - but I'll use a plot if you prefer that.

I am guessing you don't like how the description embeddings have most of the effect and the variance is low, so the boxes are not that helpful? Can consider a swarmplot or violin instead of a boxplot. There aren't that many points right? The different plot types only if you think that makes sense.

"the variance is low, so the boxes are not that helpful?": Exactly. I feel like boxplots aren't a great fit here since they're so squeezed that it's hard to interpret the info accurately. I agree that a violin plot could work better, but I wonder if a simpler graph like a barplot (or a spreadsheet!) would do a better job in this case.

Can consider a swarmplot or violin instead of a boxplot. There aren't that many points right? The different plot types only if you think that makes sense.

Okay I'll make the change

dhimmel commented 11 months ago

I agree that a violin plot could work better, but I wonder if a simpler graph like a barplot (or a spreadsheet!) would do a better job in this case.

We're on the same page. A bar plot would probably be the best option: the simplest plot that shows everything the viewer should focus on. I think between the plots and the styled spreadsheet, we've got all we need.
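For reference, a minimal seaborn sketch of such a bar plot, assuming a tidy frame with one row per feature and a human-readable group name (the column names and importance values below are illustrative, not the notebook's actual data):

```py
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative per-feature importances (%), tagged with human-readable group names.
df = pd.DataFrame(
    {
        "Feature group": ["Description Embeddings"] * 3
        + ["Topology"] * 2
        + ["Cross-References"] * 2,
        "Importance (%)": [40.0, 25.0, 15.0, 3.0, 2.0, 1.0, 0.5],
    }
)

# Aggregate to one bar per feature group, then plot on a linear (non-log) scale.
totals = df.groupby("Feature group", sort=False)["Importance (%)"].sum().reset_index()
ax = sns.barplot(data=totals, x="Importance (%)", y="Feature group", color="steelblue")
ax.set_ylabel("")  # drop the redundant axis label, per the suggestion above
plt.tight_layout()
plt.show()
```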