iterative / example-repos-dev

Source code and generator scripts for example DVC projects
https://dvc.org/doc

get-started: bar plot for feature importance #135

Open dberenbaum opened 1 year ago

dberenbaum commented 1 year ago

Bar charts were added in https://github.com/iterative/dvc-render/issues/8. Should we switch the feature importance plot from image to bar plot? I'm not sure it's worth it since then we will have no static image plots.

shcheklein commented 1 year ago

Yes, maybe we can come up with some other plots that are reasonable for this workflow instead of removing the feature importance image. Does anything come to mind, @dberenbaum @daavoo?

In the end, I think it would be nice to have more plots.

daavoo commented 1 year ago

We could plot the distribution of samples across target labels (0, 1) and/or splits (train, test). Those are usually represented as bar plots and would be associated with a different stage (prepare?)
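
As a rough illustration of the idea above, here is a minimal sketch that counts samples per split and target label and serializes the counts as CSV rows, one row per bar. The input shape, the function names, and the column names (`split`, `label`, `count`) are assumptions made for this example; they are not taken from the get-started project.

```python
import csv
import io
from collections import Counter


def label_distribution(splits):
    """Count samples per (split, label) pair.

    `splits` maps a split name to a list of binary labels (a hypothetical
    input shape; the real prepare stage may store labels differently).
    """
    rows = []
    for split_name, labels in splits.items():
        counts = Counter(labels)
        for label in sorted(counts):
            rows.append(
                {"split": split_name, "label": label, "count": counts[label]}
            )
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text, one row per bar."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["split", "label", "count"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy labels for illustration only, not the project's data.
toy_splits = {"train": [0, 0, 1, 0, 1], "test": [0, 1, 1]}
distribution = label_distribution(toy_splits)
csv_text = rows_to_csv(distribution)
```

A stage could write `csv_text` to a file and track that file as a plot, with `label` or `split` on one axis and `count` on the other.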

dberenbaum commented 1 year ago

I would prefer to convert feature importance to a bar plot since we have support for it, and then add another image plot.

One idea is a SHAP summary plot, which is a more robust feature importance method:

[image: SHAP summary plot]

It doesn't hurt to also keep the traditional feature importance as a bar plot since all of these methods have pros and cons, and it can help to look at more than one method.
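
To make the bar-plot version of feature importance concrete, here is a minimal sketch that turns feature names and importance scores into sorted rows that a bar plot could consume. The function name, the `top_n` parameter, and the field names (`name`, `importance`) are hypothetical. The scores are assumed to come from something like the `feature_importances_` attribute of a fitted scikit-learn tree ensemble, which may not match how the project computes them.

```python
import json


def importance_rows(feature_names, importances, top_n=None):
    """Pair each feature with its score and sort from most to least important."""
    if len(feature_names) != len(importances):
        raise ValueError("feature_names and importances must have equal length")
    rows = [
        {"name": name, "importance": float(score)}
        for name, score in zip(feature_names, importances)
    ]
    rows.sort(key=lambda row: row["importance"], reverse=True)
    # Optionally keep only the highest-scoring features.
    return rows if top_n is None else rows[:top_n]


# Made-up feature names and scores, for illustration only.
toy_rows = importance_rows(["a", "b", "c"], [0.1, 0.6, 0.3], top_n=2)
toy_json = json.dumps(toy_rows, indent=2)
```

Writing the rows to a JSON or CSV file instead of rendering a PNG would leave the rendering to the plotting tool, which is what makes a second, image-based plot useful to keep alongside it.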

shcheklein commented 1 year ago

Okay, sounds good, we can try both. I like @daavoo's suggestion since it's way simpler. I would add another image too though, I think it's good to have more images.

Let's take this when we are done with the global/flexible plots iteration and https://github.com/iterative/example-repos-dev/pull/117 is merged?

shcheklein commented 1 year ago

We can prioritize this in docs planning that @jorgeorpinel is preparing as a task that one of the people from the bigger "docs" group can take (including me, I would be happy to do this).

dberenbaum commented 1 year ago

I started on the SHAP one in https://github.com/iterative/example-repos-dev/pull/136, so anyone can feel free to pick up from there. There's a SHAP package, so it's not difficult to add.

Having some sample distribution plot is a good idea, although I have a couple concerns about the suggested bar plot:

  1. It won't change between experiments.
  2. Since the data is binary, a bar plot is likely not as interesting or realistic as the others.

I don't think these are blockers, but maybe if someone works on these, they can try to find ways to make it more interesting.

A histogram of predictions from training and test data might be another good bar plot.
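
A minimal sketch of that last idea, assuming the model outputs probabilities in [0, 1]: bin the predictions for each split into equal-width buckets and emit one row per bucket. The function name, the default bin count, and the field names are assumptions for illustration. Unlike the label distribution, these counts would change whenever the model changes.

```python
def prediction_histogram(probabilities, split, n_bins=10):
    """Count predicted probabilities in equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        # A prediction of exactly 1.0 goes into the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        counts[index] += 1
    return [
        {
            "split": split,
            "bin": f"{i / n_bins:.1f}-{(i + 1) / n_bins:.1f}",
            "count": count,
        }
        for i, count in enumerate(counts)
    ]


# Made-up predictions, for illustration only.
train_hist = prediction_histogram([0.05, 0.15, 0.12, 0.95, 1.0], "train", n_bins=5)
test_hist = prediction_histogram([0.5, 0.45], "test", n_bins=5)
histogram_rows = train_hist + test_hist
```

Keeping the `split` field on every row lets the train and test histograms be drawn side by side from a single file.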