tensorflow / lucid

A collection of infrastructure and tools for research in neural network interpretability.
Apache License 2.0

Research: Caricatures #121

Open colah opened 5 years ago

colah commented 5 years ago

🔬 This is an experiment in doing radically open research. I plan to post all my work on this openly as I do it, tracking it in this issue. I'd love for people to comment, or better yet collaborate! See more.

Please be respectful of the fact that this is unpublished research and that people involved in this are putting themselves in an unusually vulnerable position. Please treat it as you would unpublished work described in a seminar or by a colleague.

Description

Caricatures are a powerful feature visualization technique that we haven't fully explored or published on yet. Roughly, they allow us to take an input image, feed it through to some layer of a network, and get a sense of how the network understood it.

image

Caricatures do this by creating a new image that has a similar but more extreme activation pattern to the original at a given layer.
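To make "similar but more extreme" concrete, here is a minimal sketch on a toy model. It is not the notebook's implementation. It assumes a single random ReLU layer standing in for a real network, and it assumes the objective is the dot product between the new activations and the original ones. The names `layer_activations` and `caricature` are hypothetical.

```python
import numpy as np

def layer_activations(W, x):
    """Toy 'layer': one ReLU layer standing in for a real network (assumption)."""
    return np.maximum(W @ x, 0.0)

def caricature(W, x0, steps=100, lr=0.1):
    """Gradient ascent on dot(f(x), f(x0)), starting from the original input x0."""
    target = layer_activations(W, x0)  # activation pattern we want to exaggerate
    x = x0.copy()
    for _ in range(steps):
        pre = W @ x
        # Gradient of sum(relu(W x) * target) with respect to x.
        grad = W.T @ ((pre > 0) * target)
        # Normalized step so the step size does not depend on gradient scale.
        x += lr * grad / (np.linalg.norm(grad) + 1e-8)
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))   # hypothetical layer weights
x0 = rng.normal(size=8)        # hypothetical "input image"

x_car = caricature(W, x0)
a0 = layer_activations(W, x0)
a1 = layer_activations(W, x_car)

cosine = a1 @ a0 / (np.linalg.norm(a1) * np.linalg.norm(a0))
print("activation norm before/after:", np.linalg.norm(a0), np.linalg.norm(a1))
print("cosine similarity to original pattern:", cosine)
```

In this toy setting the optimized input yields a larger activation vector that still points in a direction related to the original one. A real caricature would optimize an image through a trained network, typically with an image parameterization and transformation robustness, which this sketch leaves out.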

There are two related properties that make caricatures really interesting as a visualization:

This makes caricatures a really important technique, because:

  1. they are our first, simplest line of attack on model comparison

  2. they are a super useful tool for debugging feature visualization when it doesn’t work (because they remove neuron choice as a potential problem).

    image

Next Steps

  1. Caricatures are much more powerful when shown in context, as demonstrated at the top of this notebook. It would be great to scale this!

  2. It would be super exciting to do more controlled experiments of changing network architectures and see how the caricatures respond. (The models would also be a useful resource to have for future model comparison work.) The ones I'm most immediately excited about are exploring network branches, the effect of data sets, and preprocessing.

  3. We've recently had some early exciting results about "attributive caricatures" which might be interesting to explore:

    image

  4. It might be useful to show how they can be used for debugging feature vis.

colah commented 5 years ago

Some random interesting things:

image

colah commented 5 years ago

I've been thinking about "attribution caricatures" a lot more. See examples and notebook.

Attribution caricatures can be made to the output classes, as we saw earlier:

image

But they can also be done to a hidden layer, creating a caricature at one layer that emphasizes features that will be important at a later layer:

image
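One plausible reading of an attribution caricature, sketched on a toy model: weight the original hidden activations by how much each unit contributes to a later target, then optimize the input toward that weighted pattern. This is an assumption, not the method from the notebook. The sketch assumes a two-layer toy network, gradient-times-activation as the attribution, and hypothetical function names throughout.

```python
import numpy as np

def hidden(W1, x):
    """Toy hidden layer: ReLU(W1 x), standing in for a real network layer."""
    return np.maximum(W1 @ x, 0.0)

def attribution_to_class(W1, W2, x0, cls):
    """Gradient-times-activation attribution of one logit to the hidden units."""
    h0 = hidden(W1, x0)
    # logits = W2 @ h, so d logit[cls] / d h is simply W2[cls].
    return h0 * W2[cls]

def attribution_caricature(W1, attr, x0, steps=200, lr=0.1):
    """Ascend dot(hidden(x), attr), accepting only steps that improve the score."""
    x = x0.copy()
    best = hidden(W1, x) @ attr
    for _ in range(steps):
        grad = W1.T @ ((W1 @ x > 0) * attr)
        norm = np.linalg.norm(grad)
        if norm == 0.0:
            break
        candidate = x + lr * grad / norm
        score = hidden(W1, candidate) @ attr
        if score > best:
            x, best = candidate, score
        else:
            lr *= 0.5  # step overshot; shrink it and try again
    return x

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))   # hypothetical hidden-layer weights
W2 = rng.normal(size=(3, 16))   # hypothetical linear head over 3 classes
x0 = rng.normal(size=8)         # hypothetical "input image"
cls = 1                         # class whose evidence we want to emphasize

attr = attribution_to_class(W1, W2, x0, cls)
x_attr = attribution_caricature(W1, attr, x0)

score_before = hidden(W1, x0) @ attr
score_after = hidden(W1, x_attr) @ attr
print("attribution-weighted score before/after:", score_before, score_after)
```

Swapping the class logit for an objective defined on a later hidden layer would give the hidden-layer variant described above. With a linear head the attributions sum exactly to the logit, which is a handy sanity check in this toy case.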

A related idea is "iterated attribution": apply attribution iteratively to each layer between a start and end point. It's not clear this is principled, but the results seem interesting:

image

ncammarata commented 5 years ago

I found an interesting example of how attribution caricatures perceive a bookshelf as different classes. View here.

image