tensorflow / tcav

Code for the TCAV ML interpretability project
Apache License 2.0
631 stars 149 forks

Usage #35

Closed arita37 closed 5 years ago

arita37 commented 5 years ago

Hello,

Wondering if TCAV can be used with sparse tabular input data, or do we need to include an embedding?

NicoliAraujo commented 5 years ago

I'm also interested in applying TCAV to regular structured data. Any help with that?

jameswex commented 5 years ago

TCAV was developed to help understand the representation of data in ML models that encode information in high-dimensional space, so it does require that the data be represented by a high-dimension embedding/activation in the internals of a network.

What types of ideas were you thinking of exploring on structured data? Can you give examples?

NicoliAraujo commented 5 years ago

I was thinking about exploring the relationship between certain consumer profile groups (young, unemployed, male/female, ...) and their credit scores (the probability of paying a debt, as predicted by a dense neural network).

arita37 commented 5 years ago

I'm thinking of user tabular data: age, gender, ...

Do we need to create embeddings for those?

jameswex commented 5 years ago

@NicoliAraujo, are those consumer profile groups part of the input to the network that predicts the probability of paying debt? If they already are input features, then TCAV isn't necessary, and you could instead use any input saliency measure (path integrated gradients, Shapley values, ...)
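To make the "input saliency" alternative concrete, here is a minimal sketch of path integrated gradients on a toy logistic model. Everything in it is an assumption made for illustration: the feature names, the weights, the all-zeros baseline, and the number of integration steps. It is not taken from the TCAV repository or from any particular attribution library.

```python
import numpy as np

# Hypothetical logistic credit model over three tabular features.
FEATURES = ["age_scaled", "unemployed", "income_scaled"]
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def predict(x):
    """Probability of repayment for one row or a batch of rows."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of path integrated gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    # Points on the straight line from the baseline to x, shape (steps, features).
    path = baseline + alphas[:, None] * (x - baseline)
    p = predict(path)
    # Analytic gradient of the sigmoid model at every point on the path.
    grads = (p * (1.0 - p))[:, None] * w
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, 1.0, 0.3])
baseline = np.zeros(3)  # assumed reference input
attributions = integrated_gradients(x, baseline)
for name, value in zip(FEATURES, attributions):
    print(f"{name}: {value:+.4f}")
```

The attributions should sum to approximately `predict(x) - predict(baseline)`, which is a handy sanity check. Note that this explains one row at a time, so it answers a different question than a class-level method.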

arita37 commented 5 years ago

The question was: does it make sense to use tabular data with TCAV, or do we absolutely need an embedding?

jameswex commented 5 years ago

TCAV can be used when the inputs to your ML model are tabular data. You define the concepts to test for by creating a set of input examples that exemplify each concept (like a set of images that are all full of stripes to define the concept "stripey"). TCAV can then use those examples (along with sets of random, non-stripey inputs) to determine whether your model is sensitive to the concept you defined through your examples. To do so, TCAV needs to look at internal activations of your model (such as a logits layer) across these examples. So, embeddings aren't required.
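The procedure described above can be sketched roughly as follows for a tabular model. This is a toy illustration under stated assumptions, not the tcav library's API: the two-layer network, its random weights, the "feature 0 is high" concept, and the least-squares linear classifier are all hypothetical stand-ins (the library trains its own linear classifier on real model activations).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network on tabular input: x -> h1 -> h2 -> logit.
W1 = rng.normal(size=(8, 4))  # 4 input features -> 8-unit internal layer
W2 = rng.normal(size=(6, 8))
w3 = rng.normal(size=6)

def bottleneck(x):
    """Activations of the internal layer at which the CAV is learned."""
    return np.tanh(x @ W1.T)

def logit_grad_wrt_bottleneck(h1):
    """Gradient of the output logit with respect to the internal activations."""
    h2 = np.tanh(h1 @ W2.T)
    return ((1.0 - h2 ** 2) * w3) @ W2

def learn_cav(concept_acts, random_acts):
    """Fit a least-squares linear classifier; its unit normal is the CAV."""
    acts = np.vstack([concept_acts, random_acts])
    labels = np.concatenate([np.ones(len(concept_acts)), -np.ones(len(random_acts))])
    design = np.hstack([acts, np.ones((len(acts), 1))])  # append a bias column
    weights, *_ = np.linalg.lstsq(design, labels, rcond=None)
    cav = weights[:-1]
    return cav / np.linalg.norm(cav)

def tcav_score(class_inputs, cav):
    """Fraction of class examples whose logit increases along the CAV direction."""
    grads = logit_grad_wrt_bottleneck(bottleneck(class_inputs))
    return float(np.mean(grads @ cav > 0))

# Made-up data: the "concept" is simply that feature 0 is unusually high.
concept_x = rng.normal(size=(50, 4))
concept_x[:, 0] += 3.0
random_x = rng.normal(size=(50, 4))
class_x = rng.normal(size=(100, 4))

cav = learn_cav(bottleneck(concept_x), bottleneck(random_x))
score = tcav_score(class_x, cav)
print(f"TCAV score: {score:.2f}")
```

The concept and random examples are ordinary rows of tabular data; only their internal activations are used to learn the direction, which is why no embedding layer is needed.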

NicoliAraujo commented 5 years ago

@NicoliAraujo, are those consumer profile groups part of the input to the network that predicts the probability of paying debt? If they already are input features, then TCAV isn't necessary, and you could instead use any input saliency measure (path integrated gradients, Shapley values, ...)

Well, these profiles are actually examples of the input data fed to the network. I'm already going to use clustering, but I'd like to test this technique (just for fun, out of curiosity) for comparison. Also, clustering doesn't show the connection between a net's output and a cluster, as TCAV possibly could.

arita37 commented 5 years ago

@James:

Thanks for the explanation. I suppose the number of concept samples needed depends on the model capacity.

Did you try running TCAV on a tabular dataset (e.g., a Kaggle one)?

BeenKim commented 5 years ago

Re TCAV vs PIG and SHAP:

TCAV gives global explanations (it explains an entire class rather than individual examples), while PIG and SHAP are local (they explain individual examples). Per @NicoliAraujo's use case, it sounds like you want a global explanation. If that's the case, TCAV (or any other global method) is what you need.

BeenKim commented 5 years ago

@arita37 TCAV has been used successfully on several kinds of categorical data (e.g., NLP). They used this dataset https://github.com/conversationai/perspectiveapi/blob/master/api_reference.md#models to test the fairness of some NLP models. We also tried EEG patient data (just a different form of data).

arita37 commented 5 years ago

Hello,

Thanks. For tabular data, does it require a CNN-type network, or is an MLP fine?

How do we project back into the tabular space?

BeenKim commented 5 years ago

Any network with some embeddings (some intermediate bottleneck) would work with TCAV. If you are using a CNN, we typically flatten the 3D tensor to learn the CAV, then reshape it to the right dimensions.
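The flatten-then-reshape step can be sketched as below. The activation shape and the difference-of-means "CAV" are made up for illustration (a real CAV would come from a trained linear classifier); the point is only that a direction learned in flattened space maps back to the layer's shape with a plain reshape.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conv-layer activations: (batch, height, width, channels).
acts = rng.normal(size=(10, 4, 4, 3))

# Flatten each example's 3D activation tensor into one vector.
flat = acts.reshape(len(acts), -1)  # shape (10, 48)

# Stand-in for a learned CAV: a direction in the flattened space.
cav_flat = flat[:5].mean(axis=0) - flat[5:].mean(axis=0)

# Reshape the direction back to the layer's per-example shape.
cav = cav_flat.reshape(acts.shape[1:])  # shape (4, 4, 3)

# A directional derivative is identical in either layout.
grad = rng.normal(size=acts.shape[1:])
assert np.isclose(np.sum(grad * cav), grad.ravel() @ cav_flat)
```

For an MLP the activations are already vectors, so the flatten and reshape steps are no-ops.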

What do you mean by project back into tabular space? Do you mean tabular data as concept data or a network trained using tabular data? Either way, you can use the same simple reshaping. Unless I'm missing something?

arita37 commented 5 years ago

Let’s say we have tabular data (e.g., the input of a recommender). Suppose the first layer is an embedding, followed by a normal NN.

Question: is there a way to get the concept data back as a tabular vector?

1) My understanding is that concept vectors are numerical (the embedded version)?

2) In some way, can a “concept vector” be seen as a form of semi-supervised “clustering”?

BeenKim commented 5 years ago

Let's be concrete for the sake of clarity. Say you have a movie recommender. Each data point is a (person id, movie) pair. What do you mean by "get back as concept data, tabular vector"? What do you mean by numerical? Do you mean that the feature has to take numerical values (e.g., age)? I think 2) is a different set of questions, and I'm not sure what you mean. Can you give examples of what you are asking?

arita37 commented 5 years ago

Let's take this example.

An MLP neural network. The input is: User_vector: (0,0,...1,0,0), Movie_vector: (0,0,...,0,1)

The output is a binary softmax (y=1 if the user saw the movie).

In this case, how do we interpret the concept vector?

For example, can it implicitly “capture” the “genre” of movies?

By the way, TCAV looks like a very interesting concept. Thanks!

BeenKim commented 5 years ago

Got it! The input/output pair looks fine as long as the network has some embedding layer where a CAV can be learned. TCAV comes with statistical testing to help you decide whether it managed to capture the concept (e.g., genre) or not (the TCAV score output also prints a p-value for the statistical test). If it passes the test, the network probably learned the concept you want it to learn. If it doesn't, it's likely that the network didn't care about the concept of genre.
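As a rough illustration of the kind of check involved, the sketch below compares TCAV scores from several concept-vs-random runs against scores from random-vs-random runs. It deliberately swaps in a standard-library permutation test rather than reproducing the library's own test, and the score lists are made-up numbers, so treat it as a sketch of the idea only.

```python
import random
from statistics import mean

def permutation_p_value(concept_scores, random_scores, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference of mean TCAV scores."""
    rng = random.Random(seed)
    observed = abs(mean(concept_scores) - mean(random_scores))
    pooled = list(concept_scores) + list(random_scores)
    n = len(concept_scores)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Count relabelings that separate the groups at least as well as observed.
        if abs(mean(pooled[:n]) - mean(pooled[n:])) >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)

# Made-up TCAV scores, one per run with a different random counterexample set.
concept_scores = [0.82, 0.79, 0.88, 0.91, 0.77, 0.85, 0.80, 0.86]
random_scores = [0.52, 0.47, 0.55, 0.49, 0.44, 0.58, 0.51, 0.46]

p_value = permutation_p_value(concept_scores, random_scores)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the concept direction behaves differently from a random direction; a large one suggests the score could be explained by chance.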

BeenKim commented 5 years ago

Also, here is an NLP application: http://arxiv.org/abs/1908.07336. I'll close this issue for now; feel free to reopen if needed!

arita37 commented 5 years ago

Thank you, this is useful.

Is TCAV a kind of SVM applied to internal NN representations?
