pytorch / captum

Model interpretability and understanding for PyTorch
https://captum.ai
BSD 3-Clause "New" or "Revised" License

Visualization of BertForSequenceClassification is hard to understand #311

Closed · davidefiocco closed this issue 4 years ago

davidefiocco commented 4 years ago

I am training a BERT model on the CoLA task (see also https://github.com/pytorch/captum/issues/303), so that I can classify sentences as grammatically acceptable or not. Take these examples:

"These tests don't work as expected" (grammatically acceptable) "These tests doesn't work as expected" (unacceptable)

If I run the two examples through the classifier, the binary classifier's softmax scores are 0.99395967 and 0.00011 respectively.

For the two cases, I get interpretations via the notebook in the gist https://gist.github.com/davidefiocco/3e1a0ed030792230a33c726c61f6b3a5 (adapted from the SQuAD example in this repo). In particular, I create the visualization with

# comments show VisualizationDataRecord's positional parameter names
score_vis = viz.VisualizationDataRecord(
                        attributions_sum,                             # word_attributions
                        torch.max(torch.softmax(score[0][0], dim=0)), # pred_prob
                        torch.argmax(score[0][0]),                    # pred_class
                        torch.argmax(score[0][0]),                    # true_class
                        text,                                         # attr_class
                        attributions_sum.sum(),                       # attr_score
                        all_tokens,                                   # raw_input_ids
                        delta)                                        # convergence_score

print('\033[1m', 'Visualization For Score', '\033[0m')
viz.visualize_text([score_vis])

In the two cases I get:

[screenshots of the two visualizations, which look nearly identical]

The interpretations look extremely similar in spite of the dramatic change in score, while I would expect the interpretation to change and the model to focus on the grammar mistake (e.g. on the verb, or the noun).

Am I doing something wrong in my implementation (likely!), or is LayerIntegratedGradients not performing great in this example? Can someone suggest viable alternatives to try out?

NarineK commented 4 years ago

@davidefiocco, from the visualization I can see that the attribution label isn't visualized correctly. I think you may not be attributing to the correct output target index. What target index do you use? Is there one?

score_vis = viz.VisualizationDataRecord(
                        attributions_sum,
                        torch.max(torch.softmax(score[0][0], dim=0)),
                        torch.argmax(score[0][0]), # why is this identical to the next argument?
                        torch.argmax(score[0][0]),
                        <THIS SHOULD BE THE TARGET INDEX; IF THERE IS NONE, IT SHOULD BE None>,
                        attributions_sum.sum(),       
                        all_tokens,
                        delta)
NarineK commented 4 years ago

Oh, I just saw the attached notebook. I think the problem is in the output: you might need to provide a target. You might not want to do:

def custom_forward(inputs):
    out = model(inputs)[0][0]
    return out

but rather access the right target from there. The example I gave you was for my model, which is probably very different from yours.

NarineK commented 4 years ago

In my case it is this:

def custom_forward(inputs):
    out = model(*inputs)[0]
    return out

You need to know what you are attributing to what. Are you attributing the cat class or the dog class to the inputs? target is the index of the class (the cat index or the dog index). In your case you can adjust the classes. See how we use target: https://captum.ai/tutorials/Multimodal_VQA_Interpret
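
For illustration, a minimal sketch of passing target directly to the attribution call (the variable names are illustrative, not from the gist; it assumes a forward function that returns the full logits):

# Minimal sketch: let Captum select the output index via `target` instead of
# slicing inside the forward function. Assumes forward_with_logits returns the
# full logits, shape (batch, num_classes), and that
# lig = LayerIntegratedGradients(forward_with_logits, model.bert.embeddings).
attributions, delta = lig.attribute(inputs=input_ids,
                                    baselines=ref_input_ids,
                                    target=0,  # index of the class to attribute
                                    return_convergence_delta=True)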

davidefiocco commented 4 years ago

Thanks @NarineK for your reply! tl;dr: I tried to follow your suggestion and also looked into https://captum.ai/tutorials/IMDB_TorchText_Interpret, but I am struggling a bit with the results, so I'll leave some notes here to clarify the problem.


I have a binary classification problem (the CoLA task), and I would like to see in captum which features push the (softmax) score of class 0 (i.e. ungrammatical sentence) up. Ideally, the interpretation should highlight the features that make the sentence ungrammatical.

I revised the code and it's visible at https://gist.github.com/davidefiocco/3e1a0ed030792230a33c726c61f6b3a5

The behavior of my model is such that for some input_ids created from the (ungrammatical) sentence
"These tests does not work as expected." I get a tuple

> model(input_ids)
(tensor([[ 4.5192, -4.2522]], grad_fn=<AddmmBackward>),)

which contains the logits ("ungrammatical" and "grammatical" respectively) of my binary classifier.

To get the logits only, I also created a predict method as follows:

def predict(inputs):
    return model(inputs)[0]

so that

> predict(input_ids)
tensor([[ 4.5192, -4.2522]], grad_fn=<AddmmBackward>)

To use the output of the model in captum, I thus have a (modified) custom forward that computes the softmax value for the "ungrammatical" class and now reads:

def custom_forward(inputs):
    preds = predict(inputs)
    return torch.softmax(preds, dim = 1)[0][0].unsqueeze(-1) # second [0] because I care about score of 0th class

so that

> custom_forward(input_ids)
tensor([0.9998], grad_fn=<UnsqueezeBackward0>)

I would like to see which features of my input drove that score (aka the score of my 0th class) close to 1.
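
For context, in the notebook this custom_forward is wired into LayerIntegratedGradients roughly as below (a sketch: model.bert.embeddings is the usual embeddings submodule of a HuggingFace BertForSequenceClassification, and ref_input_ids is the reference input built from padding tokens; both are assumptions about the gist). Note that the discussion below eventually replaces the [0][0].unsqueeze(-1) indexing with batched [:, 0] indexing.

from captum.attr import LayerIntegratedGradients

# Sketch: attribute the class-0 softmax score to the input embeddings
lig = LayerIntegratedGradients(custom_forward, model.bert.embeddings)
attributions, delta = lig.attribute(inputs=input_ids,
                                    baselines=ref_input_ids,
                                    return_convergence_delta=True)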

After computing attributions and summarizing them, I invoke viz.VisualizationDataRecord, but I am not sure I am doing this all right:

score = predict(input_ids)  # thus equal to tensor([[ 4.5192, -4.2522]], grad_fn=<AddmmBackward>)

score_vis = viz.VisualizationDataRecord(
                        attributions_sum,
                        torch.softmax(score, dim = 1)[0][0], # probability of my target class
                        torch.argmax(torch.softmax(score, dim = 1)[0]), # the predicted class
                        0, # not sure what I should put here
                        text,
                        attributions_sum.sum(),       
                        all_tokens,
                        delta)

Not sure if anyone sees something obviously not OK with all of the above. The full version of the code is at https://gist.github.com/davidefiocco/3e1a0ed030792230a33c726c61f6b3a5

NarineK commented 4 years ago

Hi @davidefiocco, yes, you can do it that way as well, if you only care about the 0 (ungrammatical) class. Our visualization functions are just examples. You can also print out the scores and see if the visualization matches them: the higher the score, the more important the feature is for your output. Also, perhaps you want to compute attribution without applying the softmax; because of the softmax you might get a relatively small range of attribution scores.
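
For example, printing the per-token scores next to the tokens is a quick sanity check (a sketch using the attributions_sum and all_tokens variables from the gist):

# Sketch: print token / attribution pairs to compare against the visualization
for token, score in zip(all_tokens, attributions_sum.tolist()):
    print(f"{token:>12}  {score:+.4f}")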

davidefiocco commented 4 years ago

I removed the softmax from the custom_forward to make it read

def custom_forward(inputs):
    preds = predict(inputs)
    return preds[0][0].unsqueeze(-1)

The problem still persists though: my interpretations change little with the input, while the score varies wildly :(

NarineK commented 4 years ago

@davidefiocco, the attribution could be correct, but the visualization is tricky, because I can see that the first example has a much larger negative attribution. The word "fails" is positive, but probably with a very small positive value. Attribution is not perfect, and it looks like it is a bit confused by the word "fails". When we normalize attributions, the color coding also gets adjusted. You can try to normalize the attributions across those 2 samples. Also, looking into the distribution of attributions for each embedding vector can be helpful.

The normalization happens here:

attributions = attributions / torch.norm(attributions)

I'd try without normalization, or try a different normalization technique.
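
For instance, a summarization helper with the normalization made optional could look like this (a sketch; the max-abs scaling is just one alternative, and the function name is illustrative rather than the gist's):

def summarize_attributions(attributions, normalize=True):
    # Collapse the embedding dimension into one score per token
    attributions = attributions.sum(dim=-1).squeeze(0)
    if normalize:
        # Alternative to dividing by the L2 norm: scale by the max absolute
        # value so that scores fall in [-1, 1]
        attributions = attributions / attributions.abs().max()
    return attributions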

NarineK commented 4 years ago

@davidefiocco, have you made any progress here ?

davidefiocco commented 4 years ago

Hi @NarineK thanks a lot for following up here! I had tried eliminating normalization, but that didn't solve it. I'll try to give it another shot and will update the issue here with my findings.

davidefiocco commented 4 years ago

Hi @NarineK , I hope you're well, apologies for the delay.

I could not solve the problem, but I am now more convinced that there is some bug in my code.

The good news is that my code is now perfectly reproducible, at least: for debugging, I borrowed a (binary) sentiment classifier whose weights are publicly downloadable from HuggingFace's great collection at https://huggingface.co/lvwerra/bert-imdb (trained on IMDB).

Sentiment can be reversed simply by changing one word, and so it should be easier to debug.

The notebook is runnable without modifications by making a copy of what you find at https://colab.research.google.com/drive/1snFbxdVDtL3JEFW7GNfRs1PZKgNHfoNz. If one doesn't have (or want to use) a Google account to run Colab, the notebook should be executable locally on a Linux machine with wget and pip installed.

The resulting visualization is:

[screenshot of the visualization]

which seems buggy to me. I'll try to look at the code more closely, but if you have hints that would be totally welcome!

P.S.: I put the same code on https://gist.github.com/davidefiocco/40a1395e895174a4e4d3ed424a5d388a also, for reference.

NarineK commented 4 years ago

@davidefiocco, try to change the text a little bit and see how your attribution changes. This is actually a very interesting example: it shows that the network is prone to adversarial attacks and is possibly easy to fool. Try removing the dot at the end of the sentence and see how the attribution changes. Now, try something like this: text = "The movie was one of those amazing movies you can not forget" or this: text = "The movie was one of those amazing movies". Does the attribution start to make more sense?
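
A quick way to probe such perturbations side by side (a sketch, assuming the tokenizer and the predict function from the notebook are in scope):

# Sketch: compare positive-class probabilities across small text perturbations
variants = [
    "The movie was one of those amazing movies you can not forget.",
    "The movie was one of those amazing movies you can not forget",  # no final dot
    "The movie was one of those amazing movies",
]
for text in variants:
    input_ids = tokenizer.encode(text, return_tensors="pt")
    probs = torch.softmax(predict(input_ids), dim=1)
    print(f"{probs[0, 1].item():.4f}  {text}")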

NarineK commented 4 years ago

Try also to increase n_steps and observe the delta:

attributions, delta = lig.attribute(inputs=input_ids,
                                    baselines=ref_input_ids,
                                    n_steps=700,
                                    internal_batch_size=3,
                                    return_convergence_delta=True)
delta
NarineK commented 4 years ago

From what I see, you are attributing to the negative class. The positive (good recommendation) class would be:

def custom_forward(inputs):
    preds = predict(inputs)
    return torch.softmax(preds, dim = 1)[0][1].unsqueeze(-1)
davidefiocco commented 4 years ago

Hi @NarineK, thanks!

I tried to change only the input (by shortening the sentence) and attribute the negative label (so no other change in the notebook, just the input). I get:

[screenshot: visualization for the shortened positive "micro-review"]

And this is what I get with a similar, but negative, "micro-review":

[screenshot: visualization for the negative "micro-review"]

I computed attributions and convergence deltas for various values of n_steps, and by eyeballing the results convergence seems to be reached at ~200 steps:

[plot of convergence delta vs. n_steps]

Increasing n_steps beyond the default value of 50 doesn't dramatically change the interpretation I visualize though.

Also, as for your question: indeed, the custom forward returning the softmax score for the positive class would be written as

def custom_forward(inputs):
    preds = predict(inputs)
    return torch.softmax(preds, dim = 1)[0][1].unsqueeze(-1)

But playing with that definition also doesn't yield results that make intuitive sense to me (I would like the interpretation to focus on the adjective, which sets the sentiment of the review).

NarineK commented 4 years ago

Sorry, closed by accident! I attributed to the good recommendation class, which was predicted with 0.96 probability, and this is what I get.

[screenshot: bert_example visualization]

I'm seeing different results than what you see. Does this make sense?

davidefiocco commented 4 years ago

Ha!

Mmm, I am not convinced :) The correct spelling in English for the sentence is

"The movie was one of those amazing movies"

You used instead

"The movie was one of those amaizing movies"

If I modify the custom forward to give the softmax score for the positive class i.e.

def custom_forward(inputs):
    preds = predict(inputs)
    return torch.softmax(preds, dim = 1)[0][1].unsqueeze(-1)

For the first and second sentences I get, respectively, 0.9989 (so a very positive rating) and 0.0413 ("amaizing" is not an English word, but the model goes for a negative sentiment).

I do visualize with

score_vis = viz.VisualizationDataRecord(attributions_sum,
                                        torch.softmax(score, dim = 1)[0][1],
                                        torch.argmax(torch.softmax(score, dim = 1)[0]),
                                        0,
                                        text,
                                        attributions_sum.sum(),       
                                        all_tokens,
                                        delta)

but the viz is still a bit different from yours, and looks like this:

[screenshot of the resulting visualization]

Can you try rerunning with the "amazing" :D spelling?

From my attempts, that HuggingFace model is pretty good at getting the sentiment score right. It looks to me like my attributions (or the viz) don't quite cut it.

NarineK commented 4 years ago

Oh, what are the classes? I think I got confused about the task: 0 -> the index of misspelled, 1 -> the index of correctly spelled.

Is this a spelling-correctness model or a movie recommendation model?

NarineK commented 4 years ago

[screenshot: visualization when attributing to class 1]

This is what I get when I attribute to class 1

davidefiocco commented 4 years ago

Hi @NarineK thanks for looking into this!

I switched to another binary classifier, a sentiment one trained on IMDB, as its weights are easy to find online. See what I wrote above in https://github.com/pytorch/captum/issues/311#issuecomment-606912060

Your interpretation looks totally legit to me, brilliant! Can you share what you changed from my notebook in order to get it?

I understand that the custom forward should read

def custom_forward(inputs):
    preds = predict(inputs)
    return torch.softmax(preds, dim = 1)[0][1].unsqueeze(-1)

But then how do you get the rest (e.g. attribution and visualization)? Can you post snippets of the diffs?

NarineK commented 4 years ago

Hi @davidefiocco, sorry, I forgot to reply. Can you access this link? https://colab.research.google.com/drive/1Lw3JTZio03VwPvSVFzLJmZ52oBRpo9ZM Let me know if you can't access it.

davidefiocco commented 4 years ago

Hi @NarineK, I could access and run your code, and I tried a couple of positive sentences as examples; it seems to work OK, thanks a million! I have saved a "frozen" copy of it at https://gist.github.com/davidefiocco/47137f6eb7e3351c9bac4580c2ccc9d4, as I understand you may keep working on the original in the future.

Here's some follow-up:

  1. From what I see, when attributing, it's key to play with the parameter internal_batch_size=3 (together with a high number of n_steps). Can you explain why that may be the case?

  2. Will your representation work to highlight negative sentiment as well? With your code, changing the sentence to negative by swapping the adjective, I get a visualization [screenshot omitted] for which I would have expected a stronger negative (i.e. stronger red) attribution to the adjective.

  3. How can I attribute the negative (0) label? I guess I need to change the custom forward into

def custom_forward(inputs):
    preds = predict(inputs)
    return torch.softmax(preds, dim = 1)[0][0].unsqueeze(-1) #  changed here

and the viz syntax

score_vis = viz.VisualizationDataRecord(attributions_sum,
                                        torch.softmax(score, dim = 1)[0][0], #  changed here
                                        torch.argmax(torch.softmax(score, dim = 1)[0]),
                                        0, #  changed here
                                        text,
                                        attributions_sum.sum(),       
                                        all_tokens,
                                        delta)

Can you confirm this? I tried to attribute a negative example (simply by putting a negative adjective in the sentence to evaluate), but to no avail.

Thanks!

NarineK commented 4 years ago

Hi @davidefiocco ,

  1. internal_batch_size is purely for memory management reasons: it makes sure that small chunks of the inputs are processed at a time. n_steps is the number of integral approximation steps; the higher that number, the better the approximation, but if you set it too high you need to adjust memory usage with internal_batch_size, and execution will take longer (see the sketch after this list).

  2. Yes! That's right

  3. The custom_forward function looks right. The torch.argmax(torch.softmax(...)) part is tricky here: you need to make sure that the negative score has the max value, or don't take the max at all. Other than that it looks right to me.
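
To make point 1 concrete, one could sweep n_steps and watch how the convergence delta behaves (a rough sketch, assuming the lig, input_ids, and ref_input_ids variables from the notebook):

# Sketch: check how the convergence delta behaves as n_steps grows
for n in (50, 100, 300, 700):
    _, delta = lig.attribute(inputs=input_ids,
                             baselines=ref_input_ids,
                             n_steps=n,
                             internal_batch_size=3,
                             return_convergence_delta=True)
    print(n, delta.abs().max().item())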

davidefiocco commented 4 years ago

Hi @NarineK !

On point 1 though, I would like to flag that with constant n_steps, changing internal_batch_size seems to affect the results. With

attributions, delta = lig.attribute(inputs=input_ids,
                                    baselines=ref_input_ids,
                                    n_steps=300,
                                    internal_batch_size=1,
                                    return_convergence_delta=True)

I get one result:

[screenshot of the visualization with internal_batch_size=1]

But with the exact same code, setting internal_batch_size=3, the result of the viz is instead:

[screenshot of the visualization with internal_batch_size=3]

On point 3, can you give some extra hints on what the third argument of viz.VisualizationDataRecord should be when attributing the negative (0) class?

NarineK commented 4 years ago

Hi @davidefiocco,

That's interesting! Thank you for bringing up point 1; it might be a bug, we'll take a look into it. I think that instead of doing argmax you should use, for negative,

torch.softmax(score, dim = 1).squeeze(0)[0]

and for positive

torch.softmax(score, dim = 1).squeeze(0)[1]

What do you think?

davidefiocco commented 4 years ago

Hi @NarineK, thanks again.

Dealing with the negative label and passing a sentiment-negative sentence with

def custom_forward(inputs):
    preds = predict(inputs)
    return torch.softmax(preds, dim = 1)[0][0].unsqueeze(-1) #  changed here

and

# storing couple samples in an array for visualization purposes
score_vis = viz.VisualizationDataRecord(attributions_sum,
                                        torch.softmax(score, dim = 1)[0][0], #  changed here
                                        torch.softmax(score, dim = 1).squeeze(0)[0],
                                        0, #  changed here
                                        text,
                                        attributions_sum.sum(),       
                                        all_tokens,
                                        delta)

This gives me a visualization [screenshot omitted] in which I would have expected the word "terrible" to show up with a greener shade of green.

The word importance thus doesn't look very convincing to me :(

About the difference in results when varying the internal_batch_size parameter, let me know if I should file/refer to a separate issue!

NarineK commented 4 years ago

Hi @davidefiocco, the custom_func was unnecessarily complicated, and it led to problems with different batch sizes: selecting the first element and unsqueezing was messing things up, since with internal_batch_size set, Captum feeds the forward function chunks of the n_steps-expanded input, and the forward needs to return one score per row rather than just the first row's score. Here is a simpler version:

def custom_forward(inputs):
    preds = predict(inputs)
    return torch.softmax(preds, dim = 1)[:, 0]   # for negative attribution
    # return torch.softmax(preds, dim = 1)[:, 1] # for positive attribution
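
A quick sanity check of the shape contract (a sketch with a hypothetical 3-row chunk, mimicking what Captum passes to the forward function when internal_batch_size is set):

# Sketch: custom_forward should map a (batch, seq_len) chunk to a (batch,) output.
# The old preds[0][0].unsqueeze(-1) always returned the first row's score only.
chunk = input_ids.repeat(3, 1)      # pretend Captum handed us a chunk of 3 rows
print(custom_forward(chunk).shape)  # expect torch.Size([3])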

In terms of your example with terrible movies, you can get better attribution if you increase the number of steps as I show in the notebook below. https://colab.research.google.com/drive/1Lw3JTZio03VwPvSVFzLJmZ52oBRpo9ZM

Note that attribution is not a perfect tool: it can help you to better understand feature importance, but it does not solve all problems of feature importance.

Hope this helps!

NarineK commented 4 years ago

@davidefiocco, can we close this issue if we are done with the questions? Thank you!

davidefiocco commented 4 years ago

Sure, thanks!

It may be useful to consolidate all of the above into a tutorial example, as this could save other users some time.

R-icntay commented 2 years ago

Hello @NarineK,

Would you mind re-sharing the notebook below? I would really appreciate it. It seems that the link is broken.

Thank you.

> In terms of your example with terrible movies, you can get better attribution if you increase the number of steps as I show in the notebook below. https://colab.research.google.com/drive/1Lw3JTZio03VwPvSVFzLJmZ52oBRpo9ZM

davidefiocco commented 2 years ago

> Hello @NarineK,
>
> Would you mind re-sharing the notebook below? I would really appreciate it. It seems that the link is broken.

@R-icntay it seems that good old me treasured this in a gist: https://gist.github.com/davidefiocco/47137f6eb7e3351c9bac4580c2ccc9d4, but it lacks the additional improvements that were made later in the discussion. It'd be great if @NarineK could have a look and/or reshare the notebook!

R-icntay commented 2 years ago

@davidefiocco, thanks a lot, honestly!! I was trying to follow your discussion and could not find code that matches some of it. I also found this: https://colab.research.google.com/drive/1pgAbzUF2SzF0BdFtGpJbZPWUOhFxT2NZ#scrollTo=X-nyyq_tbUDa

as mentioned here: https://github.com/pytorch/captum/issues/150#issuecomment-665662534

Could that help?

And honestly, @davidefiocco @NarineK, this was a great thread; it should have made it into a blog post or something 🙂. So thank you all!