pytorch / captum

Model interpretability and understanding for PyTorch
https://captum.ai

'pad' word/token used instead of special token '<pad>' (for padding) in tutorial #653

Open elixir-code opened 3 years ago

elixir-code commented 3 years ago

📚 Documentation

Tutorial

https://captum.ai/tutorials/IMDB_TorchText_Interpret

Libraries used

Issue

The token for the ordinary word 'pad' is used instead of the special token '<pad>', both for padding sequences that are shorter than the minimum length and as the reference token in the TokenReferenceBase object.

Lines of code with the issue

  1. In the cell number 11 (In [11]):

    PAD_IND = TEXT.vocab.stoi['pad']
  2. In cell number 14 (In [14]):

    text += ['pad'] * (min_len - len(text))
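
For context, PAD_IND from cell 11 is what the tutorial feeds into TokenReferenceBase to build the attribution baseline, so a wrong index propagates directly into the reference sequence. Roughly (a sketch assuming seq_length and device are defined as in the tutorial):

    from captum.attr import TokenReferenceBase

    token_reference = TokenReferenceBase(reference_token_idx=PAD_IND)
    # reference sequence made up entirely of the (intended) padding token
    reference_indices = token_reference.generate_reference(seq_length, device=device).unsqueeze(0)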

Evidence that the special token '<pad>' must be used instead of the word 'pad'

In cell 11 (In [11]), we want to find the index of the token used for padding. Currently, the index computed as PAD_IND is 6978:

>>> PAD_IND = TEXT.vocab.stoi['pad']
>>> PAD_IND
6978

However, the index of the token to be used for padding is actually the index of the token '<pad>', which is 1, as can be inferred by running the following code snippets:

In code snippet 11 (In [11]) from the tutorial used for training the CNN model:

>>> PAD_IND = TEXT.vocab.stoi[TEXT.pad_token]
>>> PAD_IND
1

In code snippet 5 (In [5]) from the tutorial used for training the CNN model:

...
>>> model.embedding.padding_idx
1
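
As a further sanity check (not in the tutorial), the reverse mapping TEXT.vocab.itos confirms that index 6978 is just the ordinary word 'pad', while index 1 is the special padding token, assuming TEXT.vocab is built as in the tutorial:

>>> TEXT.vocab.itos[6978]
'pad'
>>> TEXT.vocab.itos[1]
'<pad>'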

Also, from the following code snippets in the tutorial https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/4%20-%20Convolutional%20Sentiment%20Analysis.ipynb, which was used to train the CNN model used in the tutorial with the issue, we can infer that the '<pad>' token, rather than the 'pad' token, must be used.

In the tutorial used for training the CNN model, cell 7 (In [7]):

PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

Also, in the tutorial used for training the CNN model, cell 18 (In [18]) uses the '<pad>' token rather than the 'pad' token to pad short sentences:

...
tokenized += ['<pad>'] * (min_len - len(tokenized))
...

Suggested changes in the tutorial:

In cell number 11 (In [11]), the changes to be made are:

- PAD_IND = TEXT.vocab.stoi['pad']
+ PAD_IND = TEXT.vocab.stoi[TEXT.pad_token]

In cell number 14 (In [14]), the changes to be made are:

    text = [tok.text for tok in nlp.tokenizer(sentence.lower())]
    if len(text) < min_len:
-        text += ['pad'] * (min_len - len(text))
+        text += ['<pad>'] * (min_len - len(text))
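
Equivalently, the padding line in cell 14 could reuse TEXT.pad_token so that cells 11 and 14 stay consistent. A minimal sketch, assuming nlp, TEXT, and min_len are defined as in the tutorial (the last line only illustrates how the padded tokens are then mapped to indices):

    text = [tok.text for tok in nlp.tokenizer(sentence.lower())]
    if len(text) < min_len:
        text += [TEXT.pad_token] * (min_len - len(text))  # TEXT.pad_token is '<pad>' by default
    indexed = [TEXT.vocab.stoi[t] for t in text]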
elixir-code commented 3 years ago

However, if integrated gradients does not mandate that the zero vector (or the embedding of the padding token) be used as the reference token embedding, and instead allows the embedding of any arbitrary token to be used as the reference, the above issue can be ignored.

bilalsal commented 3 years ago

Hi @elixir-code ,

Captum's LayerIntegratedGradients implementation allows you to define a custom baseline if a zero vector does not fit your problem (see baselines in the arguments to the .attribute() method).

Check out this tutorial for an example of how baselines can be specified.
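
As a rough sketch (not taken verbatim from that tutorial), assuming model, input_indices, and reference_indices are defined as in the IMDB tutorial, a custom baseline can be passed like this:

    from captum.attr import LayerIntegratedGradients

    lig = LayerIntegratedGradients(model, model.embedding)
    attributions, delta = lig.attribute(
        input_indices,                # token indices of the sentence to explain
        baselines=reference_indices,  # custom baseline, e.g. a sequence of <pad> indices
        n_steps=500,
        return_convergence_delta=True,
    )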

Hope this helps

NarineK commented 3 years ago

Hi @elixir-code, Integrated gradients does not mandate a zero vector as the reference / baseline. It can be anything of your choice. Good point regarding pad. To be consistent, I'll make updates based on your suggestions.
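
For reference, the definition of integrated gradients (Sundararajan et al., 2017) makes the baseline an explicit free choice: the reference x' only fixes the starting point of the interpolation path, so a zero vector, a sequence of <pad> embeddings, or any other point of the same shape can serve as the baseline:

$$ \mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\left(x' + \alpha (x - x')\right)}{\partial x_i} \, d\alpha $$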