JoaoLages closed this pull request 3 years ago.
Super excited for this!! Will start reviewing today!
Thank you for your review, @jalammar. Let's make this merge happen! I even think Hugging Face would want all of this that you started! @SSamDav and I will try to address the rest of the comments by the end of the week.
> The one additional question I have, beyond the review comments, is about using t5-small in Ecco_Output_Token_Scores.ipynb. The decoder has 6 layers, but to see the actual output probabilities we have to set layer=6. Curious where the 7th layer comes from if we've not included the embedding layer.
Will investigate this!
Okay, I went over the PR and left some comments. This is awesome work!
The one additional question I have, beyond the review comments, is about using t5-small in Ecco_Output_Token_Scores.ipynb. The decoder has 6 layers, but to see the actual output probabilities we have to set layer=6. Curious where the 7th layer comes from if we've not included the embedding layer.
It seems I was using a previous version of transformers that had a different API (I have now made sure to rerun everything with the version pinned in requirements.txt). Looking at the Ecco_Evolution_of_Selected_Token notebook, we can see that the first layer is the embedding layer, so HuggingFace changed this behaviour to match causal/MLM models, which makes sense. I'll add the necessary changes.
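For reference, here is a minimal sketch (assuming the transformers version pinned in requirements.txt, and written independently of the notebook code) of where the extra index comes from: with `output_hidden_states=True`, `decoder_hidden_states` contains the embedding output at position 0 followed by one entry per decoder block, so t5-small returns 7 entries even though it has 6 decoder layers.

```python
# Minimal sketch: why layer=6 is the final decoder block for t5-small.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

encoder_inputs = tokenizer("translate English to German: Hello world", return_tensors="pt")
decoder_inputs = tokenizer("Hallo Welt", return_tensors="pt")

with torch.no_grad():
    outputs = model(
        input_ids=encoder_inputs.input_ids,
        decoder_input_ids=decoder_inputs.input_ids,
        output_hidden_states=True,
    )

# t5-small has 6 decoder blocks, but 7 hidden states come back:
# index 0 is the embedding output, indices 1..6 are the decoder block outputs.
print(model.config.num_decoder_layers)      # 6
print(len(outputs.decoder_hidden_states))   # 7
```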
First of all, great package, and what a clever way to get gradient explainability for transformer LMs! We (@SSamDav and I) would really like to contribute :)
The goal of this PR is for the LM class to be able to handle enc-dec LMs, masked LMs, and causal LMs. In order to do that, we propose the following main changes to the code:
- Determine how to handle the model from the loaded model itself, using `type(lm.model)` for example.
We made sure to adapt the tests so that they pass. Also, all the notebooks run as intended, and they now work for enc-dec models such as T5! Note that code that isn't used anywhere was not adapted and is probably buggy now, at least with the batch-size changes.
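To illustrate the idea, here is a hypothetical sketch (not the actual PR code; the helper `detect_model_kind` is made up for illustration) of telling the three model kinds apart from the loaded model and its config:

```python
# Hypothetical sketch of classifying a loaded model as enc-dec, causal, or masked LM.
from transformers import AutoModelForSeq2SeqLM

def detect_model_kind(model):
    # Encoder-decoder models (T5, BART, ...) flag it in their config.
    if getattr(model.config, "is_encoder_decoder", False):
        return "enc-dec"
    # Causal LMs typically expose a *LMHeadModel / *ForCausalLM class.
    name = type(model).__name__
    if name.endswith("LMHeadModel") or name.endswith("ForCausalLM"):
        return "causal"
    # Everything else (BERT-style models) is treated as a masked LM here.
    return "mlm"

lm_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
print(detect_model_kind(lm_model))  # "enc-dec" for t5-small
```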