RicRicci22 opened this issue 2 months ago
Hey! Sorry, not sure I understand: the dropout should not be active with model.eval(), no?
Hi @ArthurZucker! Yes, this is correct. The thing I was pointing out is that during training, the dropout acting on the attention weights also rescales the surviving entries by the inverse of the keep probability, 1/(1 - p). On a standard layer, I understand this keeps the output similar in magnitude between training and evaluation.
However, in an attention layer, this causes each row of attention weights to no longer sum to 1. During inference, since the dropout is a no-op, the attention weights do sum to 1, and this discrepancy between train and test is what I think can cause some trouble.
It is like the network is always making inferences on slightly out-of-distribution samples.
Not sure if I explained it better now!
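A tiny numerical illustration of the scaling I mean (made-up numbers, not from a real model):

```python
import torch
import torch.nn.functional as F

# One attention row that sums to 1 after softmax
row = torch.tensor([0.5, 0.3, 0.2])

# Training-mode dropout zeroes some entries and scales the survivors by 1/(1-p),
# so the row generally no longer sums to 1 (e.g. if the 0.2 entry is dropped
# with p=0.1, the survivors become ~[0.556, 0.333, 0.0], summing to ~0.889).
print(F.dropout(row, p=0.1, training=True).sum())

# Eval-mode dropout is a no-op, so the sum is exactly 1 again
print(F.dropout(row, p=0.1, training=False).sum())
```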
That's for sure, but all models are trained that way 😄 I had never thought about this, but then dropout in general would be bad for inference. Feel free to do some benchmarks, I am curious!
Yes, I have also noticed this problem and put together a notebook to demonstrate what is happening. https://colab.research.google.com/drive/10f5pqC4XO5grmP1soT-Yh12-JOFg_i3w?usp=sharing
Due to dropout's behaviour in training, it scales up the surviving softmax outputs by 1/(1 - p). This causes each row of attention probabilities to sum to something other than exactly 1.0 during training, whereas at test/inference time dropout becomes a no-op, so all probabilities add up to 1.0. I think this might be a problem, but I haven't seen it addressed systematically anywhere. This is not new and has been discussed before on the PyTorch issue tracker: https://github.com/pytorch/pytorch/issues/42929
The way I'd solve this is to apply dropout before running softmax, so that after the softmax the probabilities still add up to 1.0.
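For instance, something like this (just a sketch; note that with plain F.dropout the dropped scores become logit 0 rather than being fully masked out):

```python
import torch
import torch.nn.functional as F

def attn_probs(scores: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    # Apply dropout to the raw scores first...
    scores = F.dropout(scores, p=p, training=training)
    # ...then softmax, so every row still sums to exactly 1
    return F.softmax(scores, dim=-1)

scores = torch.randn(2, 4, 8, 8)  # (batch, heads, query_len, key_len)
print(attn_probs(scores, p=0.1, training=True).sum(dim=-1))  # all ~1.0
```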
TBH, if you want, you can open a PR to see if this improves the performance of, let's say, Llama 3 on MMLU for example! That would be relevant to say whether or not this has a potential impact!
Just changing the code and running inference probably won't help, and will most likely make things worse, since the model was trained in a specific way and inference should try to keep that the same. In my mind, the only(?) way to actually test this theory is to train two models and compare them on specific benchmarks. I lack the GPU resources to do so, though. @ArthurZucker, if you have some resources, I'm happy to send a PR and you could help me validate this?
System Info
Transformers version: 4.41.2
Platform: Ubuntu 22.04.4 LTS
Python: 3.10.14
Who can help?
@younesbelkada @ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
When using GPT-2, during training there is a probability of dropout over the attention weights computed in each transformer layer. The dropout acts on the attention weights obtained by applying softmax to Q @ K^T. The problem is that the dropout also rescales the surviving weights by 1/(1 - p), as described in the docs.
This makes it so that the elements in each row do not necessarily sum to 1 (something that would otherwise hold because of the softmax operation). This per se is, I think, not a major problem, but during inference the dropout is disabled, so each row of the attention matrix sums to one, which means that during inference the model is always slightly out of distribution.
Do you think we need another normalization after the dropout? I will put an example script here to show the behavior when the module is in training or in evaluation mode.
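Something along these lines (a minimal toy version of eager attention, not the actual GPT-2 module; the tensor shapes and argument names are just illustrative):

```python
import torch
import torch.nn.functional as F

def attention_row_sums(train: bool, correct_normalization: bool, p: float = 0.1):
    # Toy eager attention: (batch, heads, seq_len, head_dim)
    torch.manual_seed(0)
    q = torch.randn(1, 4, 8, 16)
    k = torch.randn(1, 4, 8, 16)

    scores = q @ k.transpose(-1, -2) / (16 ** 0.5)
    weights = F.softmax(scores, dim=-1)            # rows sum to 1 here

    # Attention dropout applied after the softmax: in training mode it zeroes
    # entries and rescales the survivors by 1/(1-p)
    weights = F.dropout(weights, p=p, training=train)

    if correct_normalization:
        # Optional correction: re-normalize so each row sums to 1 again
        weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-9)

    return weights.sum(dim=-1)                     # row sums

print(attention_row_sums(train=True, correct_normalization=False))   # generally != 1
print(attention_row_sums(train=True, correct_normalization=True))    # ~1
print(attention_row_sums(train=False, correct_normalization=False))  # exactly 1
```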
Here you can change the argument "train" to True or False to simulate training or evaluation mode. I also put in a correction that you can switch on and off with the argument "correct_normalization".
You can see that in training, without the correction, the sum is never 1, while out of training the sum is 1. This is what I was referring to when I said that during inference the model is always slightly out of distribution.
In this example I'm using the eager implementation of attention; I don't know if the behavior is the same when using FlashAttention.
Looking forward to hearing from you.
Expected behavior
The expected behavior is that during both training and evaluation, each row of the attention matrix sums to 1 (even when using attention dropout).