Open karan-uppal3 opened 6 days ago
Hey! Where is the ideal scenario coming from? 🤗 We tried to follow the original implementation on this!
Hello @ArthurZucker! According to the pseudocode given in Figure 15 and Figure 16 of the original paper, the input to the experts does not contain the additional jitter noise.
System Info
Who can help?
@ArthurZucker @younesbelkada
Information
Tasks
Reproduction
The output is
which ideally should give True and the mean difference should be zero.
This is because in `SwitchTransformersTop1Router`, the `hidden_states` are multiplied with jitter noise, which persists even when you pass them to the experts.
https://github.com/huggingface/transformers/blob/e71a01a104dd663c730e494eb0b6467bb51df357/src/transformers/models/switch_transformers/modeling_switch_transformers.py#L159-L161
Expected behavior
Ideally, no jitter noise should be present when passing the input to the experts, returning True and the mean difference as 0.
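
To make the difference concrete, here is a minimal, framework-free sketch of the two behaviours. It is not the library code: `jitter`, `route_leaky`, `route_clean` and `mean_abs_diff` are hypothetical helpers, plain Python lists stand in for tensors, and a row sum stands in for the router's linear layer. The only point it illustrates is whether the noisy values or the original values reach the experts.

```python
import random


def jitter(hidden_states, eps, rng):
    """Scale every value by independent uniform noise drawn from [1 - eps, 1 + eps]."""
    return [[x * rng.uniform(1.0 - eps, 1.0 + eps) for x in row] for row in hidden_states]


def route_leaky(hidden_states, eps, rng):
    """Behaviour described in this issue: the jittered values overwrite the input."""
    hidden_states = jitter(hidden_states, eps, rng)
    logits = [sum(row) for row in hidden_states]  # toy stand-in for the router projection
    return logits, hidden_states  # the experts receive the noisy values


def route_clean(hidden_states, eps, rng):
    """Expected behaviour: noise affects only the routing decision."""
    noisy = jitter(hidden_states, eps, rng)
    logits = [sum(row) for row in noisy]  # routing still sees the jittered values
    return logits, hidden_states  # the experts receive the original values


def mean_abs_diff(a, b):
    """Mean absolute element-wise difference between two nested lists of equal shape."""
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


tokens = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
rng = random.Random(0)  # fixed seed so the run is repeatable

_, leaky_input = route_leaky(tokens, 0.1, rng)
_, clean_input = route_clean(tokens, 0.1, rng)

# Leaky variant: expert input differs from the original tokens.
print(leaky_input == tokens, mean_abs_diff(leaky_input, tokens))
# Clean variant: expert input is identical to the original tokens.
print(clean_input == tokens, mean_abs_diff(clean_input, tokens))
```

In the clean variant the comparison is `True` and the mean difference is exactly zero, which is the result this issue asks for; in the leaky variant the comparison is `False` and the difference is non-zero.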