Open kr-ramesh opened 2 weeks ago
Good catch. I believe the root cause is as you describe: values with the same name but different shapes fool the hooks, causing this gradient mismatch.
Need to think more about a general solution to avoid this.
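A minimal, hypothetical sketch of that failure mode (this is illustrative only, not the library's actual hook code): when the same module is called with inputs whose leading dimension is not the batch dimension, a forward hook that assumes batch-first activations ends up capturing mismatched shapes. The names `captured` and `make_hook` are made up for the example.

```python
import torch
import torch.nn as nn

# Illustrative only: capture forward-pass inputs per module name,
# the way a hook-based per-sample-gradient mechanism might.
captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Code that treats dim 0 of these tensors as the batch dimension
        # gets fooled when the module is reused with an input whose
        # leading dimension is, e.g., a sequence length instead.
        captured.setdefault(name, []).append(inputs[0].detach())
    return hook

emb = nn.Embedding(32, 8)
emb.register_forward_hook(make_hook("relative_attention_bias"))

batch_first = torch.randint(0, 32, (4, 10))       # (batch, seq_len)
not_batch_first = torch.randint(0, 32, (10, 10))  # (query_len, key_len), no batch dim

emb(batch_first)
emb(not_batch_first)

for t in captured["relative_attention_bias"]:
    print(tuple(t.shape))  # (4, 10) vs. (10, 10): dim 0 no longer means "batch"
```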
🐛 Bug
Please reproduce using our template Colab and post the link here
https://colab.research.google.com/drive/1Eu0rxSdbdJbZUBlgJ4wxR5QXJ732bc90?usp=sharing (Some parts of the code are redundant, apologies for that)
To Reproduce
Expected behavior
There should be no gradient mismatch. A deeper dive indicates that the mismatch arises during the computation of the activations, which may have something to do with the forward hooks. The batch size is in the first dimension for all data points. The error appears to be caused by the relative_attention_bias weights in the model: on freezing them, training proceeds as it should. In the huggingface transformers library (see modeling_t5.py), a permute transformation is applied to the output of the relative_attention_bias layer and the result is stored in another variable, which might be what leads to this. A sketch of the freezing workaround is included after this paragraph.
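For reference, a minimal sketch of the freezing workaround mentioned above. The checkpoint name `"t5-small"` is a placeholder, not necessarily the model used in the Colab, and the rest of the training setup is omitted.

```python
from transformers import T5ForConditionalGeneration

# Workaround sketch: freeze every relative_attention_bias parameter so the
# gradient hooks never have to handle them. "t5-small" is illustrative only.
model = T5ForConditionalGeneration.from_pretrained("t5-small")
for name, param in model.named_parameters():
    if "relative_attention_bias" in name:
        param.requires_grad = False
```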
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
How you installed (conda, pip, source): N/A
Additional context