Closed · geniki closed this issue 1 year ago
Hi @geniki Thank you for reporting the issue.
> but the problem should be easy to reproduce with any Longformer + FP16 example
It would be really nice if you could provide an example script that reproduces the issue you reported, especially since you mentioned it should be easy to reproduce.
🙏 Looking forward to it!
> some of which have been fixed one by one
Could you remind me which PRs or commits fixed this issue for the other models 🙏? That would help a lot, thank you.
Thanks for your response @ydshieh. Here are some examples where this issue has been addressed for other models:
https://github.com/huggingface/transformers/pull/20605
https://github.com/huggingface/transformers/pull/18057
https://github.com/huggingface/transformers/pull/19229
https://github.com/huggingface/transformers/pull/17437
I'll try to put together a runnable Longformer example somehow. Do you have any model training tests with small dummy data?
Hi @geniki You can take any dataset on the HF Hub (one for the specific task you are working on) and select a subset of it (say, the first 1024 examples).
However, since you already know of some fixes (from your comment above), would you like to try experimenting with a fix for this model (using your own dataset, potentially a subset) and open a PR ❤️? If not, no worries, but in that case, as I mentioned, a script that reproduces the issue would be really nice 👍
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers 4.20 / transformers 4.21; Ubuntu 20; Python 3.8
Who can help?
@ydshieh
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
Apologies, I'm using my own dataset, but the problem should be easy to reproduce with any Longformer + FP16 example. Upgrading from transformers 4.20 to 4.21 causes Longformer training loss to stay stuck around its initial value. With transformers 4.20 + FP16, or with transformers >= 4.21 + FP32, training loss declines as expected.
https://github.com/huggingface/transformers/pull/17306 seems to be what caused this. You can see on that PR that it affected other models too, some of which have been fixed one by one. Longformer is still affected as of transformers 4.26.
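For what it's worth, my (unverified) reading of that PR and the per-model fixes is that masking constants moved from a moderate value like -10000 to the dtype minimum, which can overflow when a mask is effectively applied more than once in half precision: the sum exceeds the fp16 range, becomes -inf, and a fully masked attention row then softmaxes to NaN. A standalone numpy sketch of that arithmetic (not Longformer's actual code):

```python
import numpy as np

def softmax(x):
    # numerically "stabilized" softmax; still breaks on an all -inf row
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

fp16_min = np.finfo(np.float16).min            # -65504.0, the dtype-min mask value
scores = np.zeros(4, dtype=np.float16)

# A moderate mask constant survives being applied twice:
old_style = (scores - np.float16(10000)) - np.float16(10000)
print(old_style)                               # stays finite (-20000 each)

# The dtype-minimum mask overflows as soon as it is added to masked scores:
new_style = (scores + fp16_min) + fp16_min
print(new_style)                               # -inf everywhere
print(softmax(new_style))                      # NaNs: the fully masked row breaks
```

If this is indeed the mechanism here, the per-model fixes above avoid the double application (or clamp the masked scores), which is probably what a Longformer fix would need as well.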
Expected behavior
Be able to train Longformer with FP16 precision on recent versions of transformers.