This is essentially a port of work originally submitted as a contribution to transformers:
https://github.com/huggingface/transformers/pull/31129/
The original contribution has not been merged yet, but it shows lower memory usage and better performance on XLA. So I think it's worth adding it here.