johnml1135 opened 9 months ago
This seems like it would be pretty useful for our purposes, since we're training large models. The PyTorch blog reports GPU memory savings of 20%-110% and speedups of 10%-70% during training, plus speedups of 5%-20% during inference. I'd like to test this in SILNLP, but SILNLP is still on torch 1.10, while the accelerated attention mechanism in BetterTransformer requires torch 2.0. I think the best approach is for me to make a new branch of SILNLP with torch 2.0 and test BetterTransformer there. If it does provide a significant speed/memory improvement for our models, we'd presumably want to upgrade the master branch of SILNLP to torch 2.0 and BetterTransformer as well.
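For context, the operation BetterTransformer accelerates is standard scaled dot-product attention. A minimal pure-Python sketch of the computation (toy sizes, single head, no batching; nothing here is SILNLP code):

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Toy single-head attention: Q, K, V are lists of d-dim row vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Dot each query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output row is the weight-averaged combination of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

The fused kernels compute exactly this, but in a single GPU kernel instead of separate matmul/softmax/matmul launches, which is where the speedups come from.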
The upgrades from BetterTransformer will be incorporated into future versions of transformers and should eventually cover the M2M100 model that we use.
This should happen automatically once the support lands in Transformers natively.
https://pytorch.org/blog/out-of-the-box-acceleration/
Are we utilizing the "fused kernels from FlashAttention and Memory-efficient attention"? Can we? We may be able to get significant speedups, or save significant memory, that way.
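To illustrate where the memory savings come from: the memory-efficient/FlashAttention kernels avoid materializing the full attention score matrix by computing the softmax "online", one key/value at a time. A pure-Python sketch of that idea for a single query row (illustration only, not the actual kernel):

```python
import math

def streaming_attention_row(q, K, V):
    """FlashAttention-style online softmax for a single query row:
    processes one key/value pair at a time, never storing the full
    score row, yet produces the same result as naive attention."""
    d = len(q)
    m = float("-inf")        # running max of scores seen so far
    denom = 0.0              # running softmax denominator
    acc = [0.0] * len(V[0])  # running weighted sum of value rows
    for k, v in zip(K, V):
        s = sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
        m_new = max(m, s)
        # Rescale previous accumulators to the new running max.
        scale = math.exp(m - m_new)
        w = math.exp(s - m_new)
        denom = denom * scale + w
        acc = [a * scale + w * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / denom for a in acc]
```

The real kernels process blocks of keys rather than one at a time and fuse everything into one GPU kernel, but the memory argument is the same: O(sequence length) working state instead of an O(n²) score matrix.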