johnml1135 opened 9 months ago
This seems like it would be pretty useful for our purposes, since we're training large models. The PyTorch blog reports GPU memory savings of 20%-110% and speedups of 10%-70% during training, plus speedups of 5%-20% during inference. I'd like to test this in SILNLP, but SILNLP is still on torch 1.10, while the accelerated attention mechanism in BetterTransformer requires torch 2.0. I think the best approach is for me to make a new branch of SILNLP with torch 2.0 and test BetterTransformer there. If it does provide a significant speed/memory improvement for our models, we'd presumably want to upgrade the master branch of SILNLP to torch 2.0 and BetterTransformer as well.
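For context, the operation BetterTransformer accelerates is standard scaled dot-product attention. A minimal pure-Python sketch of the computation (toy sizes, single head, no batching; nothing here is SILNLP code):

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Toy single-head attention: Q, K, V are lists of d-dim row vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Dot each query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output row is the weight-averaged combination of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

The fused kernels compute exactly this, but in a single GPU kernel instead of separate matmul/softmax/matmul launches, which is where the speedups come from.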
The upgrades from BetterTransformer will be incorporated into future versions of transformers and should eventually cover the M2M100 model that we use.
This should happen automatically once the support lands in Transformers natively.
https://pytorch.org/blog/out-of-the-box-acceleration/
Are we utilizing the "fused kernels from FlashAttention and Memory-efficient attention"? Can we? We may be able to get significant speedups, or save significant memory, that way.
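To illustrate where the memory savings come from: the memory-efficient/FlashAttention kernels avoid materializing the full attention score matrix by computing the softmax "online", one key/value at a time. A pure-Python sketch of that idea for a single query row (illustration only, not the actual kernel):

```python
import math

def streaming_attention_row(q, K, V):
    """FlashAttention-style online softmax for a single query row:
    processes one key/value pair at a time, never storing the full
    score row, yet produces the same result as naive attention."""
    d = len(q)
    m = float("-inf")        # running max of scores seen so far
    denom = 0.0              # running softmax denominator
    acc = [0.0] * len(V[0])  # running weighted sum of value rows
    for k, v in zip(K, V):
        s = sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
        m_new = max(m, s)
        # Rescale previous accumulators to the new running max.
        scale = math.exp(m - m_new)
        w = math.exp(s - m_new)
        denom = denom * scale + w
        acc = [a * scale + w * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / denom for a in acc]
```

The real kernels process blocks of keys rather than one at a time and fuse everything into one GPU kernel, but the memory argument is the same: O(sequence length) working state instead of an O(n²) score matrix.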