NLLB currently supports FlashAttention in HF Transformers. Unfortunately, FlashAttention degrades output quality because it does not properly support padding masks. SDPA (PyTorch's torch.nn.functional.scaled_dot_product_attention) provides an alternative route to attention optimizations: under the hood it can dispatch to FlashAttention or Memory Efficient Attention kernels, and Memory Efficient Attention should support masking. Here is the issue for adding SDPA support to models in Transformers. For a list of currently supported models, check out the Transformers documentation. A good example to follow would be BART, which has a full encoder-decoder architecture. It might also be useful to check out this PR that adds SDPA support to T5, another encoder-decoder model.
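For orientation, here is a minimal sketch of calling PyTorch's SDPA directly with a boolean padding mask. It is not NLLB-specific; the tensor shapes and padding pattern are made up for illustration. The point it demonstrates is the one above: with a non-None mask, SDPA cannot use the FlashAttention kernel and falls back to the Memory Efficient (or math) backend, which does honor the mask.

```python
# Minimal sketch of PyTorch SDPA with a padding mask.
# Shapes, dtype and the padding pattern are illustrative only, not taken from NLLB.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

batch, heads, seq_len, head_dim = 2, 8, 16, 64
q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
v = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)

# Boolean padding mask, broadcastable to (batch, heads, seq_len, seq_len):
# True = attend, False = masked out.
attn_mask = torch.ones(batch, 1, 1, seq_len, dtype=torch.bool, device=device)
attn_mask[0, ..., -4:] = False  # pretend the last 4 tokens of sample 0 are padding

# With a non-None attn_mask the FlashAttention backend is unavailable, so SDPA
# dispatches to the Memory Efficient (or math) backend, which honors the mask.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
print(out.shape)  # (batch, heads, seq_len, head_dim)
```

An SDPA-enabled attention class in Transformers would build the 4D mask from the model's attention_mask and pass it through like this, which is essentially what the BART and T5 implementations referenced above do.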