[Feature request] End-to-end transformer example with flex attention

pytorch-labs / attention-gym

Helpful tools and examples for working with flex-attention

BSD 3-Clause "New" or "Revised" License

470 stars 23 forks source link

I'd love to see a clean example of a transformer that integrates flex attention. I haven't found any samples that do this.

For reference, I have a transformer-based model that uses the TransformerEncoderLayer + TransformerEncoder API. I hacked up a copy-pasted TransformerEncoderLayer class to add a few tweaks (with significant predictive improvement) that could instead be rewritten as a score_mod() function. I'm guessing that using flex attention should result in both cleaner and more performant code. I was hoping that this repo could include a best-practice reference.

pytorch-labs / attention-gym

[Feature request] End-to-end transformer example with flex attention #42