pytorch-labs / attention-gym

Helpful tools and examples for working with flex-attention
BSD 3-Clause "New" or "Revised" License

Shared memory out of resource #14

Open TechxGenus opened 3 months ago

TechxGenus commented 3 months ago

Thanks for sharing this great resource. I'm trying to run some benchmarks with test_mask from examples/flex_attn.ipynb on a single RTX 4090. When I set B=1, H=16, S=2048, D=128, it raises the following error:

triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 131074, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.
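The kernel asks for 131074 bytes (about 128 KiB), but the card allows only 101376 bytes (99 KiB) per block. A rough sketch can show why the hint in the error works. It assumes a simplified flash-attention-style tiling: one Q tile plus one K tile and one V tile for each pipeline stage, stored in a 2-byte dtype (fp16/bf16). The function names and the BLOCK_M/BLOCK_N values are hypothetical. This is not how Triton or PyTorch Inductor actually sizes shared memory, and it is not the configuration flex_attention picked here. Only D=128 and the 101376-byte limit come from the report.

```python
# Illustrative model only (assumption): shared memory ~= one Q tile plus
# num_stages pairs of K/V tiles. Triton's real allocation differs.

RTX_4090_SMEM_LIMIT = 101_376  # bytes, the "Hardware limit" from the error


def estimate_smem_bytes(block_m: int, block_n: int, head_dim: int,
                        num_stages: int, dtype_bytes: int = 2) -> int:
    """Estimate per-block shared memory for a hypothetical tiling."""
    q_tile = block_m * head_dim * dtype_bytes
    # Each pipeline stage buffers one K tile and one V tile.
    kv_tiles = num_stages * 2 * block_n * head_dim * dtype_bytes
    return q_tile + kv_tiles


def fits(smem_bytes: int, limit: int = RTX_4090_SMEM_LIMIT) -> bool:
    """True if the estimate is within the per-block hardware limit."""
    return smem_bytes <= limit


if __name__ == "__main__":
    # Hypothetical configs: larger tiles with a deeper pipeline vs. smaller ones.
    for block_m, block_n, stages in [(128, 64, 3), (64, 64, 2)]:
        need = estimate_smem_bytes(block_m, block_n, head_dim=128,
                                   num_stages=stages)
        print(f"BLOCK_M={block_m} BLOCK_N={block_n} num_stages={stages}: "
              f"{need} bytes, fits={fits(need)}")
```

With D=128, halving BLOCK_M and removing one pipeline stage cuts the estimate from 131072 to 81920 bytes, which is under the limit. Shared memory grows linearly with each tile dimension and with `num_stages`, so shrinking either one lowers it.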
drisspg commented 3 months ago

Thanks for opening this issue! Admittedly, we don't have much CI testing for PyTorch on the 4090. Would you mind creating a minimal repro and posting it as an issue on the PyTorch repo? Feel free to tag me in it.

TechxGenus commented 3 months ago

Thanks. I created it here: https://github.com/pytorch/pytorch/issues/133254