pytorch-labs / attention-gym

Helpful tools and examples for working with flex-attention
BSD 3-Clause "New" or "Revised" License

[flex_attention] Softcap perf questions #22

Open meshtag opened 1 month ago

meshtag commented 1 month ago

I am using this PyTorch-provided script to benchmark flex attention against eager and got the attached results (default_results.txt) on an A100.

I modified the script to change the score_mod function to a soft_cap function (roughly as sketched below) and got the results shown in the table that follows. The full output is attached as well (softcap_results.txt). Speedup is relative to eager (afaict).
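
A minimal sketch of roughly what such a modification looks like, assuming the soft cap was written with the exact torch.tanh (the actual modified script was linked in the issue, so details may differ; the cap value and tensor shapes below are illustrative):

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    SOFT_CAP = 30.0  # illustrative value, not taken from the benchmark script

    def soft_cap(score, b, h, q_idx, kv_idx):
        # Naive soft cap built on the exact torch.tanh; as the replies below
        # point out, this compiles down to the expensive accurate CUDA tanh.
        return SOFT_CAP * torch.tanh(score / SOFT_CAP)

    # The benchmark then runs flex_attention under torch.compile with this score_mod:
    compiled_flex = torch.compile(flex_attention)
    q = k = v = torch.randn(2, 16, 512, 128, device="cuda", dtype=torch.bfloat16)
    out = compiled_flex(q, k, v, score_mod=soft_cap)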

=========================================================FWD Speedups========================================================

| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)       |
|---------|-----------|-------------|------------|----------------|-----------------------------|
| Average |     0.488 |             |            |                |                             |
| Max     |     0.537 | soft_cap    | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)   |
| Min     |     0.385 | soft_cap    | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) |

Is this the expected performance for softcap? Did I miss something?

GPU used: A100

Packages and their versions:

torch                    2.5.0.dev20240812+cu118
pytorch-triton           3.0.0+dedb7bdf33

nvcc version:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Please feel free to let me know if I need to provide more information. Thanks.

joydddd commented 1 month ago

You might want to use an approximate tanh instead of torch.tanh() for soft capping. Check out this example: attn_gym/mods/softcapping.py

Chillee commented 1 month ago

Yes, definitely: you need to use the approx tanh (as in this example: https://github.com/pytorch-labs/attention-gym/blob/5e0d1b8053a19b339ddf7c8015f2c4e02bf5b92c/attn_gym/mods/softcapping.py#L13). The normal tanh in CUDA is extremely expensive.
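
For anyone landing here, a condensed sketch of the approach in the linked softcapping.py: register an approximate tanh as a custom op, give it an Inductor lowering that emits the PTX tanh.approx.f32 instruction, and build the score_mod on top of it. This is a simplified, forward-only illustration; names and details may not match the repo file exactly, and the backward wiring is omitted:

    from functools import partial

    import torch
    from torch import Tensor
    from torch._inductor.lowering import make_pointwise, register_lowering
    from torch._inductor.virtualized import ops

    # Custom op: the eager and fake implementations just fall back to exact tanh.
    @torch.library.custom_op("approx::tanh", mutates_args=())
    def tanh_approx(inp: Tensor) -> Tensor:
        return torch.tanh(inp)

    @tanh_approx.register_fake
    def _(inp: Tensor) -> Tensor:
        return torch.tanh(inp)

    # Inductor lowering: when compiled, emit the hardware approximate tanh
    # instruction instead of the accurate (slow) CUDA tanh.
    def _tanh_approx_lowering(inp):
        fn = partial(ops.inline_asm_elementwise, asm="tanh.approx.f32 $0, $1;")
        return make_pointwise(fn)(inp)

    register_lowering(torch.ops.approx.tanh)(_tanh_approx_lowering)

    def generate_tanh_softcap(soft_cap: float):
        """Returns a score_mod applying soft_cap * tanh(score / soft_cap)."""

        def tanh_softcap(score, b, h, q_idx, kv_idx):
            return soft_cap * tanh_approx(score / soft_cap)

        return tanh_softcap

The point is that inside the generated Triton kernel the tanh becomes a single approximate hardware instruction rather than a call to the expensive accurate tanh.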

SmerkyG commented 1 month ago

Isn't the formula for softcapping incorrect as written in the repo?

https://github.com/pytorch-labs/attention-gym/blob/5e0d1b8053a19b339ddf7c8015f2c4e02bf5b92c/attn_gym/mods/softcapping.py#L68

Seems like it should be

    def tanh_softcap(score, b, h, q_idx, kv_idx):
        return soft_cap * tanh(score / soft_cap)

Chillee commented 1 month ago

@SmerkyG Yeah, that's right haha. Do you want to submit a PR :)

SmerkyG commented 1 month ago

Sorry, I don't even use the repo or have it downloaded; I was just taking a quick look and noticed it!

drisspg commented 1 month ago

fixed