meshtag opened this issue 1 month ago
You might want to use an approximate tanh instead of `torch.tanh()` for soft capping. Check out this example: `attn_gym/mods/softcapping.py`
Yes, definitely, you need to use approx tanh (like in this example: https://github.com/pytorch-labs/attention-gym/blob/5e0d1b8053a19b339ddf7c8015f2c4e02bf5b92c/attn_gym/mods/softcapping.py#L13). Normal tanh in CUDA is extremely expensive.
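The trick in that file (at the linked commit) is to register tanh as a custom op with an Inductor lowering that emits the PTX `tanh.approx.f32` instruction, so compiled runs get the cheap single-instruction hardware approximation while eager falls back to the exact `torch.tanh`. A rough sketch of that pattern (the registration APIs may have shifted between PyTorch versions, so check the linked file for the exact code):

```python
from functools import partial

import torch
from torch import Tensor
from torch._inductor.lowering import make_pointwise, register_lowering
from torch._inductor.virtualized import ops

# Eager fallback: just use the exact tanh.
@torch.library.custom_op("approx::tanh", mutates_args=())
def _tanh_approx(inp: Tensor) -> Tensor:
    return torch.tanh(inp)

@_tanh_approx.register_fake
def _(inp: Tensor) -> Tensor:
    return torch.tanh(inp)

# Under torch.compile, lower the op to the hardware approximation
# instead of the expensive exact tanh.
def _tanh_approx_lowering(inp):
    fn = partial(ops.inline_asm_elementwise, asm="tanh.approx.f32 $0, $1;")
    return make_pointwise(fn)(inp)

register_lowering(torch.ops.approx.tanh)(_tanh_approx_lowering)

soft_cap = 50  # illustrative value

def tanh_softcap(score, b, h, q_idx, kv_idx):
    return soft_cap * torch.ops.approx.tanh(score / soft_cap)
```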
Isn't the formula for softcapping incorrect as written in the repo?
Seems like it should be:

```python
def tanh_softcap(score, b, h, q_idx, kv_idx):
    return soft_cap * tanh(score / soft_cap)
```
@SmerkyG Yeah, that's right haha. Do you want to submit a PR :)
Sorry, I don't even use the repo or have it downloaded; I was just taking a quick look and noticed!
fixed
I am using this PyTorch-provided script to benchmark flex attention against eager and got the attached results (default_results.txt) on an A100.

I modified the script to change the `score_mod` function to a `soft_cap` function (roughly as sketched below) and got the following results. Full result output is attached as well (softcap_results.txt). Speedup is compared with eager (afaict). Is this the expected performance for softcap? Did I miss something?
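A minimal sketch of the change (the `soft_cap` value and tensor shapes here are illustrative, not the exact ones from my run):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

soft_cap = 50  # illustrative value

def soft_cap_mod(score, b, h, q_idx, kv_idx):
    # Logit soft-capping: smoothly bounds scores to (-soft_cap, soft_cap).
    return soft_cap * torch.tanh(score / soft_cap)

compiled_flex_attention = torch.compile(flex_attention)

q, k, v = (
    torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)
    for _ in range(3)
)
out = compiled_flex_attention(q, k, v, score_mod=soft_cap_mod)
```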
GPU used: A100
Packages and their versions:
nvcc version:

Please feel free to let me know if I need to provide more information. Thanks.