Open felipemello1 opened 1 week ago
cc @cpuhrsch @HDCharles I think we could do this with FlexAttention? Flagging just so you are aware there's interest.
@jcaip - Worth a try. Essentially the inputs would have to be quantized and you'd need to dequantize within the score mod (before the softmax). I think at this point only query and key could be quantized, because the values still have to be multiplied by the result of the softmax.
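To make the idea above concrete, here is a rough sketch in plain Python. It is not FlexAttention code and not an implementation anyone in this thread has proposed; the function names (`quantize`, `attention`) and the symmetric per-tensor int8 scheme are assumptions chosen purely for illustration. It quantizes only the query and key, dequantizes the integer scores before the softmax, and leaves the values in floating point.

```python
import math

def quantize(rows):
    """Symmetric per-tensor int8 quantization (illustrative assumption).
    Returns the integer rows and the scale needed to dequantize them."""
    max_abs = max(abs(x) for row in rows for x in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(x / scale))) for x in row] for row in rows]
    return q, scale

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v, quantized=False):
    """Single-head attention over lists of float rows.
    With quantized=True, q and k go through int8 and the scores are
    dequantized before the softmax; v always stays in floating point."""
    d = len(q[0])
    if quantized:
        q_used, q_scale = quantize(q)
        k_used, k_scale = quantize(k)
        dequant = q_scale * k_scale  # undoes both quantization scales at once
    else:
        q_used, k_used, dequant = q, k, 1.0
    out = []
    for q_row in q_used:
        # Integer dot products when quantized, then dequantize and scale.
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) * dequant / math.sqrt(d)
            for k_row in k_used
        ]
        probs = softmax(scores)
        # The softmax output is multiplied against the unquantized values.
        out.append([
            sum(p * v_row[j] for p, v_row in zip(probs, v))
            for j in range(len(v[0]))
        ])
    return out
```

Calling `attention(q, k, v, quantized=True)` and comparing against `attention(q, k, v)` on the same inputs gives a quick feel for how much error the query/key quantization introduces. A real version would express the dequantization step inside a FlexAttention score mod rather than a Python loop.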
Hi all, I saw this tweet and thought I'd share it. The accuracy degradation doesn't look too good, but maybe the speed makes it worth it?
https://x.com/papers_anon/status/1839131401322639805?s=46
To be clear: I am not requesting the feature, just mostly sharing it. Thanks! :)