i will probably not use alibi and xpos as they enforce an exponential decay, and they are both designed for causal and not bidirectional attention
similarly, one write head is only useful in the autoregressive case, for decreasing the amount of keys / values that need to be cached at inference
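(For context, here is a minimal sketch of the "one write head" idea, i.e. multi-query attention, where all query heads share a single key / value head so the inference cache per token shrinks from `heads * dim_head` to just `dim_head`. This is an illustration, not code from this repo; `MultiQueryAttention` and its parameters are made up for the example.)

```python
from torch import nn, einsum

class MultiQueryAttention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64):
        super().__init__()
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(dim, heads * dim_head, bias = False)
        # one shared key / value projection for all query heads ("one write head")
        self.to_kv = nn.Linear(dim, 2 * dim_head, bias = False)
        self.to_out = nn.Linear(heads * dim_head, dim, bias = False)

    def forward(self, x):
        b, n, _ = x.shape
        h = self.heads
        q = self.to_q(x).view(b, n, h, -1).transpose(1, 2)  # (b, h, n, d)
        k, v = self.to_kv(x).chunk(2, dim = -1)              # (b, n, d) each - the only tensors needing caching at inference
        sim = einsum('b h i d, b j d -> b h i j', q * self.scale, k)
        attn = sim.softmax(dim = -1)
        out = einsum('b h i j, b j d -> b h i d', attn, v)
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)
```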
What are other methods that can be used to increase context length?
Have you seen the new blockwise parallel attention? Do you believe it could work? I've been trying to implement it here https://github.com/kyegomez/Blockwise-Parallel-Transformer
@kyegomez yea, that paper would help with memory for sure
but by simply chunking (as in reformer) and recomputing each chunk in the feedforward, you can get the same effect, which amounts to about 10 loc
i also think this paper should have some more follow up exploration. if anyone can find a way to make it scale as well as transformers, you can just use flash attention throughout.
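(For reference, a rough sketch of the chunked feedforward with recomputation described above, assuming PyTorch and using gradient checkpointing for the recompute; `chunked_feedforward` is an illustrative name, not the ~10 lines referred to in the comment.)

```python
import torch
from torch.utils.checkpoint import checkpoint

def chunked_feedforward(ff, x, chunks = 4):
    # ff: any position-wise feedforward module, x: (batch, seq, dim)
    out = []
    for chunk in x.chunk(chunks, dim = 1):
        # recompute this chunk's activations on the backward pass instead of storing them
        out.append(checkpoint(ff, chunk, use_reentrant = False))
    return torch.cat(out, dim = 1)
```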
ok, i'm closing this, as it isn't really an issue
feel free to add to the discussions though
Do you think that combining blockwise-style attention with MEGABYTE-like byte-level encoding/handling would extend the context length even further?
Could we also integrate alibi + xpos to push the context length further? What about integrating one write head as well?