i will probably not use alibi and xpos as they enforce an exponential decay, and they are both designed for causal and not bidirectional attention
similarly, one write head is only useful in the autoregressive case, for decreasing the amount of keys / values that need to be cached at inference
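(For context, here is a minimal sketch of the "one write head" idea, i.e. multi-query attention, where all query heads share a single key / value head so the inference cache per token shrinks from `heads * dim_head` to just `dim_head`. This is an illustration, not code from this repo; `MultiQueryAttention` and its parameters are made up for the example.)

```python
from torch import nn, einsum

class MultiQueryAttention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64):
        super().__init__()
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(dim, heads * dim_head, bias = False)
        # one shared key / value projection for all query heads ("one write head")
        self.to_kv = nn.Linear(dim, 2 * dim_head, bias = False)
        self.to_out = nn.Linear(heads * dim_head, dim, bias = False)

    def forward(self, x):
        b, n, _ = x.shape
        h = self.heads
        q = self.to_q(x).view(b, n, h, -1).transpose(1, 2)  # (b, h, n, d)
        k, v = self.to_kv(x).chunk(2, dim = -1)              # (b, n, d) each - the only tensors needing caching at inference
        sim = einsum('b h i d, b j d -> b h i j', q * self.scale, k)
        attn = sim.softmax(dim = -1)
        out = einsum('b h i j, b j d -> b h i d', attn, v)
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)
```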
What are other methods that can be used to increase context length?
Have you seen the new blockwise parallel attention? Do you believe it could work? I've been trying to implement it here https://github.com/kyegomez/Blockwise-Parallel-Transformer
@kyegomez yea, that paper would help with memory for sure
but by simply chunking (as in reformer) and recomputing each chunk in the feedforward, you can get the same effect, which amounts to about 10 loc
i also think this paper should have some more follow up exploration. if anyone can find a way to make it scale as well as transformers, you can just use flash attention throughout.
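(For reference, a rough sketch of the chunked feedforward with recomputation described above, assuming PyTorch and using gradient checkpointing for the recompute; `chunked_feedforward` is an illustrative name, not the ~10 lines referred to in the comment.)

```python
import torch
from torch.utils.checkpoint import checkpoint

def chunked_feedforward(ff, x, chunks = 4):
    # ff: any position-wise feedforward module, x: (batch, seq, dim)
    out = []
    for chunk in x.chunk(chunks, dim = 1):
        # recompute this chunk's activations on the backward pass instead of storing them
        out.append(checkpoint(ff, chunk, use_reentrant = False))
    return torch.cat(out, dim = 1)
```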
ok, i'm closing this, as it isn't really an issue
feel free to add to the discussions though
Do you think that combining blockwise-style attention with MEGABYTE-like byte-level encoding/handling would extend the context length even further?
Could we also integrate alibi + xpos to push the context length further? What about integrating one write head as well?