Hope you can help with this. I'm trying to implement ring attention on top of the Llama 3 architecture, starting with the blockwise parallel transformer piece. My question: at what point do I break the input sequence into chunks? 1) after projecting the hidden states to Q, K, and V, or 2) before the projection, i.e., prior to self-attention in the block?
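To make the two options concrete, here's a rough single-device sketch of what I mean by each (plain PyTorch, toy dimensions, no RoPE/GQA/causal mask, and no ring communication yet; all helper names are mine):

```python
import torch

# Made-up toy dimensions, just to pin down where the chunking happens.
D_MODEL, N_HEADS, BLOCK = 128, 4, 16
HEAD_DIM = D_MODEL // N_HEADS

wq = torch.nn.Linear(D_MODEL, D_MODEL, bias=False)
wk = torch.nn.Linear(D_MODEL, D_MODEL, bias=False)
wv = torch.nn.Linear(D_MODEL, D_MODEL, bias=False)

def split_heads(t):
    # (batch, seq, d_model) -> (batch, heads, seq, head_dim)
    b, s, _ = t.shape
    return t.view(b, s, N_HEADS, HEAD_DIM).transpose(1, 2)

def blockwise_attn(q_blocks, kv_blocks):
    # Blockwise softmax with a running max and running denominator,
    # so the per-block results combine into the exact full softmax.
    scale = HEAD_DIM ** -0.5
    outputs = []
    for q in q_blocks:
        m = torch.full(q.shape[:-1] + (1,), float("-inf"))  # running max
        l = torch.zeros_like(m)                             # running denominator
        acc = torch.zeros_like(q)                           # running numerator
        for k, v in kv_blocks:
            s = (q @ k.transpose(-2, -1)) * scale
            m_new = torch.maximum(m, s.amax(dim=-1, keepdim=True))
            p = torch.exp(s - m_new)
            corr = torch.exp(m - m_new)  # rescale old stats to the new max
            l = l * corr + p.sum(dim=-1, keepdim=True)
            acc = acc * corr + p @ v
            m = m_new
        outputs.append(acc / l)
    return torch.cat(outputs, dim=2)

def option1(x):
    # Option 1: project the FULL sequence to Q, K, V first,
    # then split the projected tensors into blocks along the seq axis.
    q, k, v = split_heads(wq(x)), split_heads(wk(x)), split_heads(wv(x))
    q_blocks = q.split(BLOCK, dim=2)
    kv_blocks = list(zip(k.split(BLOCK, dim=2), v.split(BLOCK, dim=2)))
    return blockwise_attn(q_blocks, kv_blocks)

def option2(x):
    # Option 2: split the raw hidden states into blocks first,
    # then run the Q/K/V projections on each block separately.
    chunks = x.split(BLOCK, dim=1)
    q_blocks = [split_heads(wq(c)) for c in chunks]
    kv_blocks = [(split_heads(wk(c)), split_heads(wv(c))) for c in chunks]
    return blockwise_attn(q_blocks, kv_blocks)

x = torch.randn(2, 64, D_MODEL)
print(torch.allclose(option1(x), option2(x), atol=1e-5))  # prints True for me
```

Since the projections are per-token, both orders give the same numbers on a single device, so I suspect the real difference only shows up once the sequence is sharded across hosts, and that's exactly the part I'm unsure about.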
Any feedback would be much appreciated :)