haoliuhl / ringattention

Transformers with Arbitrarily Large Context
Apache License 2.0

Llama 3 ring attention implementation for inference #21

Open joshpopelka20gmail opened 3 months ago

joshpopelka20gmail commented 3 months ago

Hope you can help with this. I'm trying to implement ring attention with the Llama 3 architecture, starting with the blockwise parallel transformer piece. My question is: when should I break the input sequence into chunks, 1) after projecting the inputs to Q, K, and V, or 2) before self-attention in the block?

Any feedback would be much appreciated :)
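Note that the Q/K/V projections (and rotary embeddings) act independently on each token position, so the two orderings give the same blocks; what matters is that the sequence is blocked by the time the attention scores are computed. A quick JAX check of that equivalence, with made-up shapes and weights for illustration (not code from this repo):

```python
import jax
import jax.numpy as jnp

seq_len, d_model, num_heads, head_dim, block_size = 32, 64, 4, 16, 8
x = jax.random.normal(jax.random.PRNGKey(0), (seq_len, d_model))        # hidden states
w_q = jax.random.normal(jax.random.PRNGKey(1), (d_model, num_heads * head_dim))

# Option 1: project the whole sequence, then split into blocks.
q_full = (x @ w_q).reshape(seq_len, num_heads, head_dim)
q_blocks_1 = q_full.reshape(seq_len // block_size, block_size, num_heads, head_dim)

# Option 2: split the hidden states first, then project each block locally.
x_blocks = x.reshape(seq_len // block_size, block_size, d_model)
q_blocks_2 = jax.vmap(lambda blk: (blk @ w_q).reshape(block_size, num_heads, head_dim))(x_blocks)

print(jnp.allclose(q_blocks_1, q_blocks_2, atol=1e-5))  # True: both orderings produce the same Q blocks
```

The same holds for K and V, which is why sequence-parallel setups usually shard the hidden states first and let each host project its own block locally.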

haoliuhl commented 2 months ago

Hi, ringattention inference is supported in LWM.
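For context, a generic sketch of the core pattern, not taken from LWM or this repo: each device keeps its own (already projected) query block and the K/V blocks rotate around the device ring with `jax.lax.ppermute`, while a streaming softmax accumulates partial results. The axis name `'sp'`, the shapes, and the absence of causal masking are simplifying assumptions.

```python
import jax
import jax.numpy as jnp
from functools import partial

AXIS = 'sp'  # assumed name for the sequence-parallel device axis

def ring_attention_local(q, k, v, axis_size):
    """q, k, v: this device's local blocks, [block_len, num_heads, head_dim].
    K/V blocks are passed around the ring while each device accumulates
    attention for its own query block via a streaming (online) softmax."""
    block_len, num_heads, head_dim = q.shape
    perm = [(i, (i + 1) % axis_size) for i in range(axis_size)]

    def step(carry, _):
        num, denom, max_score, k_blk, v_blk = carry
        scores = jnp.einsum('qhd,khd->hqk', q, k_blk) / jnp.sqrt(head_dim)
        blk_max = jnp.max(scores, axis=-1, keepdims=True)
        new_max = jnp.maximum(max_score, blk_max)
        corr = jnp.exp(max_score - new_max)              # rescale old accumulators
        exp_scores = jnp.exp(scores - new_max)
        num = num * corr + jnp.einsum('hqk,khd->hqd', exp_scores, v_blk)
        denom = denom * corr + exp_scores.sum(-1, keepdims=True)
        # Send the K/V block we just used to the next device in the ring.
        k_blk = jax.lax.ppermute(k_blk, AXIS, perm)
        v_blk = jax.lax.ppermute(v_blk, AXIS, perm)
        return (num, denom, new_max, k_blk, v_blk), None

    init = (jnp.zeros((num_heads, block_len, head_dim)),
            jnp.zeros((num_heads, block_len, 1)),
            jnp.full((num_heads, block_len, 1), -jnp.inf),
            k, v)
    (num, denom, _, _, _), _ = jax.lax.scan(step, init, None, length=axis_size)
    return (num / denom).transpose(1, 0, 2)              # [block_len, num_heads, head_dim]

# Usage: one local block per device along the sequence axis.
n_dev = jax.local_device_count()
q, k, v = jax.random.normal(jax.random.PRNGKey(0), (3, n_dev, 8, 4, 16))
out = jax.pmap(partial(ring_attention_local, axis_size=n_dev), axis_name=AXIS)(q, k, v)
```

After `axis_size` rotations every query block has seen every K/V block, so the accumulators hold the full-sequence attention output for the local block.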