Great job! I found two problems when trying to reproduce the paper's results.
1. The paper explains that the same positional embedding is used for all context memory units, but in the code implementation I can't find any positional embedding being applied to the cached Ks at all. Am I missing something?
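For reference, here is a minimal sketch of what I expected, i.e. rotating every cached key with one shared position index. This is plain PyTorch with hypothetical names and an interleaved RoPE convention assumed, not the repo's actual code:

```python
import torch

def apply_shared_rope_to_cached_keys(cached_k: torch.Tensor,
                                     shared_pos: int,
                                     rope_base: float = 10000.0) -> torch.Tensor:
    """Rotate every cached key with one shared position index.

    cached_k: (num_units, unit_len, num_heads, head_dim) keys stored without
    positional encoding. All memory units get the *same* position
    `shared_pos`, which is how I read "the same positional embedding for all
    context memory units" in the paper.
    """
    head_dim = cached_k.shape[-1]
    # Standard RoPE inverse frequencies over half the head dimension.
    inv_freq = 1.0 / (rope_base ** (torch.arange(0, head_dim, 2,
                                                 dtype=torch.float32) / head_dim))
    angle = shared_pos * inv_freq                      # (head_dim // 2,)
    cos, sin = angle.cos(), angle.sin()
    k1, k2 = cached_k[..., 0::2], cached_k[..., 1::2]  # interleaved halves
    rotated = torch.empty_like(cached_k)
    rotated[..., 0::2] = k1 * cos - k2 * sin
    rotated[..., 1::2] = k1 * sin + k2 * cos
    return rotated
```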
2. Why is a `chunk size` needed? The proposed method does attention block by block, which (I think) should not cause OOM errors even without the chunking trick in decoding. But I found it fails to process 100K text without setting a `chunk size`, while plain `flash attn` handles the same input without any issue.
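To make my assumption explicit, here is a rough sketch (generic PyTorch with an online softmax, hypothetical names, not the actual implementation) of how I expected block-by-block decoding attention to work. With something like this, only one block of attention scores is ever materialized, so peak memory should stay bounded regardless of total context length:

```python
import torch

@torch.no_grad()
def blockwise_attention(q, k_blocks, v_blocks, scale):
    """Attend one decoding-step query to a long context stored as K/V blocks.

    q: (num_heads, head_dim) query for the current step.
    k_blocks / v_blocks: lists of (block_len, num_heads, head_dim) tensors.
    Uses a running (online) softmax, so only one block of scores exists
    at a time, independent of the number of cached blocks.
    """
    num_heads, head_dim = q.shape
    acc = torch.zeros(num_heads, head_dim)            # weighted-value accumulator
    denom = torch.zeros(num_heads)                    # running softmax denominator
    running_max = torch.full((num_heads,), float("-inf"))

    for k, v in zip(k_blocks, v_blocks):
        # Scores over this block only: (num_heads, block_len).
        scores = torch.einsum("hd,lhd->hl", q, k) * scale
        block_max = scores.max(dim=-1).values
        new_max = torch.maximum(running_max, block_max)
        # Rescale the previous accumulator to the new max, then add this block.
        correction = torch.exp(running_max - new_max)
        acc = acc * correction.unsqueeze(-1)
        denom = denom * correction
        w = torch.exp(scores - new_max.unsqueeze(-1))  # (num_heads, block_len)
        acc = acc + torch.einsum("hl,lhd->hd", w, v)
        denom = denom + w.sum(dim=-1)
        running_max = new_max

    return acc / denom.unsqueeze(-1)

# Example usage with made-up sizes: scale = head_dim ** -0.5.
```

Given that, I'm confused about where the OOM comes from when no `chunk size` is set.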