princeton-nlp / CEPE

[ACL 2024] Long-Context Language Modeling with Parallel Encodings
https://arxiv.org/abs/2402.16617
MIT License

An Issue on Reproducing StreamingLLM #3


Ocean-627 commented 1 month ago

Congratulations on your excellent work! I attempted to run bash scripts/run_streamingllm_lm.sh to reproduce the results of streaming_llm, but I encountered the following error:

TypeError: llama_pos_shift_attention_forward() got an unexpected keyword argument 'padding_mask'

It seems that the original streaming_llm code only works with transformers versions below 4.34.0, but this project requires version 4.34.1. Have you encountered the same issue? If so, do you have any solutions? Thank you very much!
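To illustrate, here is a minimal stand-in (hypothetical names, not the actual transformers or streaming_llm code) of how the mismatch produces this error: transformers 4.34.x started passing a padding_mask keyword into the attention forward, while the monkey-patched pos-shift forward keeps a pre-4.34 signature that does not accept it.

```python
# Minimal stand-in illustrating the mismatch (hypothetical names, not the real
# transformers/streaming_llm code): the patched attention forward has a
# pre-4.34 signature, while transformers >= 4.34 passes `padding_mask`.

def patched_attention_forward(hidden_states, attention_mask=None,
                              position_ids=None, past_key_value=None,
                              output_attentions=False, use_cache=False):
    """Stand-in for the old, pre-4.34-style forward signature."""
    return hidden_states

try:
    # transformers 4.34.x calls the attention module with the extra keyword:
    patched_attention_forward(hidden_states=None, padding_mask=None)
except TypeError as err:
    print(err)  # ... got an unexpected keyword argument 'padding_mask'
```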

howard-yen commented 1 month ago

Thank you for your interest in our work :) I ran into a similar issue as well; apologies for not including the fix in this repo. I resolved it by applying a patch on top of the original streaming_llm repo. To apply the fix:

  1. clone the original repo
  2. replace the file streaming_llm/pos_shift/modify_llama.py with this file; the only changes I made are adding the padding_mask argument to the pos_shift forward function and adding support for FlashAttention2 (see the sketch after this list).
  3. Install this version of streaming_llm from source via python setup.py develop in the base directory. Alternatively, you can just add this new version to your path if you don't want to go through the installation process.
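For reference, a rough sketch of the signature-level change from step 2 (the parameter list below is an approximation; the actual body of the patched modify_llama.py and the added FlashAttention2 path are not reproduced here): the patched forward just needs to accept the new keyword.

```python
from typing import Optional, Tuple
import torch

# Rough sketch of the signature change (approximate parameter list; the
# attention computation itself and the FlashAttention2 support are omitted).
def llama_pos_shift_attention_forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_value: Optional[Tuple[torch.Tensor]] = None,
    output_attentions: bool = False,
    use_cache: bool = False,
    padding_mask: Optional[torch.Tensor] = None,  # new keyword passed by transformers 4.34.x
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
    # ... original pos-shift attention computation unchanged; `padding_mask`
    # only needs to be accepted here (it can be ignored or passed through,
    # depending on the attention implementation).
    ...
```

If you go with the path alternative in step 3, putting the cloned repo's root on PYTHONPATH (or inserting it into sys.path before importing streaming_llm) has the same effect as the editable install.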

This should work with transformers==4.34.1, which is the version I tested it on. There are other libraries (e.g., AttentionSink) that also implement this, but I cannot speak to how they perform compared to the original repo. If you do get a chance to check them out, please let me know how it goes.

Please let me know if you have any other questions :)