Open · sjelassi opened this issue 1 year ago
Memory usage scales with context size: you need a lot of memory just to hold the hidden states and the attention KV cache for 128K tokens.
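To put numbers on that, here is a back-of-the-envelope estimate of the KV-cache footprint alone. This is a minimal sketch assuming the stock Llama-2-7B shape (32 layers, 32 attention heads, head dim 128, no GQA) and fp16 precision; adjust the constants if your checkpoint differs.

```python
# Back-of-the-envelope KV-cache size for Llama-2-7B at long context.
# Shape constants below are the standard Llama-2-7B config.

N_LAYERS = 32       # transformer blocks
N_HEADS = 32        # attention heads (Llama-2-7B uses full MHA, no GQA)
HEAD_DIM = 128      # hidden_size (4096) / N_HEADS
BYTES = 2           # fp16 / bf16

def kv_cache_bytes(n_tokens: int) -> int:
    # Both K and V tensors are cached per layer, hence the factor of 2.
    return 2 * N_LAYERS * N_HEADS * HEAD_DIM * BYTES * n_tokens

print(f"{kv_cache_bytes(128 * 1024) / 2**30:.0f} GiB")  # -> 64 GiB
```

That works out to roughly 0.5 MiB per token, so ~64 GiB of cache before counting the ~13 GiB of fp16 weights and the attention/MLP activations. An OOM at 126k tokens on a single GPU is therefore expected behavior rather than a bug.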
Thanks for your reply! Have you ever tried to do in-context learning with such a large prompt? If so, how did you do it?
I believe ALiBi uses less memory at large context lengths, but you can't use it with LLaMA. mpt-7b-storywriter is a model that uses it, but I haven't found it to be very good.
Hi,
I have been running into out-of-memory issues when trying to generate text with the model "NousResearch/Yarn-Llama-2-7b-128k". I am using a prompt with 126k tokens and running on a single GPU. The script I am using is "eval/prompt-loop.py". I tried setting load_in_4bit = True, but it didn't help (see the loading sketch below).
Do you have any advice for solving this issue?
Thanks!
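For reference, here is a minimal sketch of 4-bit loading with transformers + bitsandbytes. This is a generic illustration, not the repo's eval/prompt-loop.py. One caveat worth knowing: load_in_4bit quantizes only the model weights, while the KV cache for a 126k-token prompt is kept at full activation precision, which is likely why 4-bit loading alone did not prevent the OOM.

```python
# Minimal 4-bit loading sketch (transformers + bitsandbytes).
# Note: quantization shrinks the *weights* only; the KV cache for a
# 126k-token context is unaffected and still needs tens of GiB.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "NousResearch/Yarn-Llama-2-7b-128k"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype for 4-bit layers
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",        # let accelerate place/offload layers
    trust_remote_code=True,   # Yarn checkpoints ship custom modeling code
)
```

With device_map="auto", layers that do not fit on the GPU are offloaded to CPU, which can avoid the OOM at a significant speed cost; the more direct fixes remain splitting across multiple GPUs or using a shorter prompt.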