jquesnelle / yarn

YaRN: Efficient Context Window Extension of Large Language Models

OOM when doing text generation #21


sjelassi commented 1 year ago

Hi,

I have been running into out-of-memory issues when trying to generate text with the model "NousResearch/Yarn-Llama-2-7b-128k". I am using a prompt with 126k tokens and running on a single GPU. The script I am using is eval/prompt-loop.py. I tried setting load_in_4bit = True, but it didn't help.
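
For reference, this is roughly what my load path looks like (a minimal sketch assuming the standard transformers/bitsandbytes API, not the exact eval/prompt-loop.py code):

```python
# Minimal sketch of loading the checkpoint with 4-bit weights.
# Note: 4-bit quantization shrinks the *weights* only; the KV cache
# built up by a ~126k-token prompt is still stored in fp16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NousResearch/Yarn-Llama-2-7b-128k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,           # quantizes weights only
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,      # the Yarn checkpoints ship custom modeling code
)
```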

Do you have advice to solve this issue?

Thanks!

cebtenzzre commented 1 year ago

Memory usage scales with context size. You need a lot of memory just to hold the attention KV cache for 128K tokens, before counting weights and activations.
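
For a rough sense of scale, here is a back-of-envelope estimate of the KV cache alone (my numbers, assuming the standard Llama-2-7B architecture: 32 layers, 32 heads, head dim 128, fp16 cache):

```python
# KV-cache size estimate for a Llama-2-7B-style model at ~128K context.
num_layers = 32
num_heads = 32
head_dim = 128
bytes_per_value = 2          # fp16 / bf16
context_tokens = 128 * 1024

# Each token stores one key and one value vector per layer.
kv_bytes = 2 * num_layers * num_heads * head_dim * bytes_per_value * context_tokens
print(f"KV cache alone: {kv_bytes / 1024**3:.0f} GiB")  # ~64 GiB
```

That is before the weights, activations, and the attention computation itself, so on most single GPUs the cache alone does not fit at 128K tokens.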

sjelassi commented 1 year ago

Thanks for your reply! Have you ever tried in-context learning with such a long prompt? If so, how did you do it?

cebtenzzre commented 1 year ago

I believe ALiBi uses less memory at large context lengths, but you can't use it with LLaMA. mpt-7b-storywriter is a model that uses it, but I haven't found it to be very good.