meta-llama / codellama

Inference code for CodeLlama models

Context Length and GPU VRAM Usage in CodeLlama-7B #198

Status: Open. Opened by humza-sami 5 months ago

humza-sami commented 5 months ago

I am currently using CodeLlama-7B on an RTX 3090 24GB GPU, and I have a question regarding the relationship between context length and VRAM usage. According to the model documentation, the context length of CodeLlama-7B is 16,384 tokens.

I loaded the model using Hugging Face with 8-bit precision as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

agent_name = "codellama/CodeLlama-7b-Instruct-hf"
# Weights quantized to 8-bit via bitsandbytes so the model fits comfortably in 24GB
agent = AutoModelForCausalLM.from_pretrained(agent_name, device_map="cuda", load_in_8bit=True)
agent_tokenizer = AutoTokenizer.from_pretrained(agent_name, add_special_tokens=False, add_eos_token=False, add_bos_token=False)
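
For anyone reproducing this, the bare model footprint right after loading can be checked as below (a small sketch; I am assuming get_memory_footprint counts only parameters and buffers, not the KV cache or activations built up during generation):

import torch

# Weight memory as reported by transformers, and what PyTorch has actually allocated on the GPU
print(f"Model footprint: {agent.get_memory_footprint() / 1024**3:.1f} GiB")
print(f"CUDA memory allocated after load: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")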

I then tested the model with different input lengths. For a 3000-token input, the GPU VRAM usage was 16GB. However, when I provided a 6000-token input, the VRAM usage spiked to 22GB. I would like to understand what drives this increase.

Code for Reference:

# Roughly a 6,000-token prompt built by repeating "hello "
text = 6000 * "hello "
encoded_input = agent_tokenizer(text, return_tensors="pt").to("cuda")
# Generate up to 4,000 new tokens on top of that prompt
response = agent.generate(**encoded_input, max_new_tokens=4000, do_sample=True, temperature=0.25)
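
For completeness, this is roughly how the peak usage can be read from inside the script rather than from nvidia-smi (a minimal sketch, continuing the snippet above; I am assuming torch.cuda.max_memory_allocated is a reasonable proxy for the numbers quoted):

import torch

torch.cuda.reset_peak_memory_stats()
response = agent.generate(**encoded_input, max_new_tokens=4000, do_sample=True, temperature=0.25)
# Peak memory allocated by PyTorch during generation, in GiB
print(f"Peak allocated VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")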

Questions:

  1. Is my understanding correct that the model can handle inputs up to its context length of 16,384 tokens?
  2. Could you provide insights into the observed increase in VRAM usage from a 3000-token input to a 6000-token input?
  3. I intend to use CodeLlama-7B with prompts of up to 8000 tokens; would the 24GB of VRAM on my RTX 3090 be sufficient? (A rough estimate of the KV-cache term is sketched after this list.)
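
As background for question 3, here is my rough back-of-the-envelope estimate of the KV-cache growth. The numbers are assumptions on my part: the standard Llama-2-7B architecture (32 layers, hidden size 4096) and a cache kept in fp16, since load_in_8bit quantizes only the weights as far as I understand:

# Back-of-the-envelope KV-cache size (assumed: 32 layers, hidden size 4096, fp16 cache)
num_layers = 32
hidden_size = 4096
bytes_per_value = 2  # fp16
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value  # K and V for every layer
for seq_len in (3000, 6000, 8000, 16384):
    print(f"{seq_len:>6} tokens -> ~{seq_len * kv_bytes_per_token / 1024**3:.1f} GiB KV cache")

This counts only the cache itself; attention workspace and other activations also grow with sequence length, so I expect the real peak to sit well above these numbers.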

Any clarification on these matters would be greatly appreciated. Thank you!