cmunna0052 opened this issue 15 hours ago
Llama 3.1 has a default context size of 131072 tokens, which will consume a considerable amount of VRAM for the cache. Have you tried loading it with a smaller sequence length?
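Something along these lines should keep the cache small (a minimal sketch assuming the exllamav2 Python loader; the model path and the 4096 value are placeholders, not taken from your setup):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config("/path/to/Meta-Llama-3.1-70B-Instruct-2.4bpw-h6-exl2")  # placeholder path
config.max_seq_len = 4096                  # override the 131072-token default from the model config

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=config.max_seq_len, lazy=True)
model.load_autosplit(cache)                # weights plus a much smaller KV cache
```

If you're running one of the bundled test scripts instead, look for a sequence-length argument that does the same thing.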
I don't understand -- shouldn't the context length be set by the length of the prompt plus the number of newly generated tokens? I expected it to be very short, because the prompt is only "Once upon a time" and "tokens" is set to 128.
OS
Linux
GPU Library
CUDA 12.x
Python version
3.12
Pytorch version
2.4.1
Model
LoneStriker/Meta-Llama-3.1-70B-Instruct-2.4bpw-h6-exl2
Describe the bug
The test script yields an out-of-memory error on a model that should be well within the memory limit of a 46 GB EC2 instance.
Reproduction steps
Download the model: LoneStriker/Meta-Llama-3.1-70B-Instruct-2.4bpw-h6-exl2
Run the test script (a sketch of the kind of script involved follows below).
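For concreteness, here is a hypothetical sketch of such a run; it is not the reporter's actual script, only an illustration built from the details above (prompt "Once upon a time", 128 new tokens, default context length), with huggingface_hub assumed for the download step:

```python
# Hypothetical reproduction sketch -- not the actual test script from the report.
from huggingface_hub import snapshot_download
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_dir = snapshot_download("LoneStriker/Meta-Llama-3.1-70B-Instruct-2.4bpw-h6-exl2")

config = ExLlamaV2Config(model_dir)       # picks up max_seq_len = 131072 from the model config
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # cache sized for the full default context
model.load_autosplit(cache)               # presumably where ~46 GB of VRAM runs out

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
print(generator.generate_simple("Once upon a time", settings, 128))
```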
Expected behavior
The model would run and generate.
Logs
No response
Additional context
No response
Acknowledgements