sammcj / gollama

Go manage your Ollama models
https://smcleod.net
MIT License

feat: vram estimator #86

Closed · sammcj closed 1 month ago

sammcj commented 1 month ago

New feature: VRAM estimator!

To estimate VRAM usage:

gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --quant q4_k_m --context 2048 --kvcache q4_0 # For GGUF models
gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --quant 5.0 --context 2048 --kvcache q4_0 # For exl2 models
# Estimated VRAM usage: 5.35 GB
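
As a rough sanity check on that figure (assuming q4_k_m averages about 4.85 bits per weight and the model has roughly 8.0B parameters): 8.0e9 × 4.85 / 8 ≈ 4.9 GB for the quantised weights alone, with the q4_0 KV cache at 2048 tokens and inference overhead accounting for the remainder.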

To calculate maximum context for a given memory constraint:

gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --quant q4_k_m --memory 6 --kvcache q8_0 # For GGUF models
gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --bpw 5.0 --memory 6 --kvcache q8_0 # For exl2 models
# Maximum context for 6.00 GB of memory: 5069

To find the best BPW:

gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --memory 6 --quanttype gguf
# Best BPW for 6.00 GB of memory: IQ3_S

The VRAM estimator works by:

  1. Fetching the model configuration from Hugging Face (if not cached locally)
  2. Calculating the memory requirements for model parameters, activations, and KV cache (see the sketch after this list)
  3. Adjusting calculations based on the specified quantisation settings
  4. Performing binary and linear searches to optimise for context length or quantisation settings
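
Steps 2 and 4 can be pictured with a small Go sketch. Everything below (the config fields, the effective bits-per-weight values, the fixed 0.4 GB overhead) is an illustrative assumption rather than gollama's actual implementation, which accounts for considerably more detail.

```go
package main

import "fmt"

// modelConfig holds the fields the estimator would read from the model's
// Hugging Face config.json. Values in main() are Llama-3-8B-style numbers.
type modelConfig struct {
	NumParams  float64 // total parameter count
	NumLayers  int     // transformer layers
	NumKVHeads int     // KV heads (grouped-query attention)
	HeadDim    int     // dimension per head
}

// estimateVRAMGB returns a rough VRAM estimate in GB for a given effective
// bits-per-weight, context length, and KV-cache bits-per-value.
func estimateVRAMGB(cfg modelConfig, bpw float64, context int, kvBits float64) float64 {
	weights := cfg.NumParams * bpw / 8 // bytes for the quantised weights
	kvDim := float64(cfg.NumKVHeads * cfg.HeadDim)
	// K and V for every layer and every token in the context window.
	kvCache := 2 * float64(cfg.NumLayers) * float64(context) * kvDim * kvBits / 8
	overhead := 0.4e9 // assumed flat allowance for activations/compute buffers
	return (weights + kvCache + overhead) / 1e9
}

// maxContextForMemory binary-searches for the largest context length whose
// estimate still fits within memoryGB, mirroring step 4 above.
func maxContextForMemory(cfg modelConfig, bpw, memoryGB, kvBits float64) int {
	lo, hi := 0, 1_048_576
	for lo < hi {
		mid := (lo + hi + 1) / 2
		if estimateVRAMGB(cfg, bpw, mid, kvBits) <= memoryGB {
			lo = mid // fits: try a larger context
		} else {
			hi = mid - 1 // too big: shrink the upper bound
		}
	}
	return lo
}

func main() {
	llama3 := modelConfig{NumParams: 8.03e9, NumLayers: 32, NumKVHeads: 8, HeadDim: 128}
	// ~4.85 bpw approximates q4_k_m; 4.5 and 8.5 bits approximate q4_0 and q8_0 KV caches.
	fmt.Printf("estimated VRAM: %.2f GB\n", estimateVRAMGB(llama3, 4.85, 2048, 4.5))
	fmt.Printf("max context in 6 GB: %d\n", maxContextForMemory(llama3, 4.85, 6.0, 8.5))
}
```

Because the estimate grows monotonically with context length, a binary search like the one above finds the largest context that fits the memory budget in a handful of evaluations; choosing the best quantisation for a budget is the same idea run over the known quant types.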