sammcj / gollama

Go manage your Ollama models
https://smcleod.net
MIT License

feat: vram estimator #86

Closed sammcj closed 3 months ago

sammcj commented 3 months ago

New feature: vRAM estimator!

To estimate VRAM usage:

gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --quant q4_k_m --context 2048 --kvcache q4_0 # For GGUF models
gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --quant 5.0 --context 2048 --kvcache q4_0 # For exl2 models
# Estimated VRAM usage: 5.35 GB
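For reference, the estimate boils down to quantised weights + KV cache + a little overhead. The Go sketch below shows that arithmetic; the struct fields, bits-per-weight figures, and overhead constant are illustrative assumptions rather than gollama's actual internals:

```go
package main

import "fmt"

// modelConfig holds the transformer dimensions the estimator needs
// (normally read from the model's config.json on Hugging Face).
type modelConfig struct {
	NumParams  float64 // total parameter count, e.g. ~8.03e9 for an 8B model
	NumLayers  float64
	NumKVHeads float64
	HeadDim    float64
}

// estimateVRAMGB approximates VRAM in GB as quantised weights
// plus KV cache for the given context plus a fixed overhead.
func estimateVRAMGB(cfg modelConfig, weightBPW, kvBPW, context float64) float64 {
	const gib = 1 << 30

	weights := cfg.NumParams * weightBPW / 8 // bytes for the quantised weights
	// KV cache: 2 tensors (K and V) per layer, per token, per KV head.
	kvCache := 2 * cfg.NumLayers * cfg.NumKVHeads * cfg.HeadDim * context * kvBPW / 8
	overhead := 0.5 * gib // activations and runtime buffers (rough guess)

	return (weights + kvCache + overhead) / gib
}

func main() {
	// Illustrative numbers for an 8B Llama-3-style model.
	cfg := modelConfig{NumParams: 8.03e9, NumLayers: 32, NumKVHeads: 8, HeadDim: 128}
	fmt.Printf("~%.2f GB\n", estimateVRAMGB(cfg, 4.85 /* ~q4_k_m */, 4.5 /* ~q4_0 KV */, 2048))
}
```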

To calculate the maximum context for a given memory constraint:

gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --quant q4_k_m --memory 6 --kvcache q8_0 # For GGUF models
gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --bpw 5.0 --memory 6 --kvcache q8_0 # For exl2 models
# Maximum context for 6.00 GB of memory: 5069
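Under the hood the `--memory` flag turns this into a search problem: find the largest context whose estimated footprint still fits the budget. A minimal sketch of that binary search, with a deliberately simplified linear cost model standing in for the real estimator (all numbers are assumptions for illustration):

```go
package main

import "fmt"

func main() {
	const budgetGB = 6.0

	// Simplified stand-in for the estimator: a fixed weight cost plus a
	// per-token KV cache cost.
	estimate := func(context int) float64 {
		const weightsGB = 4.5       // quantised weights (assumed)
		const kvGBPerToken = 0.0003 // KV cache per token at q8_0 (assumed)
		return weightsGB + float64(context)*kvGBPerToken
	}

	// Binary search for the largest context that still fits the budget.
	lo, hi := 0, 1<<20
	for lo < hi {
		mid := (lo + hi + 1) / 2
		if estimate(mid) <= budgetGB {
			lo = mid
		} else {
			hi = mid - 1
		}
	}
	fmt.Printf("Maximum context for %.2f GB of memory: %d\n", budgetGB, lo)
}
```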

To find the best BPW (bits per weight) quantisation for a given memory constraint:

gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --memory 6 --quanttype gguf
# Best BPW for 6.00 GB of memory: IQ3_S
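The best-BPW search is essentially a walk over a table of GGUF quantisation types from highest to lowest bits per weight, returning the first one whose estimate fits. A sketch under that assumption, with ballpark bpw values and a fixed cache/overhead figure chosen purely for illustration:

```go
package main

import "fmt"

type quantType struct {
	name string
	bpw  float64 // approximate bits per weight
}

func main() {
	const budgetGB = 6.0
	const numParams = 8.03e9       // 8B model (assumed)
	const cacheAndOverheadGB = 2.0 // KV cache at the default context plus overhead (assumed)

	// Ordered from highest to lowest quality; bpw values are ballpark figures.
	quants := []quantType{
		{"Q8_0", 8.5}, {"Q6_K", 6.6}, {"Q5_K_M", 5.7}, {"Q4_K_M", 4.85},
		{"IQ4_XS", 4.3}, {"IQ3_S", 3.5}, {"IQ2_XS", 2.4},
	}

	for _, q := range quants {
		weightsGB := numParams * q.bpw / 8 / (1 << 30)
		if weightsGB+cacheAndOverheadGB <= budgetGB {
			fmt.Printf("Best BPW for %.2f GB of memory: %s\n", budgetGB, q.name)
			return
		}
	}
	fmt.Println("No quantisation type fits the budget")
}
```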

The vRAM estimator works by:

  1. Fetching the model configuration from Hugging Face if it's not already cached locally (see the sketch after this list)
  2. Calculating the memory requirements for model parameters, activations, and KV cache
  3. Adjusting calculations based on the specified quantisation settings
  4. Performing binary and linear searches to optimise for context length or quantisation settings
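For step 1, here is a minimal sketch of what fetching the configuration can look like: pulling `config.json` from the Hugging Face hub and decoding the fields the memory calculation needs. The URL pattern and JSON field names follow the standard transformers config layout; this is an illustration rather than gollama's exact implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// hfConfig mirrors the fields of a standard transformers config.json
// that the memory calculation needs.
type hfConfig struct {
	HiddenSize        int `json:"hidden_size"`
	NumHiddenLayers   int `json:"num_hidden_layers"`
	NumAttentionHeads int `json:"num_attention_heads"`
	NumKeyValueHeads  int `json:"num_key_value_heads"`
	VocabSize         int `json:"vocab_size"`
}

// fetchConfig downloads and decodes config.json for a Hugging Face repo.
func fetchConfig(repo string) (hfConfig, error) {
	var cfg hfConfig
	url := fmt.Sprintf("https://huggingface.co/%s/resolve/main/config.json", repo)
	resp, err := http.Get(url)
	if err != nil {
		return cfg, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return cfg, fmt.Errorf("fetching %s: %s", url, resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&cfg)
	return cfg, err
}

func main() {
	cfg, err := fetchConfig("NousResearch/Hermes-2-Theta-Llama-3-8B")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```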