microsoft / BitNet

Official inference framework for 1-bit LLMs

Question - Do you still need enough RAM to run the models? #13

Open augmentedstartups opened 1 month ago

augmentedstartups commented 1 month ago

I assume that for a 100B model you would need roughly 100 GB of RAM, or does this reduce the RAM requirements?

halak0013 commented 1 month ago
# Download the model from Hugging Face, convert it to quantized gguf format, and build the project
python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s

# Or you can manually download the model and run with local path
huggingface-cli download HF1BitLLM/Llama3-8B-1.58-100B-tokens --local-dir models/Llama3-8B-1.58-100B-tokens
python setup_env.py -md models/Llama3-8B-1.58-100B-tokens -q i2_s

I initially ran these commands on a system with 26 GB of RAM, but the conversion crashed the system. After adding 20 GB of swap space, the process was able to continue, though it took approximately 20 minutes to complete.

# Add a 20 GB swap file and enable it
sudo fallocate -l 20G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
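
To confirm the swap is active before re-running the conversion:

# Verify swap is enabled and check available memory
swapon --show
free -h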

It is currently running with low resource usage (3.9 GB RAM, 17% CPU).
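
As a rough sanity check on the original RAM question, the size of the converted gguf file on disk gives a lower bound on what inference needs to load; the directory path comes from the download step above, while the exact file name depends on the quantization type chosen:

# Inspect the size of the quantized model produced by setup_env.py
ls -lh models/Llama3-8B-1.58-100B-tokens/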

avcode-exe commented 1 month ago

I allocated 20 GB of RAM plus a 20 GB swap file for my WSL2 instance (I run BitNet under WSL2).
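
For reference, WSL2 memory and swap limits can be set from the Windows side in %UserProfile%\.wslconfig; the values below simply mirror this comment and are only a sketch:

[wsl2]
# Maximum RAM the WSL2 VM may use
memory=20GB
# Size of the WSL2 swap file
swap=20GB

After editing the file, run wsl --shutdown so the new limits apply the next time WSL starts.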

sd983527 commented 1 month ago

Thanks for the question. In the model conversion phase of the demo, a large amount of RAM is still needed. The inference stage requires much less memory, and that should be most users' scenario.
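
For reference, a minimal inference run against the converted model looks roughly like the following; the script name and flags follow the repo README, and the gguf file name is assumed from the i2_s quantization used in the setup step above:

# Run inference on the locally converted model (model path assumed from the earlier setup step)
python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "What is a 1-bit LLM?" -n 64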