This project uses llama.cpp to load models from a local file, delivering fast, memory-efficient inference.
The project currently targets Google Gemma; support for more models is planned.
Download the Gemma model from Google's repository on Hugging Face (https://huggingface.co/google/gemma-2b-it).
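One way to fetch the model is with the `huggingface-cli` tool; the local directory path below is an example, and the Gemma repository is gated, so a Hugging Face access token is required:

```shell
# Log in once with a Hugging Face token (Gemma is a gated repository).
huggingface-cli login
# Download the model weights into a local directory (path is illustrative).
huggingface-cli download google/gemma-2b-it --local-dir models/gemma-2b-it
```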
Quantize the Gemma model (strongly recommended if the target machine has limited memory).
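With llama.cpp, quantization is typically a two-step process: convert the Hugging Face checkpoint to a GGUF file, then quantize it. Script and binary names vary between llama.cpp versions, and the file paths below are examples:

```shell
# Convert the Hugging Face checkpoint to a GGUF file (FP16).
python convert_hf_to_gguf.py models/gemma-2b-it --outfile gemma-2b-it-f16.gguf
# Quantize to 4-bit (Q4_K_M), roughly quartering memory use versus FP16.
./llama-quantize gemma-2b-it-f16.gguf gemma-2b-it-Q4_K_M.gguf Q4_K_M
```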
Start the web UI inside a screen session so it keeps running after you disconnect:

screen -S "webui" bash ./start-ui.sh

Detach with Ctrl-A D and reattach later with `screen -r webui`.