Run AI models locally on your machine
Pre-built bindings are provided with a fallback to building from source with cmake
✨ New! Try the beta of version 3.0.0 ✨ (included: function calling, automatic chat wrapper detection, embedding support, and more)
Building from source does not require node-gyp or Python.
Up-to-date with the latest version of llama.cpp. Download and compile the latest release with a single CLI command.
npm install --save node-llama-cpp@beta.34
This package comes with pre-built binaries for macOS, Linux and Windows.
If binaries are not available for your platform, it will fall back to downloading the latest version of llama.cpp and building it from source with cmake.
To disable this behavior, set the environment variable NODE_LLAMA_CPP_SKIP_DOWNLOAD to true.
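As a rough sketch of how this could look when the install is driven from a script, the helper below copies an environment object and sets NODE_LLAMA_CPP_SKIP_DOWNLOAD to true. The name withSkipDownload is hypothetical and not part of node-llama-cpp; only the variable name and its value come from the text above.

```javascript
// Hypothetical helper (not part of node-llama-cpp): returns a copy of the
// given environment with the download-and-build fallback disabled.
function withSkipDownload(baseEnv) {
    return {
        ...baseEnv,
        NODE_LLAMA_CPP_SKIP_DOWNLOAD: "true"
    };
}

// Build the environment for an install process without mutating process.env
const installEnv = withSkipDownload(process.env);
console.log(installEnv.NODE_LLAMA_CPP_SKIP_DOWNLOAD);
```

The resulting object could be passed as the env option of child_process.spawn when running npm install. Setting the variable in the shell before running the command should have the same effect.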
Note: the previous --cuda flag has been replaced by the --gpu option, as in --gpu cuda or --gpu amd.
npx --no node-llama-cpp download --gpu cuda
const llama = await getLlama();
const model = await llama.loadModel({
    modelPath,
    gpuLayers: 64 // or any other number of layers you want
});
You'll see logs like these in the console when the model loads:
llm_load_tensors: ggml ctx size = 0.09 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 41.11 MB (+ 2048.00 MB per state)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4741 MB
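To check these values programmatically, a small parser along the following lines may help. It is only a sketch: parseGpuOffload is a hypothetical helper, and the patterns are based solely on the sample log lines above, so other llama.cpp versions may format them differently.

```javascript
// Hypothetical helper: extracts GPU offload details from llama.cpp load logs.
// The patterns follow the sample log lines above and may not match other versions.
function parseGpuOffload(logText) {
    const layers = /offloaded (\d+)\/(\d+) layers to GPU/.exec(logText);
    const vram = /VRAM used:\s*(\d+(?:\.\d+)?) MB/.exec(logText);

    return {
        offloadedLayers: layers ? Number(layers[1]) : null,
        totalLayers: layers ? Number(layers[2]) : null,
        vramMB: vram ? Number(vram[1]) : null
    };
}

// Two of the lines from the sample output above
const sampleLog = [
    "llm_load_tensors: offloaded 35/35 layers to GPU",
    "llm_load_tensors: VRAM used: 4741 MB"
].join("\n");

console.log(parseGpuOffload(sampleLog));
```

Fields are null when the corresponding line is missing, which could be used as a hint that the model was loaded without GPU offloading.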
On Linux, you can monitor GPU usage with this command:
watch -d nvidia-smi
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

// Resolve the directory of this file (ES modules have no built-in __dirname)
const __dirname = path.dirname(fileURLToPath(import.meta.url));

// Load the model and create a chat session on top of a context sequence
const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

// The session keeps the chat history, so the follow-up can refer to the previous answer
const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);
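For longer conversations, the repeated prompt-and-log steps above could be folded into a loop. The sketch below is one possible way to do that: askAll and fakeSession are hypothetical names, and the stand-in session only echoes its input so the example runs without a model. The helper assumes nothing beyond an async prompt(text) method, which is how session.prompt is used above.

```javascript
// Hypothetical helper: sends each question to the session in order and
// collects the answers. `session` only needs an async prompt(text) method.
async function askAll(session, questions) {
    const transcript = [];

    for (const question of questions) {
        // Await each answer before sending the next question, so the
        // session receives the conversation in order
        const answer = await session.prompt(question);
        transcript.push({question, answer});
    }

    return transcript;
}

// Stand-in session for demonstration only; replace with a real LlamaChatSession
const fakeSession = {
    async prompt(text) {
        return "echo: " + text;
    }
};

askAll(fakeSession, ["Hi there, how are you?", "Summarize what you said"])
    .then((transcript) => {
        for (const {question, answer} of transcript) {
            console.log("User: " + question);
            console.log("AI: " + answer);
        }
    });
```

With a real session, the same helper would be called as await askAll(session, [q1, q2]).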
For more examples, see the getting started guide.
To contribute to node-llama-cpp, read the contribution guide.
If you like this repo, star it ✨