
gpt-llama.cpp

gpt-llama.cpp logo


Replace OpenAI's GPT APIs with llama.cpp-supported models running locally

Demo

Demo GIF: real-time speedy-interaction-mode demo of gpt-llama.cpp's API + chatbot-ui (a GPT-powered app) running on an M1 Mac with a local Vicuna-7B model. See all demos here.

Discord

Join our Discord server for the latest updates and to chat with the community (200+ members and growing): https://discord.gg/yseR47MqpN

🔥 Hot Topics (5/7) 🔥

Description

gpt-llama.cpp is an API wrapper around llama.cpp. It runs a local API server that simulates OpenAI's GPT API endpoints but uses local llama-based models to process requests.

It is designed to be a drop-in replacement for GPT-based applications, meaning that any apps created for use with GPT-3.5 or GPT-4 can work with llama.cpp instead.

The purpose is to enable GPT-powered apps to run on local models instead of OpenAI's GPT endpoints, which eliminates cost (it's free) and preserves privacy (everything stays local).

Supported platforms

Features

gpt-llama.cpp provides the following features:

Supported applications

The following applications (list growing) have been tested and confirmed to work with gpt-llama.cpp without requiring code changes:

More applications are currently being tested; requests for verification or fixes are welcome via a new issue in the repo.

See all demos here.

Quickstart Installation

🔴🔴 ⚠️ DO NOT SKIP THE PREREQUISITE STEP ⚠️ 🔴🔴

Prerequisite

Set up llama.cpp

Set up llama.cpp by following the instructions below, which are based on the llama.cpp README. You may skip this if you already have llama.cpp set up.

Mac
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# install Python dependencies
python3 -m pip install -r requirements.txt
Windows
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# install Python dependencies
python3 -m pip install -r requirements.txt
Test llama.cpp

Confirm that llama.cpp works by running an example. Replace <YOUR_MODEL_BIN> with your llama model file, typically named something like ggml-model-q4_0.bin.

# Mac
./main -m models/7B/<YOUR_MODEL_BIN> -p "the sky is"

# Windows
main -m models/7B/<YOUR_MODEL_BIN> -p "the sky is"

It'll start spitting random BS, but you're golden if it's responding. You may now move on to running gpt-llama.cpp itself.

Running gpt-llama.cpp

Run Locally

  1. Clone the repository:

    git clone https://github.com/keldenl/gpt-llama.cpp.git
    cd gpt-llama.cpp
    • Strongly recommended folder structure
      documents
      β”œβ”€β”€ llama.cpp
      β”‚   β”œβ”€β”€ models
      β”‚   β”‚   └── <YOUR_.BIN_MODEL_FILES_HERE>
      β”‚   └── main
      └── gpt-llama.cpp
  2. Install the required dependencies:

    npm install
  3. Start the server!

    # Basic usage
    npm start 

You're done! Here are some more advanced configs you can run:

   # To run on a different port
   # Mac
   PORT=8000 npm start

   # Windows cmd
   set PORT=8000
   npm start

   # Pass llama.cpp flags without the leading "--" (e.g. use "mlock" instead of "--mlock")
   npm start mlock threads 8 ctx_size 1000 repeat_penalty 1 lora ../path/lora

   # To use sentence transformers instead of llama.cpp-based embeddings, set the EMBEDDINGS env var to "py"
   # Mac
   EMBEDDINGS=py npm start

   # Windows cmd
   set EMBEDDINGS=py
   npm start

Usage

Test your installation

You have 2 options:

  1. Open another terminal window and test the installation by running the scripts/test-installation script (currently only supports Mac); make sure you have a llama .bin model file ready.

    # Mac
    sh ./scripts/test-installation.sh
  2. Access the Swagger API docs at http://localhost:443/docs to test requests using the provided interface. Note that the authentication token needs to be set to the path of your local llama-based model (e.g. on a Mac, "/Users/<YOUR_USERNAME>/Documents/llama.cpp/models/vicuna/7B/ggml-vicuna-7b-4bit-rev1.bin") for the requests to work properly. A sample curl request is sketched below.
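
If you'd rather test from the command line, here is a minimal sketch of a chat completion request. It assumes the server is running on its default port 443 and mirrors OpenAI's /v1/chat/completions route; as noted above, the bearer token carries the path to your local model rather than an OpenAI key, and the model path shown is only an example.

# sketch of a test request -- swap in the path to your own .bin model
curl http://localhost:443/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer /Users/<YOUR_USERNAME>/Documents/llama.cpp/models/vicuna/7B/ggml-vicuna-7b-4bit-rev1.bin" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'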

API Documentation

Running a GPT-Powered App

There are 2 ways to set up a GPT-powered app:

  1. Use a documented GPT-powered application by following the directions under supported applications.

  2. Use an undocumented GPT-powered application by checking whether it supports openai.api_base:

    • Update the openai_api_key field in the GPT-powered app to the absolute path of your local llama-based model (e.g. on a Mac, "/Users/<YOUR_USERNAME>/Documents/llama.cpp/models/vicuna/7B/ggml-vicuna-7b-4bit-rev1.bin").
    • Change the BASE_URL for the OpenAI endpoint the app calls to localhost:443 or localhost:443/v1. This is sometimes exposed in a .env file; otherwise it requires manually updating the app's OpenAI calls, depending on the specific application. See the sketch after this list.
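
As a concrete illustration, many GPT-powered apps read these settings from environment variables or a .env file. The lines below are only a sketch: the variable names OPENAI_API_KEY and OPENAI_API_BASE are common conventions rather than something every app guarantees, and the model path is an example.

# example only -- variable names depend on the specific app's configuration
export OPENAI_API_KEY=/Users/<YOUR_USERNAME>/Documents/llama.cpp/models/vicuna/7B/ggml-vicuna-7b-4bit-rev1.bin
export OPENAI_API_BASE=http://localhost:443/v1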

Obtaining and verifying the Facebook LLaMA original model and Stanford Alpaca model data

Contributing

You can contribute to gpt-llama.cpp by creating branches and opening pull requests to merge. Please follow the standard open-source contribution process.

License

This project is licensed under the MIT License. See the LICENSE file for more details.