llama2.rs 🤗

This is a Rust implementation of Llama2 inference on CPU

The goal is to be as fast as possible.

It has the following features:

Support for 4-bit GPT-Q Quantization
Batched prefill of prompt tokens
SIMD support for fast CPU inference
Memory mapping, loads 70B instantly.
Static size checks for safety
Support for Grouped Query Attention (needed for big Llamas)
Python calling API

Can run up on 1 tok/s 70B Llama2 and 9 tok/s 7B Llama2. (on my intel i9 desktop)

To build, you'll need the nightly toolchain, which is used by default:

> rustup toolchain install nightly # to get nightly
> ulimit -s 10000000 # Increase your stack memory limit.

You can load models from the Hugging Face hub. For example this creates a version of a 70B quantized) model with 4 bit quant and 64 sized groups:

> pip install -r requirements.export.txt
> python export.py l70b.act64.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ gptq-4bit-64g-actorder_True

The library needs to be recompiled to match the model. You can do this with cargo.

To run:

> cargo run --release --features 70B,group_64,quantized -- -c llama2-70b-q.bin -t 0.0 -s 11 -p "The only thing"                                                                                                                                 
The only thing that I can think of is that the          
achieved tok/s: 0.89155835

Honestly, not so bad for running on my GPU machine, significantly faster than llama.c.

Here's a run of 13B quantized:

> cargo run --release --features 13B,group_128,quantized -- -c l13orca.act.bin -t 0.0 -s 25 -p "Hello to all the cool people out there who "
Hello to all the cool people out there who are reading this. I hope you are having a great day. I am here
achieved tok/s: 5.1588936

Here's a run of 7B quantized:

cargo run --release --features 7B,group_128,quantized -- -c l7.ack.bin -t 0.0 -s 25 -p "Hello to all the cool people out there who "
> Hello to all the cool people out there who are reading this. I am a newbie here and I am looking for some
achieved tok/s: 9.048136

Python

To run in Python, you need to first compile from the main directory with the python flag.

cargo build --release --features 7B,group_128,quantized,python
pip install .

You can then run the following code.

import llama2_rs

def test_llama2_13b_4_128act_can_generate():
    model = llama2_rs.LlamaModel("lorca13b.act132.bin", False)
    tokenizer = llama2_rs.Tokenizer("tokenizer.bin")
    random = llama2_rs.Random()
    response = llama2_rs.generate(
        model,
        tokenizer,
        "Tell me zero-cost abstractions in Rust ",
        50,
        random, 
        0.0
    )

Todos

[ ] Support fast GPU processing with Triton
[ ] Support https://github.com/oobabooga/text-generation-webui
[ ] Documentation
[ ] Blog Post about the methods for fast gptq
[ ] Remove dependency on AutoGPTQ for preloading
[ ] Support for safetensors directly.

Configuration

In order to make the model as fast as possible, you need to compile a new version to adapt to other Llama versions. Currently in .cargo/config. The model will fail if these disagree with the binary model that is being loaded. To turn quantization off set it to quant="no".

How does it work?

Started as a port of the original code, with extra type information to make it easier to extend.

There are some dependencies:

memmap2for memory mapping
rayon for parallel computation.
clap for command-line args.
pyO3 for python calling
SIMD enabled support with portable_simd

Authors

Llama2.rs is written by @srush and @rachtsingh.

srush / llama2.rs

readme