
Alpaca Lora 4bit

Made some adjustments to the code in peft and GPTQ-for-LLaMa, making LoRA finetuning possible with a 4-bit base model. The same adjustments can be made for 2, 3, and 8 bits.

Installation

git clone https://github.com/johnsmith0031/alpaca_lora_4bit.git
cd alpaca_lora_4bit
git fetch origin winglian-setup_pip
git checkout winglian-setup_pip
pip install .
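
To sanity-check the build, assuming the package installs under the name alpaca_lora_4bit:

python -c "import alpaca_lora_4bit; print('alpaca_lora_4bit imported OK')"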

To uninstall and reinstall, run:

cd alpaca_lora_4bit
pip uninstall alpaca_lora_4bit
pip uninstall alpaca_lora_4bit # uninstall again to ensure that you do not have another version
pip install .

For older cards where the default build fails to compile:

git clone https://github.com/johnsmith0031/alpaca_lora_4bit.git
cd alpaca_lora_4bit
git fetch origin old_compatible
git checkout old_compatible
pip install .

Docker

Note: currently does not work

Quick start for running the chat UI

git clone https://github.com/johnsmith0031/alpaca_lora_4bit.git
cd alpaca_lora_4bit
DOCKER_BUILDKIT=1 docker build -t alpaca_lora_4bit . # build step can take 12 min
docker run --gpus=all -p 7860:7860 alpaca_lora_4bit

Point your browser to http://localhost:7860

Results

Inference is fast on a mobile 3070 Ti, using 5-6 GB of GPU RAM.


Finetune

After installation, finetuning can be run with this script:

python finetune.py ./data.txt \
    --ds_type=txt \
    --lora_out_dir=./test/ \
    --llama_q4_config_dir=./llama-7b-4bit/ \
    --llama_q4_model=./llama-7b-4bit.pt \
    --mbatch_size=1 \
    --batch_size=1 \
    --epochs=3 \
    --lr=3e-4 \
    --cutoff_len=256 \
    --lora_r=8 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --warmup_steps=5 \
    --save_steps=50 \
    --save_total_limit=3 \
    --logging_steps=5 \
    --groupsize=128 \
    --xformers \
    --backend=cuda
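
After training, the adapter saved to --lora_out_dir can be attached to a loaded 4-bit base model through the patched peft. A minimal sketch, assuming model is the already-loaded quantized base model:

from peft import PeftModel

# attach the trained adapter from --lora_out_dir (./test/ in the example above)
model = PeftModel.from_pretrained(model, './test/')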

Inference

After installation, this script can be used:

python inference.py
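
As a rough sketch of what the script does: the repo's autograd_4bit module provides a low-RAM loader for the quantized checkpoint. The import path and exact signature here are assumptions and may differ between branches:

from alpaca_lora_4bit.autograd_4bit import load_llama_model_4bit_low_ram  # assumed import path

# load the quantized config + checkpoint from the finetune example above
model, tokenizer = load_llama_model_4bit_low_ram(
    './llama-7b-4bit/',     # config directory
    './llama-7b-4bit.pt',   # 4-bit checkpoint
    groupsize=128,
)

prompt = tokenizer("The quick brown fox", return_tensors='pt').to(model.device)
output = model.generate(**prompt, max_new_tokens=64)
print(tokenizer.decode(output[0]))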

Text Generation Webui Monkey Patch

Clone the latest version of text-generation-webui and copy all the files from this repo into ./text-generation-webui/:

git clone https://github.com/oobabooga/text-generation-webui.git

Open server.py and insert this line at the very beginning:

import custom_monkey_patch # apply monkey patch
...
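
Equivalently, the line can be inserted from the shell with GNU sed:

sed -i '1i import custom_monkey_patch # apply monkey patch' server.py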

Then run it with:

python server.py

Monkey Patch Inside Webui

The webui now supports this repo through its built-in monkey patch.
Simply clone this repo into ./repositories/ inside the text-generation-webui directory, as shown below.
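
For example:

cd text-generation-webui
mkdir -p repositories
cd repositories
git clone https://github.com/johnsmith0031/alpaca_lora_4bit.git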

Flash Attention

Flash Attention 2 is now supported, using the flash_attn_func function directly. Only Llama / Llama 2 based models are supported at the moment.
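
flash_attn_func comes from the flash-attn package. A minimal illustration of the underlying call (not this repo's patch itself), with tensors shaped (batch, seqlen, heads, headdim):

import torch
from flash_attn import flash_attn_func

q = torch.randn(1, 128, 32, 128, dtype=torch.float16, device='cuda')
k = torch.randn(1, 128, 32, 128, dtype=torch.float16, device='cuda')
v = torch.randn(1, 128, 32, 128, dtype=torch.float16, device='cuda')

# causal=True matches decoder-only Llama attention masking
out = flash_attn_func(q, k, v, causal=True)  # (1, 128, 32, 128)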

Xformers
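
Memory-efficient attention from xformers can be enabled during finetuning with the --xformers flag shown in the finetune example above.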

Quant Attention and MLP Patch

Note: this patch does not currently support peft LoRA, but inject_lora_layers can load a simple LoRA that only targets q_proj and v_proj.

Usage:

from model_attn_mlp_patch import make_quant_attn, make_fused_mlp, inject_lora_layers

# fuse the quantized attention projections into a single op
make_quant_attn(model)
# fuse the quantized MLP projections
make_fused_mlp(model)

# Lora
inject_lora_layers(model, lora_path)

For the faster LoRA injection only:

from monkeypatch.gptq_for_llala_lora_monkey_patch import inject_lora_layers
inject_lora_layers(model, lora_path, device, dtype)

Model Server

Better inference performance with text-generation-webui: about 40% faster.

Simple experiment results: a 7B model with groupsize=128 and no act-order improved from 13 tokens/sec to 20 tokens/sec.

Steps:

  1. run model server process
  2. run webui process with monkey patch

Example

run_server.sh

#!/bin/bash

export PYTHONPATH=$PYTHONPATH:./

CONFIG_PATH=    # path to the model config directory
MODEL_PATH=     # path to the 4-bit checkpoint
LORA_PATH=      # path to the LoRA adapter

VENV_PATH=      # path to the Python virtualenv
source $VENV_PATH/bin/activate
python ./scripts/run_server.py --config_path $CONFIG_PATH --model_path $MODEL_PATH --lora_path $LORA_PATH --groupsize=128 --quant_attn --port 5555 --pub_port 5556

run_webui.sh

#!/bin/bash

# rebuild server2.py: server.py with the model server monkey patch prepended
if [ -f "server2.py" ]; then
    rm server2.py
fi
echo "import custom_model_server_monkey_patch" > server2.py
cat server.py >> server2.py

export PYTHONPATH=$PYTHONPATH:../

VENV_PATH=
source $VENV_PATH/bin/activate
python server2.py --chat --listen
