Open MichaelMartinez opened 10 months ago
Great, let me know how that goes. I'd love to make it so you can use any model you want. I used the Llama 2 chat model just to show how well this can work with a limited, non-fine-tuned instruction model, and it's actually not bad. Any ideas for 13B-or-smaller models that could write React really well? I'll just change the default to that.
On Mon, Aug 21, 2023 at 4:54 PM Michael Martinez wrote:
Cool project!!!
There are a lot of models out there that will probably perform way better than vanilla llama 2. To get an idea, have a look at this HF space: https://huggingface.co/spaces/mike-ravkine/can-ai-code-results
Also, running a 13B model locally is relatively trivial at this point, even on modest hardware if you use a quantized version. I am looking at your code base to see where a good entry point for something like llama.cpp or textgen-webui could be added.
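One way to make that entry point concrete is a small backend interface that the rest of the app codes against. The names below are just a sketch (they are not from the AutoStartup code base), and the llama.cpp adapter assumes the llama-cpp-python package:

```python
from abc import ABC, abstractmethod

class GenerationBackend(ABC):
    """Seam the app would call instead of touching transformers directly."""

    @abstractmethod
    def generate(self, prompt: str, max_new_tokens: int = 512) -> str:
        ...

class TransformersBackend(GenerationBackend):
    """Wraps a transformers text-generation pipeline."""

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def generate(self, prompt, max_new_tokens=512):
        out = self.pipeline(prompt, max_new_tokens=max_new_tokens)
        return out[0]["generated_text"]

class LlamaCppBackend(GenerationBackend):
    """Hypothetical llama.cpp adapter via the llama-cpp-python bindings."""

    def __init__(self, model_path):
        from llama_cpp import Llama  # pip install llama-cpp-python
        self.llm = Llama(model_path=model_path)

    def generate(self, prompt, max_new_tokens=512):
        out = self.llm(prompt, max_tokens=max_new_tokens)
        return out["choices"][0]["text"]
```

A textgen-webui backend could implement the same `generate()` method against its HTTP API, so swapping models or runtimes wouldn't touch the rest of the code.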
Change this block of code:
```python
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-13b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
```
For this:
```python
from transformers import AutoTokenizer, pipeline
import torch
from auto_gptq import AutoGPTQForCausalLM

# model = "meta-llama/Llama-2-13b-chat-hf"  # this works, but needs about 25-30 GB of VRAM
model = "TheBloke/CodeLlama-7B-Instruct-GPTQ"  # quantized model; try this on GPUs with under 15 GB of VRAM
model_basename = "model"
use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)
# Note: this loads the quantized model itself, not a tokenizer
quantized_model = AutoGPTQForCausalLM.from_quantized(
    model,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=use_triton,
    quantize_config=None,
)
pipeline = pipeline(
    "text-generation",
    model=quantized_model,
    tokenizer=tokenizer,
    max_new_tokens=4096,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)
```
With this you can use quantized models, CodeLlama, etc...
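One thing to watch with instruction-tuned checkpoints like CodeLlama-7B-Instruct is the prompt template. A small helper like the one below (a sketch of the `[INST]` format these models were trained on; worth double-checking against the model card on the Hub) keeps the formatting in one place:

```python
def format_instruct_prompt(user_message, system_prompt=None):
    """Build a Llama-2-chat / CodeLlama-Instruct style prompt string.

    Sketch of the [INST] template; verify the exact format against
    the model card for the checkpoint you load.
    """
    if system_prompt:
        return (
            f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"

prompt = format_instruct_prompt(
    "Write a React component that renders a login form.",
    system_prompt="You only answer with code.",
)
# pipeline(prompt) would then receive the properly formatted string
```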