Open MichaelMartinez opened 10 months ago
Great, let me know how that goes. I'd love to make it so you can use any model you want. I used the Llama 2 chat model just to show how well this can work with a limited, non-fine-tuned instruction model, and it's actually not bad. Any ideas for 13B-or-smaller models that could write React really well? I'll just change the default to that.
On Mon, Aug 21, 2023 at 4:54 PM Michael Martinez wrote:
Cool project!!!
There are a lot of models out there that will probably perform way better than vanilla llama 2. To get an idea, have a look at this HF space: https://huggingface.co/spaces/mike-ravkine/can-ai-code-results
Also, running a 13B model locally is relatively trivial at this point, even on modest hardware if you use a quantized version. I am looking at your code base to see where a good entry point for something like llama.cpp or textgen-webui could be added.
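One way to make that entry point concrete is a small backend interface that the rest of the app codes against. The names below are just a sketch (they are not from the AutoStartup code base), and the llama.cpp adapter assumes the llama-cpp-python package:

```python
from abc import ABC, abstractmethod

class GenerationBackend(ABC):
    """Seam the app would call instead of touching transformers directly."""

    @abstractmethod
    def generate(self, prompt: str, max_new_tokens: int = 512) -> str:
        ...

class TransformersBackend(GenerationBackend):
    """Wraps a transformers text-generation pipeline."""

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def generate(self, prompt, max_new_tokens=512):
        out = self.pipeline(prompt, max_new_tokens=max_new_tokens)
        return out[0]["generated_text"]

class LlamaCppBackend(GenerationBackend):
    """Hypothetical llama.cpp adapter via the llama-cpp-python bindings."""

    def __init__(self, model_path):
        from llama_cpp import Llama  # pip install llama-cpp-python
        self.llm = Llama(model_path=model_path)

    def generate(self, prompt, max_new_tokens=512):
        out = self.llm(prompt, max_tokens=max_new_tokens)
        return out["choices"][0]["text"]
```

A textgen-webui backend could implement the same `generate()` method against its HTTP API, so swapping models or runtimes wouldn't touch the rest of the code.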
Change this block of code:
```python
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-13b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
```
For this:
```python
from transformers import AutoTokenizer, pipeline
import torch
from auto_gptq import AutoGPTQForCausalLM

# model = "meta-llama/Llama-2-13b-chat-hf"  # this works, but needs about 25-30 GB of VRAM
model = "TheBloke/CodeLlama-7B-Instruct-GPTQ"  # quantized model; try this on GPUs with under 15 GB of VRAM
model_basename = "model"
use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)
# Note: this loads the quantized model itself, not a tokenizer
quantized_model = AutoGPTQForCausalLM.from_quantized(
    model,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=use_triton,
    quantize_config=None,
)
pipeline = pipeline(
    "text-generation",
    model=quantized_model,
    tokenizer=tokenizer,
    max_new_tokens=4096,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)
```
With this you can use quantized models, CodeLlama, etc...
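One thing to watch with instruction-tuned checkpoints like CodeLlama-7B-Instruct is the prompt template. A small helper like the one below (a sketch of the `[INST]` format these models were trained on; worth double-checking against the model card on the Hub) keeps the formatting in one place:

```python
def format_instruct_prompt(user_message, system_prompt=None):
    """Build a Llama-2-chat / CodeLlama-Instruct style prompt string.

    Sketch of the [INST] template; verify the exact format against
    the model card for the checkpoint you load.
    """
    if system_prompt:
        return (
            f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"

prompt = format_instruct_prompt(
    "Write a React component that renders a login form.",
    system_prompt="You only answer with code.",
)
# pipeline(prompt) would then receive the properly formatted string
```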