guidance-ai / guidance

A guidance language for controlling large language models.

[Feature Request] GPTQ support #205

Open the-xentropy opened 1 year ago

the-xentropy commented 1 year ago

Is your feature request related to a problem? Please describe.
VRAM is a major limitation for running most models locally, and guidance by design requires running models locally to get the most value out of the library. Hence, guidance is in a position to really benefit from supporting quantization approaches; GPTQ is the one in question for this FR.

Describe the solution you'd like
There are numerous very interesting quantization libraries. For example, we already have a pull request for GGML support through llama-cpp-python, but we lack equivalent support for GPTQ.

It's actually very easy to get working locally already. For example:

!git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git --depth=1 gptq

import os.path
import sys
import transformers

sys.path.append(os.path.realpath("./gptq") + "/")
import llama_inference

# the oobabooga fork has a bug in llama_inference.py that references `transformers` before importing it; patching it in is a dumb but easy fix
llama_inference.transformers = transformers
model = llama_inference.load_quant("TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ", "Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors", 4, 128).to("cuda:0")
tokenizer = transformers.LlamaTokenizer.from_pretrained("TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ")

Then, as usual:

import guidance
guidance.llm = guidance.llms.transformers.Vicuna(model=model, tokenizer=tokenizer, device="cuda:0")

However, the above is obviously a very hack-y proof of concept, and we might benefit from a more standardized/robust means of supporting GPTQ. Given the obvious value of being able to load larger models with less VRAM, I think there's a good case for offering some helpers to load GPTQ models more conveniently.

Describe alternatives you've considered
We have the option of using oobabooga's fork, or the more directly upstream repo at https://github.com/qwopqwop200/GPTQ-for-LLaMa . Personally, I would opt for oobabooga's fork because it sees less churn and fewer experimental upgrades/changes, and it is a key dependency of the extremely popular oobabooga text-generation-webui, so it is likely to prioritize staying functional.

LoopControl commented 1 year ago

Yes please, this would be great. This would allow loading up to 30B models on a consumer GPU with 24GB of VRAM.

Also, instead of GPTQ-for-LLaMa, please use AutoGPTQ ( https://github.com/PanQiWei/AutoGPTQ ), as it is a higher-level library and takes literally one line to import:

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename=model_basename,
    device="cuda:0",
    use_safetensors=True,
    use_triton=False,
    use_cuda_fp16=False,
)
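
From there it should drop straight into guidance's existing transformers wrapper. Roughly like this (untested sketch: quantized_model_dir and model_basename are placeholders as above, and I'm assuming the tokenizer can come from the regular transformers AutoTokenizer):

from transformers import AutoTokenizer
import guidance

# reuse the AutoGPTQ-loaded `model` from the snippet above
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)
guidance.llm = guidance.llms.Transformers(model, tokenizer)
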
knoopx commented 1 year ago

@the-xentropy AutoGPTQ and GPTQ-for-LLaMa are already supported through the transformers adapter. I can't see why using a dependency is a "hack-y proof of concept". If there's anything to be done, it's on the GPTQ-for-LLaMa side: ask them to properly package it as a library.

the-xentropy commented 1 year ago

Hack-y as in "I dynamically altered the imports of a CLI tool to fix a bug in it and make it usable as a library" :)

They're definitely API-compatible if you're willing to spend time on it and hunt down weird behaviors, but whether it's convenient or a supported use case is a different matter. Unfortunately, the current state of compatibility and ease of use leaves a bit to be desired.

Examples:

1)

For both AutoGPTQ and GPTQ-for-LLaMa, the TheBloke/guanaco-33B-GPTQ model will only ever generate one token when used with guidance, and turning off acceleration, streaming, token healing, etc. seems to make no difference. Other models seem to work fine - but why? No clue.

Clearly this has to be fixed somewhere in guidance, since the problem only shows up with this library, but until there's a decision on how to support GPTQ, and whether that's even desirable, reports like this are really tricky to triage and (imo) make more sense to think of as edge cases or milestones toward proper GPTQ support.

2)

When loading TheBloke/guanaco-33B-GPTQ, AutoGPTQ uses 22 GB of VRAM for a 33B model in 4-bit mode.

GPTQ-for-LLaMa takes much longer to load but uses only 16 GB of VRAM, and it requires that you manually download the model since it can't load from the Hugging Face Hub directly.

If they're both otherwise mostly compatible, do we want to document which GPTQ 'functionality provider' is better suited for use with guidance? This isn't really a technical decision, but GPTQ is a critical feature for the hobbyist crowd, who are extra reliant on ease-of-use features and approachable onboarding since most of them are not ML experts, so guidance could benefit from taking a stance and documenting how to do it in order to boost popularity.

I think in either case GPTQ support is neither at zero nor at a hundred percent, so until there's a decision on the direction to take it in, the most helpful thing we can do is keep documenting what works and what doesn't, both for each other's sake and as a record of potential bugfixes to track long-term.

Glavin001 commented 1 year ago

I'd love to contribute to adding GPTQ support, although I may need some guidance on where to get started (has someone already tried this and left a WIP branch I could continue?), any known gotchas to avoid, acceptance criteria, etc.

Is there a reason @LoopControl 's suggestion wouldn't work as follows?

   from transformers import AutoTokenizer, AutoModelForCausalLM
+  from auto_gptq import AutoGPTQForCausalLM

   tokenizer = AutoTokenizer.from_pretrained(model_name)
-  model = AutoModelForCausalLM.from_pretrained(model_name)
+  model = AutoGPTQForCausalLM.from_quantized(
+      quantized_model_dir,
+      model_basename=model_basename,
+      device="cuda:0",
+      use_safetensors=True,
+      use_triton=False,
+      use_cuda_fp16=False,
+  )

  llm = guidance.llms.Transformers(model, tokenizer)

(Normally I'd test instead of ask, but I'm unable to test tonight. I might be able to try later this week.)

Or is the acceptance criteria actually to specifically support GPTQ-for-LLaMa?

Or is the acceptance criteria the following?

guidance.llm = guidance.llms.AutoGPTQ(
    quantized_model_dir,
    model_basename=model_basename,
    device="cuda:0",
    use_safetensors=True,
    use_triton=False,
    use_cuda_fp16=False,
)

If the latter, should I extend the Transformers class? If so, I'm not sure how to support the specialized transformer classes, such as guidance.llms.transformers.LLaMA(...), without duplicating their files -- any ideas?
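
If it's the latter, my naive guess is that the wrapper could stay very thin. Here's a rough, untested sketch that only assumes guidance.llms.Transformers accepts a preloaded model/tokenizer pair, as in the diff above (the class name and defaults are placeholders, not an existing guidance API):

import guidance
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

class AutoGPTQ(guidance.llms.Transformers):
    """Hypothetical helper: load a GPTQ checkpoint via AutoGPTQ, then hand the
    in-memory model/tokenizer pair to the generic Transformers LLM wrapper."""

    def __init__(self, quantized_model_dir, model_basename=None, device="cuda:0", **kwargs):
        model = AutoGPTQForCausalLM.from_quantized(
            quantized_model_dir,
            model_basename=model_basename,
            device=device,
            use_safetensors=True,
        )
        tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)
        super().__init__(model, tokenizer, **kwargs)

The role-template subclasses (guidance.llms.transformers.LLaMA, Vicuna, ...) are exactly the part I don't see how to reuse without duplicating their files.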

Disclaimer: Not very familiar with Python, Huggingface transformers, etc so apologies in advance for any silly questions.

freckletonj commented 8 months ago

So it doesn't seem like much is technically required to implement this, but it is kinda hacky right now, at least when I try it with a Mixtral GPTQ model.

If I just do:

models.Transformers("mixtral-path")

I get an error that bf16 isn't supported on CPU, of course. But I don't want it ever to end up on my CPU.

So I can do this instead:

from transformers import AutoModelForCausalLM, AutoTokenizer
from guidance import models

model = AutoModelForCausalLM.from_pretrained(MODEL_PATH,
                                             device_map="auto",  # let accelerate keep the weights on the GPU
                                             trust_remote_code=False,
                                             revision=REV)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, use_fast=True)
gmodel = models.transformers.LlamaChat(model=model, tokenizer=tokenizer)

That works, but it feels hacky to use models.transformers.LlamaChat, and then how do I get control over the chat templating? Or do I need to use models.transformers.Llama, which I assume doesn't impose a chat template, so that I can choose my own? These are questions I'll need to dig into the guidance code to answer, and that's a slight infelicity of this library.
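
For now, the workaround I'm leaning toward is the base wrapper plus a hand-written prompt format. Untested sketch, assuming models.transformers.Llama accepts a preloaded model/tokenizer pair the same way LlamaChat does, and that this checkpoint expects the Mistral-style [INST] format:

from guidance import models, gen

# no chat template imposed (my assumption), so spell the format out by hand
gmodel = models.transformers.Llama(model=model, tokenizer=tokenizer)
lm = gmodel + "[INST] Explain GPTQ in one sentence. [/INST] " + gen("answer", max_tokens=64)
print(lm["answer"])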