unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

[Feature request] Support GPTQ quantization #39

Open araleza opened 8 months ago

araleza commented 8 months ago

So I have a GPTQ llama model I downloaded (from TheBloke), and it's already 4 bit quantized. I have to pass in False for the load_in_4bit parameter of:

model, tokenizer = FastLlamaModel.from_pretrained(

because if I don't, I get an error thrown saying:

The model is already quantized with gptq. You can't quantize it again with bitsandbytes

But, if I pass in False for load_in_4bit, this code makes bnb_config be None:

        bnb_config = None
        if load_in_4bit:
            bnb_config = BitsAndBytesConfig(
                load_in_4bit              = True,
                bnb_4bit_use_double_quant = True,
                bnb_4bit_quant_type       = "nf4",
                bnb_4bit_compute_dtype    = dtype,
            )

and that makes quantization_config be None as well:

quantization_config = bnb_config,

and that crashes here:

        if hasattr(self, "quantization_config"):
            output["quantization_config"] = (
                self.quantization_config.to_dict()

with the error message:

'NoneType' object has no attribute 'to_dict'

So I'm not sure how to LoRA train this llama model. Any thoughts?

araleza commented 8 months ago

I tried adding:

[...] and self.quantization_config is not None:

to the end of that line there (and similar additions in two other places that came up), and it hasn't crashed, but it's now taking a very long time to load the model, so maybe it's doing some unwanted conversion?
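For reference, the guarded check ends up looking roughly like this (a sketch of the relevant spot in transformers' config serialization with the extra None check applied, not the exact upstream code):

if hasattr(self, "quantization_config") and self.quantization_config is not None:
    output["quantization_config"] = self.quantization_config.to_dict()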

araleza commented 8 months ago

Yeah, it finally 'loaded' but then it said some weights of the model checkpoint were not used when initializing LlamaForCausalLM, and it listed a giant list of weights, which I'm guessing was all of them.

Then the LoRA training crashed with:

Cannot copy out of meta tensor; no data!

So something definitely did not go well.

danielhanchen commented 8 months ago

@araleza Oh no I don't think GPTQ models are supported as of yet

danielhanchen commented 8 months ago

Currently only QLoRA via bitsandbytes is supported, hence all the error messages. If GPTQ is a super popular request, I will add it in - the dequantization steps will just be replaced, but I will have to read up on how GPTQ does it internally.

For now, is it possible to use a non GPTQ quantized model?

araleza commented 8 months ago

For now, is it possible to use a non GPTQ quantized model?

I don't know actually... I've only done LoRA training with oobabooga's Training tab, and it can only do LoRA training with unquantized models, or GPTQ models (which you have to load with the Transformers loader). So I don't know how to load a quantized model of any format except GPTQ onto my GPU. Any tips for which format to use instead, but still have it fit on my 24GB GPU?

danielhanchen commented 8 months ago

@araleza Would it be possible to try loading a non-quantized model, then pass load_in_4bit = True via Unsloth? It should load into your CPU RAM, then it quantizes and loads it onto the GPU.
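Something along these lines (a sketch only; the model name and max_seq_length are illustrative placeholders, not from this thread):

from unsloth import FastLanguageModel

# Load full-precision weights, then let bitsandbytes quantize to 4-bit on the fly.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "meta-llama/Llama-2-7b-hf",   # placeholder model
    max_seq_length = 2048,
    dtype          = None,        # autodetect fp16 / bf16
    load_in_4bit   = True,        # quantize with bitsandbytes while loading
)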

danielhanchen commented 8 months ago

I'll see for a future release if I can add GPTQ support!

danielhanchen commented 8 months ago

I was actually just reading up on HQQ (half-quadratic quantization) https://github.com/mobiusml/hqq and maybe I'll add HQQ instead of GPTQ, since HQQ has no need for calibration data, whilst GPTQ does.

araleza commented 8 months ago

Sounds good. I think you've got two groups of people who want to use your software:

1) People who have a big model and big training data, and want the fine-tuning to be faster.
2) People with 24GB cards who want to train larger models, but without quantizing them so badly that the training is meaningless.

Supporting HQQ would help the people in group 2, like me.

danielhanchen commented 8 months ago

@araleza Cool I'll get on with HQQ! It seems like even Mixtral can supposedly fit on a 24GB card!

But HQQ supports 8, 4, 3 and 2 bit quantization so it'll be pretty useful!

jeromeku commented 7 months ago

@danielhanchen happy to pitch in with quantization (or other feature requests). let me know how best to contribute!

danielhanchen commented 7 months ago

@jeromeku More than happy to collaborate! I was actually taking a look at GPTQ the other day - I guess technically Unsloth can add in GPTQ during training - all we need is to port the dequantization kernels from GPTQ to float16 / bfloat16, and if that works, then GPTQ will be supported.

For now, I'm using bitsandbytes' dequantization kernels.

Again more than happy to collaborate if you're interested!

jeromeku commented 7 months ago

@danielhanchen That should work -- this is what QLoRA does under the hood for non-LoRA weights right? I.e., dequantizes 'frozen' weights to f16 / bf16 in order to pass grads through non-LoRA layers.

I can take a crack at this if you're more keen on working on HQQ...

danielhanchen commented 7 months ago

@jeromeku I'll investigate GPTQ's dequant kernels as well! But if you're interested in adding GPTQ support - I'm more than happy for a few more OSS collaborators!

Essentially, the main gist of it:

  1. Find how GPTQ dequantizes its quantized weights to float16 / bfloat16
  2. Extract this functionality from say Huggingface internals or some other provider like Exllama / llama.cpp etc
  3. Replace fast_dequantize with GPTQ equivalent kernels
  4. Fix up the few lines where Linear4bit naming conventions appear, replacing them with the GPTQ equivalents.
  5. If 3 works as is, then Unsloth is now GPTQ compatible!

If you wanna take a crack at that - I'll be super grateful! In fact just step 1 or 2 is enough for a general GPTQ integration!
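For step 1, here's a rough sketch of what 4-bit GPTQ dequantization amounts to, assuming the common AutoGPTQ tensor layout (qweight / qzeros / scales / g_idx); these names are not from this thread, and some GPTQ variants store the zero points offset by one:

import torch

def gptq_dequantize_4bit(qweight, qzeros, scales, g_idx):
    # qweight: (in_features // 8, out_features) int32, 8 packed 4-bit values per int32
    # qzeros : (n_groups, out_features // 8)    int32, 8 packed 4-bit zero points per int32
    # scales : (n_groups, out_features)         fp16 / bf16 per-group scales
    # g_idx  : (in_features,)                   group index for each input feature
    shifts = torch.arange(0, 32, 4, device=qweight.device)

    weight = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF
    weight = weight.reshape(-1, qweight.shape[1])                     # (in, out)

    zeros = (qzeros.unsqueeze(2) >> shifts.view(1, 1, -1)) & 0xF
    zeros = zeros.reshape(qzeros.shape[0], -1)                        # (groups, out)

    # Dequantize: (int4 value - zero point) * scale, expanded per input row via g_idx.
    return (weight - zeros[g_idx]).to(scales.dtype) * scales[g_idx]   # (in, out)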

jeromeku commented 7 months ago

@danielhanchen Will work on it!

danielhanchen commented 7 months ago

@jeromeku Great! If you need any help - ask away! I guess we can use this Github issue as a central discussion area. I'll see if I have some time on GPTQ - probably next week ish - I'm trying to work on some other stuff currently.

Again thanks!

jeromeku commented 7 months ago

@danielhanchen

Trying to understand design decisions / coding style of the library.

What is the purpose of patching {Mistral, Llama}_fast_forward when initializing Mistral (pre_patch)? It seems you are extracting sections directly from the original HF implementations of these layers (which already support flash-attn2) and in some cases using xformers for some of the ops.

Why the use of pass after every function? This is (AFAIK) a rather unconventional python coding style?

danielhanchen commented 7 months ago

@jeromeku pre_patch essentially just patches some portions of each function to call their relevant efficient implementation - i.e., as you mentioned, some xformers and some FA2.

Oh ye, sorry about my coding style - I come from a C++ / C background, so I generally like all functions / if / for loops etc to be "enclosed" to make it "look" compartmentalized.

But you can have whatever coding style you like - e.g. I like spaces around the equals in variable assignments, whilst the general style is var=2 and not var = 2. It definitely comes from my C background!!

If you're contributing code - I don't mind on style - that's the least of worries! :)) You can use any style you desire - it just has to work :)

jeromeku commented 7 months ago

@danielhanchen

Any tools / tests you use to check the correctness of gradient implementations?

danielhanchen commented 7 months ago

@jeromeku Oh lol, what I do is get HF to do the training, copy-paste the training losses to Google Sheets, then, with your updated gradient implementation, check whether the new training loss is mostly identical.

Another approach is to use torch.dist or torch.allclose on W.grad and new_W.grad to confirm the gradients. You'll have to do e.g. loss.backward(Y) to get the gradients.
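A minimal sketch of that kind of check (toy fp32 vs fp16 Linear layers stand in for the reference and rewritten implementations; tolerances are arbitrary):

import torch
import torch.nn as nn

torch.manual_seed(0)
ref  = nn.Linear(64, 64).cuda()                  # stand-in for the reference implementation
test = nn.Linear(64, 64).cuda().half()           # stand-in for the rewritten implementation
test.load_state_dict(ref.state_dict())           # same weights, different precision

x_ref  = torch.randn(4, 64, device="cuda", requires_grad=True)
x_test = x_ref.detach().clone().half().requires_grad_(True)
dY = torch.randn(4, 64, device="cuda")           # fake upstream gradient

ref(x_ref).backward(dY)                          # i.e. out.backward(dY)
test(x_test).backward(dY.half())

print(torch.dist(ref.weight.grad, test.weight.grad.float()))       # L2 distance of weight grads
print(torch.allclose(x_ref.grad, x_test.grad.float(), atol=1e-2))  # input grads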

jeromeku commented 7 months ago

@danielhanchen

Ok, was wondering if there was a more efficient way to do this verification. I was trying to use torch.autograd.gradcheck, but it runs into issues with large inputs / outputs and mixed precision, since it needs to realize the full VJP during the numerical / analytical gradient calc.

I've adapted GPTQ code to re-implement fast_lora custom fwd / bwd and should have the rest done by early next week.

A minimal way to check the gradient is being calculated correctly -- akin to a unit test -- without having to do a training run would be a worthwhile effort both for existing and future implementations.

danielhanchen commented 7 months ago

@jeromeku Actually I did technically make some functions to check gradients somewhere - I manually made some random inputs and some random outputs, then backpropagated with outputs.backward(...), and checked every item's .grad to confirm it - I just need to find where I wrote it :))

jeromeku commented 7 months ago

@danielhanchen

I wrote a small test script to do gradient checking:

import torch
from datasets import load_dataset

# 4bit pre quantized models we support for 4x faster downloading!
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from torch.utils.data import DataLoader

from unsloth import FastLanguageModel

DTYPE = torch.float16

def get_model(
    model_id="unsloth/mistral-7b-bnb-4bit",
    reference=True,
    max_seq_length=2048,
    dtype=torch.float16,
    load_in_4bit=True,
    init_lora_weights=False,
    upcast=True,
):
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_id,
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )

    lora_config = LoraConfig(
        r=16,
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
        ],
        lora_alpha=16,
        lora_dropout=0,
        bias="none",
        task_type="CAUSAL_LM",
        init_lora_weights=init_lora_weights,
    )

    if reference:
        model = prepare_model_for_kbit_training(
            model,
            use_gradient_checkpointing=True,
            gradient_checkpointing_kwargs={"use_reentrant": True},
        )
        model = get_peft_model(model, lora_config)
    else:
        config = lora_config.to_dict()
        del config["task_type"]
        model = FastLanguageModel.get_peft_model(
            model,
            use_gradient_checkpointing=True,
            random_state=3407,
            max_seq_length=max_seq_length,
            upcast=upcast,
            **config,
        )

    return model, tokenizer

ref_model, _ = get_model(dtype=DTYPE)
test_model, _ = get_model(dtype=DTYPE, reference=False)

def check_grad(model, dtype, seed=0, scale=1):
    wrapped_model = model.model.model
    embed_layer = wrapped_model.embed_tokens
    self_attn = wrapped_model.layers[0].self_attn
    mlp = wrapped_model.layers[0].mlp
    torch.manual_seed(seed)

    with torch.autocast(device_type="cuda", dtype=dtype):
        # embeddings = embed_layer(inputs)

        embeddings = torch.randn(
            1, 1, embed_layer.weight.shape[1], dtype=dtype, requires_grad=True
        ).cuda()
        print(f"Attention input dtype: {embeddings.dtype}")
        attn_out, *_ = self_attn(embeddings)
        print(f"Attn out dtype: {attn_out.dtype}")
        mlp_out = mlp(attn_out)

        torch.manual_seed(seed)
        fake_grad_output = scale * torch.randn(mlp_out.shape, dtype=torch.float32).to(
            mlp_out.device
        )
        mlp_out.backward(fake_grad_output)

    return mlp_out, mlp, attn_out, fake_grad_output

mlp_out_ref, mlp_ref, attn_out_ref, fake_grad_ref = check_grad(ref_model, dtype=DTYPE)
print(
    "Grad check after reference backwards:",
    test_model.model.model.layers[0].mlp.down_proj.lora_B.default.weight.grad,
)
mlp_out, mlp, attn_out, fake_grad = check_grad(test_model, dtype=DTYPE)

ref_type = torch.float32
print()
print(
    f"Checking fake grad (dY): {torch.allclose(fake_grad.to(ref_type), fake_grad_ref.to(ref_type))}"
)
# torch.max(torch.abs(fake_grad.to(ref_type) - fake_grad_ref.to(ref_type)))
# torch.allclose(mlp_out.to(ref_type), mlp_out_ref.to(ref_type))

print(f"Checking mlp grads:")
for (n1, m1), (n2, m2) in zip(mlp.named_parameters(), mlp_ref.named_parameters()):
    if "lora" in n1 and "lora" in n2:
        n1 = ".".join(n1.split(".")[:2])
        print(f"{n1}")
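        # Note: the value printed as UNSLOTH below is m1.grad.max(), while REF is m2.grad.mean()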
        print(
            f"Mean grad:\n  UNSLOTH: {m1.grad.max():.10f}\n  REF: {m2.grad.mean():.10f}\nMax abs diff: {torch.max(torch.abs(m1.grad - m2.grad)):.10f}\nMean abs diff: {torch.mean(torch.abs(m1.grad - m2.grad)):.10f}"
        )
        print()

print("Checking attn grads:")
for (n1, m1), (n2, m2) in zip(
    ref_model.model.model.layers[0].self_attn.named_parameters(),
    test_model.model.model.layers[0].self_attn.named_parameters(),
):
    if "lora" in n1 and "lora" in n2:
        # torch.allclose(m1.grad.to(dtype), m2.grad.to(dtype))
        n1 = ".".join(n1.split(".")[:2])
        print(f"{n1}")
        print(
            f"Mean grad:\n  UNSLOTH: {m1.grad.max():.10f}\n  REF: {m2.grad.max():.10f}\nMax abs diff: {torch.max(torch.abs(m1.grad - m2.grad)):.10f}\nMean abs diff: {torch.mean(torch.abs(m1.grad - m2.grad)):.10f}"
        )
        print()

Note: there are small inconsistencies between prepare_model_for_kbit_training in unsloth vs. huggingface peft when doing QLoRA fine-tuning -- peft upcasts all non-INT-8 params to fp32 -- see here.

I added an upcast kwarg to unsloth FastLanguageModel.get_peft_model that is passed to prepare_model_for_kbit_training to replicate this behavior:

def prepare_model_for_kbit_training(
    model: Any,
    use_gradient_checkpointing: bool = True,
    use_reentrant: Optional[bool] = True,
    upcast=False,
) -> Any:
    """
    Calculates where to place the gradient checkpoints given n_layers.
    We also freeze all other layers' gradients

    Args:
        model: Any LlamaModel with layers.
        use_gradient_checkpointing (`bool`, *optional*):
            Default enabled. Provides memory savings by not saving all activations,
            but only some.
        use_reentrant (`bool`, *optional*):
            https://github.com/pytorch/pytorch/blob/main/torch/utils/checkpoint.py#L354
            Optimal gradient checkpointing algorithm which will be the default in
            future Pytorch versions.
        upcast (`bool`, *optional*):
            Whether to upcast remaining fp16 / bf16 parameters to fp32, mirroring
            peft's prepare_model_for_kbit_training.
    """

    # Freeze all parameters
    for param in model.parameters():
        param.requires_grad_(False)

    # Cast non INT8 parameters to fp32
    if upcast:
        for param in model.parameters():
            if (param.dtype == torch.float16) or (param.dtype == torch.bfloat16):
                param.data = param.data.to(torch.float32)

    if use_gradient_checkpointing:
        model.gradient_checkpointing_enable()

    # If use_reentrant = True which is the Pytorch default, we just make the input requires_grad.
    if use_reentrant:
        if hasattr(model, "enable_input_require_grads"):
            model.enable_input_require_grads()
        else:

            def make_inputs_require_grad(module, input, output):
                output.requires_grad_(True)

            model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

    return model

Here is the output from running the above script:

Checking mlp grads:
gate_proj.lora_A
Mean grad:
  UNSLOTH: 0.0441589355
  REF: 0.0000020351
Max abs diff: 0.1207160950
Mean abs diff: 0.0097856047

gate_proj.lora_B
Mean grad:
  UNSLOTH: 0.0051155090
  REF: 0.0000001698
Max abs diff: 0.0086461902
Mean abs diff: 0.0002924677

up_proj.lora_A
Mean grad:
  UNSLOTH: 0.0850219727
  REF: -0.0000299520
Max abs diff: 0.1020736694
Mean abs diff: 0.0135316616

up_proj.lora_B
Mean grad:
  UNSLOTH: 0.0048866272
  REF: -0.0000000757
Max abs diff: 0.0068296790
Mean abs diff: 0.0002973406

down_proj.lora_A
Mean grad:
  UNSLOTH: 0.0928344727
  REF: -0.0000352956
Max abs diff: 0.2047328949
Mean abs diff: 0.0073212739

down_proj.lora_B
Mean grad:
  UNSLOTH: 0.0037288666
  REF: 0.0000003116
Max abs diff: 0.0040407181
Mean abs diff: 0.0002820148

Checking attn grads:
q_proj.lora_A
Mean grad:
  UNSLOTH: -0.0000000000
  REF: -0.0000000000
Max abs diff: 0.0000000000
Mean abs diff: 0.0000000000

q_proj.lora_B
Mean grad:
  UNSLOTH: 0.0000000000
  REF: -0.0000000000
Max abs diff: 0.0000000000
Mean abs diff: 0.0000000000

k_proj.lora_A
Mean grad:
  UNSLOTH: -0.0000000000
  REF: -0.0000000000
Max abs diff: 0.0000000000
Mean abs diff: 0.0000000000

k_proj.lora_B
Mean grad:
  UNSLOTH: -0.0000000000
  REF: 0.0000000000
Max abs diff: 0.0000000000
Mean abs diff: 0.0000000000

v_proj.lora_A
Mean grad:
  UNSLOTH: 0.1055297852
  REF: 0.1329345703
Max abs diff: 0.1655731201
Mean abs diff: 0.0144135132

v_proj.lora_B
Mean grad:
  UNSLOTH: 0.0139694214
  REF: 0.0166625977
Max abs diff: 0.0193632841
Mean abs diff: 0.0024413881

o_proj.lora_A
Mean grad:
  UNSLOTH: 0.1630859375
  REF: 0.1149902344
Max abs diff: 0.1842651367
Mean abs diff: 0.0191203523

o_proj.lora_B
Mean grad:
  UNSLOTH: 0.0102157593
  REF: 0.0053596497
Max abs diff: 0.0119572878
Mean abs diff: 0.0010805393

Thoughts?

danielhanchen commented 7 months ago

@jeromeku Great work! Some pointers:

torch.manual_seed sadly does not actually work on GPUs - torch.cuda.manual_seed is the one you want!!

torch.randn can also take device = "cuda" - so I guess my first point about manual_seed is irrelevant since you're copying from CPU to GPU
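For example (illustrative only; the hidden size here is just an example):

import torch

torch.cuda.manual_seed(3407)   # explicitly seed the GPU generator
x = torch.randn(1, 1, 4096, dtype=torch.float16, device="cuda", requires_grad=True)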

Yep one issue is the upcasting to float32 which is one of the optimizations we found for VRAM reduction.

You can see there are error differences - mainly due to Flash Attention - Pytorch does Q @ K.T and other attention ops in float16, whilst FA upcasts internally to fp32, which makes it more equivalent to full float32 training - hence the error differences.
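A quick way to see the size of that effect (random Q / K only, not tied to any model or to Unsloth's kernels):

import torch

torch.manual_seed(0)
q = torch.randn(1, 8, 128, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 8, 128, 64, dtype=torch.float16, device="cuda")

scores_fp16 = (q @ k.transpose(-1, -2)) / 64 ** 0.5                  # computed / stored in fp16
scores_fp32 = (q.float() @ k.float().transpose(-1, -2)) / 64 ** 0.5  # full fp32 reference

print((scores_fp16.float() - scores_fp32).abs().max())               # nonzero rounding gap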

I think the reference model you used does not have FA enabled.

But ye - great work again - super useful script :)))

jeromeku commented 7 months ago

@danielhanchen

What do you consider permissible range of gradient discrepancies between the unsloth and the reference HF implementation?

I.e., there are differences (e.g., up_proj) that are on the same order of magnitude as the mean grads themselves -- can this be chalked up to the use of f32 vs f16...

danielhanchen commented 7 months ago

@jeromeku Yeah, that's one of the issues I found as well when verifying Unsloth vs normal HF - that's why, for now, I opted to just compare training losses directly.

jeromeku commented 7 months ago

@danielhanchen

Just wanted to give a quick update:

danielhanchen commented 7 months ago

@jeromeku Super great work! Are you testing it on a Tesla T4 or an Ampere-based GPU? I found Triton kernels on older GPUs to be noticeably slower.

Also, I found through experimentation that instead of writing one fully fused kernel for the matrix multiply and dequantization, it's better to split it into two. The dequant step should only take 1-2ms, whilst the matrix multiply takes 30ms or so. The compiler can get "confused" on the dequant steps, causing it not to optimize correctly, so I found using torch.matmul for the matrix multiply to be most effective.
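A rough sketch of that split (function and argument names here are illustrative, not Unsloth's actual API):

import torch

def lora_forward_split(X, W_quant, quant_state, dequantize_fn, A=None, B=None, s=1.0):
    W = dequantize_fn(W_quant, quant_state)   # small dedicated dequant kernel (~1-2 ms)
    out = torch.matmul(X, W)                  # leave the big GEMM to cuBLAS / Tensor Cores
    if A is not None and B is not None:
        out = out + (X @ A.t() @ B.t()) * s   # LoRA path, scaled by alpha / r
    return out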

jeromeku commented 7 months ago

@danielhanchen I've been testing on an Ampere-based GPU (A6000).

danielhanchen commented 7 months ago

@jeromeku Oh ok cool! If I have to guess, it's that NVCC / the Triton compiler is not optimizing "properly" - also, did you use the matmul Triton autotuner? It could be that, maybe?

jeromeku commented 7 months ago

@danielhanchen Yes - I used a custom autotuner that is essentially the same as the default Triton matmul autotuner. Without the autotuner, performance is even worse.

danielhanchen commented 7 months ago

@jeromeku Ohh ok, interesting - I'm just guessing that somewhere the compiler is not optimizing the dequantization parts properly.

jeromeku commented 7 months ago

Did some preliminary profiling using torch.profiler of 4 implementations:

All were 4-bit Mistral models ("TheBloke/Mistral-7B-v0.1-GPTQ" and "unsloth/mistral-7b-bnb-4bit") running a sample batch of data for 10 iterations (5 warmup, 5 active) using float16 as torch.autocast dtype.
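A minimal torch.profiler setup along those lines (a sketch only; a small Linear stands in for the actual models, with 5 warmup + 5 active steps as above):

import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, schedule

model = nn.Linear(4096, 4096).cuda().half()                   # stand-in for the real model
x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=0, warmup=5, active=5),
) as prof:
    for _ in range(10):
        model(x).sum().backward()
        prof.step()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))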

Summary results, sorted by CUDA time:

It seems the custom LoRA layers of my GPTQ implementation are not as efficient as the existing bitsandbytes fast_lora implementations.

Will open a draft PR with the profiling script and documentation, along with the current fast_lora GPTQ implementation, once cleaned up.

danielhanchen commented 7 months ago

@jeromeku LOVEE the detailed profiling!!! Just love it!! Great work again. Interesting - so the Unsloth BnB kernels run in around 3.34s whilst HF's GPTQ runs in 6.2s. HF GPTQ with your Triton patch is 8-ish seconds, and Unsloth with your Triton patch is 6.8 seconds.

Very interesting results! Did you manage to test a GPTQ dequantize-only kernel, but with Unsloth? I can see that in Unsloth, matrix multiplies are taking 26% of all time, whilst GPTQ is 13%, Unsloth Triton is 3% (looks like overhead?), and HF + Triton is 1.5%. The goal is to move the majority of the time over to matrix multiplies in order to leverage the GPU's Tensor Cores :))

But anyways I love the table and results and fabulous work!

jeromeku commented 7 months ago

@danielhanchen

Yes -- there seem to be some overhead issues with the Unsloth Triton quant / dequant kernels.

Just opened a draft PR with the changes.