mobiusml / hqq

Official implementation of Half-Quadratic Quantization (HQQ)
https://mobiusml.github.io/hqq_blog/
Apache License 2.0

Load saved quant to continue training #31

Closed · Sneakr closed this issue 7 months ago

Sneakr commented 7 months ago

Thanks for your efforts , looking forward to new updates.

I'm experimenting with ternary quantization and 2-bit. Using the fine-tuning example file, I saved the quant successfully into these files: qmodel.pt and config.json

###########################################################
trainer = SFTTrainer(
    model=WrappedModel(model),
    tokenizer=tokenizer,
    max_seq_length=max_tokens,
    train_dataset=dataset,
    eval_dataset=None,
    peft_config=None,
    args=training_args,
    packing=True,
    neftune_noise_alpha=5,
    #data_collator=data_collator,
)

model.is_parallelizable       = False
trainer.is_model_parallel     = False
trainer.place_model_on_device = False

print('perplexity', compute_perplexity_batched(model=model, tokenizer=tokenizer,
                                               predictions=[s['text'] for s in dataset_val],
                                               batch_size=1, max_length=max_tokens))

model.train()
trainer.train()

save_dir = './quantized_models/mistral7b02instruct'
print(f"Saving into {save_dir}")
model.save_quantized(save_dir)

###########################################################

However, when I reload the quant to continue fine-tuning with the same arguments and SFTTrainer, I get this error:

Traceback (most recent call last):
  File "/home/abc/workspace/distill.py", line 29, in <module>
    model = HQQModelForCausalLM.from_quantized(save_dir)
  File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/hqq/engine/base.py", line 86, in from_quantized
    model = cls._get_hqq_class(arch_key).from_quantized(
  File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/hqq/models/base.py", line 364, in from_quantized
    cls.patch_model(
  File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/hqq/models/base.py", line 160, in patch_model
    cls.patch_linearlayers(model, patch_linear_fct, patch_params, verbose=verbose)
  File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/hqq/models/hf/mistral.py", line 44, in patch_linearlayers
    layers[i].self_attn.q_proj = patch_fct(
  File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/hqq/models/base.py", line 337, in _load_module
    return module.to(device=device, dtype=compute_dtype, non_blocking=True)
  File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/home/abc/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!

###########################################################

Here's the code I modified in your example; you can see how I load the model:

###########################################################
hf_auth    = None    #HuggingFace token
cache_path = 'cache' #cache directory to store data
device     = 'cuda:0'

#Choose a model
model_id  = "mistralai/Mistral-7B-Instruct-v0.2"
save_dir  = './quantized_models/mistral7b02instruct'
#model_id = "meta-llama/Llama-2-13b-hf"
#model_id = "meta-llama/Llama-2-70b-hf"

#HQQ Quantize
######################################################################################
######################################################################################
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
import torch
from hqq.core.quantize import *
import gc

#model     = HQQModelForCausalLM.from_quantized(save_dir, device='cuda')
#tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)

#train_dtype = torch.float32
#model       = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=train_dtype)

model     = HQQModelForCausalLM.from_quantized(save_dir)
tokenizer = AutoTokenizer.from_pretrained(model_id)

#quant_config = BaseQuantizeConfig(nbits=2, group_size=16, quant_scale=False, quant_zero=False)
#model.quantize_model(quant_config=quant_config)
#model.to('cuda')

mobicham commented 7 months ago

Hi @Sneakr

I thought the SFTTrainer was broken with HQQ, good to see that it works now!

save_quantized / from_quantized only save/load the quantized model. The low-rank adapters are saved and loaded separately: https://github.com/mobiusml/hqq/?tab=readme-ov-file#peft-training
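
In other words, the quantized base and the adapters go through separate paths. A rough sketch following that README section (the LoRA hyper-parameters and the lora_weights.pt filename below are just placeholders; see the linked example for the exact config):

import torch
from hqq.core.peft import PeftUtils

# Placeholder LoRA config in the format used by the README's PEFT-training example
base_lora_params = {'lora_type': 'default', 'r': 16, 'lora_alpha': 16,
                    'dropout': 0.05, 'train_dtype': torch.float32}
lora_params = {'self_attn.q_proj': base_lora_params,
               'self_attn.k_proj': base_lora_params,
               'self_attn.v_proj': base_lora_params,
               'self_attn.o_proj': base_lora_params,
               'mlp.gate_proj': None, 'mlp.up_proj': None, 'mlp.down_proj': None}

PeftUtils.add_lora(model, lora_params)          # attach adapters to the quantized model
# ... train with your trainer ...
PeftUtils.cast_lora_weights(model, torch.half)
PeftUtils.save_lora_weights(model, 'lora_weights.pt')   # adapters only

# Later: reload the quantized base with from_quantized(), then the adapters
PeftUtils.load_lora_weights(model, 'lora_weights.pt')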

Sneakr commented 7 months ago

Yes, it seems to work well!

Thank you! I'm looking forward to you possibly implementing ternary BitNet among the quant options. I assume we could use the 2-bit format to hold the values, but restrict them to ternary values so it stays compatible with inference and hardware. I'm relatively new to training LLMs and trying to test some theories of mine. Appreciate your efforts, thanks again!

mobicham commented 7 months ago

Great, happy to help! Yeah, we could use 2-bit packing for ternary, but as [0,1,2] values rather than [-1,0,1]. The issue is that bit-packing negative values with int8/uint8 is a bit tricky, but using [0,1,2] and adapting the zero-point is equivalent to [-1,0,1], since it's just a shift by -1.
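
For what it's worth, here's a rough illustration of that idea in plain PyTorch (not HQQ's actual bit-packing code): shift {-1,0,1} to {0,1,2}, pack four 2-bit codes per byte, and undo the shift with a zero-point of 1 at dequantization.

import torch

def pack_ternary(w: torch.Tensor) -> torch.Tensor:
    # w holds values in {-1, 0, 1}; shift to {0, 1, 2} so everything is unsigned
    codes = (w + 1).to(torch.uint8).flatten()
    pad = (-codes.numel()) % 4
    if pad:
        codes = torch.cat([codes, torch.zeros(pad, dtype=torch.uint8)])
    codes = codes.view(-1, 4)
    # four 2-bit codes per byte
    return codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)

def unpack_ternary(packed: torch.Tensor, numel: int) -> torch.Tensor:
    codes = torch.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], dim=1).flatten()[:numel]
    zero_point = 1
    return codes.to(torch.int8) - zero_point  # back to {-1, 0, 1}

w = torch.randint(-1, 2, (8, 8))
assert torch.equal(unpack_ternary(pack_ternary(w), w.numel()).view(8, 8), w.to(torch.int8))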

Sneakr commented 7 months ago

Sorry for bumping this up, but I'm trying to get some clarification; any pointer in any direction would be appreciated.

When I evaluate the 2-bit (as well as 4-bit) models I get great results using HQQ quantization. I use an 8-bit quant config for the attention heads.

However, if I add the LoRA parameters and run one small training set, the eval numbers decrease significantly, regardless of my dataset and training.

I noticed, however, that when I fine-tune on a math dataset, the evaluation numbers go back up slightly for GSM8K, for example, but they are still far below the pre-fine-tuning results.

I think this might have something to do with precision loss while fine-tuning a quant (?), but I'm only guessing. Do you have any pointers or hints as to what the issue could be? Or would you simply suggest avoiding fine-tuning an HQQ quant? The same goes for 4-bit and so on: although it starts with higher evaluation results than 2-bit, the drop is significant there too as soon as training starts.

I tested with the SFTTrainer without LoRA; the drop was not as big, but it was still significant. Adjusting the LoRA parameters had some impact as well. Here are the tests I did for one eval and a dataset with 1000 entries:

All numbers are winogrande accuracy from wandb (winogrande/acc, winogrande/acc_stderr), each taken after fine-tuning the quantized model with only the listed layer active in the LoRA parameters:

| LoRA target layer | winogrande/acc | acc_stderr |
| --- | --- | --- |
| None (2-bit quant with 8-bit head, no fine-tuning) | 0.69771 | 0.01291 |
| self_attn.q_proj | 0.61089 | 0.0137 |
| self_attn.k_proj | 0.64009 | 0.01349 |
| self_attn.v_proj | 0.49961 | 0.01405 |
| self_attn.o_proj | 0.57301 | 0.0139 |
| mlp.gate_proj | 0.65272 | 0.01338 |
| mlp.up_proj | 0.64088 | 0.01348 |
| mlp.down_proj | 0.5438 | 0.014 |

mobicham commented 7 months ago

Hey, sure no problem, happy to help:

HellaSwag (10-shot): 73.82
MMLU (5-shot): 43.13
TruthfulQA-MC2 (0-shot): 41.5
Winogrande (5-shot): 70.56
GSM8K (5-shot): 29.11
Average: 50.88

Otherwise, can you share a snippet of the training code (you can hide the data and model), just to check whether everything looks OK? I am using my own custom training code (no SFTTrainer), so maybe SFTTrainer is doing something to the model. I know, for example, that it casts normalization layers to fp32, and it does some other things that would need to be confirmed.
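
If it helps, a quick way to check what the trainer changes is to dump the parameter dtypes before and after handing the model to SFTTrainer (plain PyTorch, nothing HQQ-specific; dtype_report is just an ad-hoc helper):

from collections import Counter

def dtype_report(model):
    # Count the dtypes of all parameters and flag which ones are trainable
    counts    = Counter(p.dtype for p in model.parameters())
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    norms     = {n: m.weight.dtype for n, m in model.named_modules()
                 if 'norm' in type(m).__name__.lower() and getattr(m, 'weight', None) is not None}
    print('param dtypes     :', dict(counts))
    print('trainable params :', len(trainable))
    print('norm layer dtypes:', set(norms.values()))

dtype_report(model)   # call once before and once after building the SFTTrainer, then diff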

Sneakr commented 7 months ago

I got it working now following your tips, getting good results!

One sidenote: I installed https://github.com/bigcode-project/bigcode-evaluation-harness to be able to run HumanEval and so on. It crashed on HQQ because the HQQModelForCausalLM class returned the function instead of the model; I refactored it to use self.hqq_quantized instead of quantized, and it works now. Hope I didn't break anything, I'm not a Python coder, but this works so far. Thanks!

    @classmethod
    def _make_quantizable(cls, model, quantized: bool) -> None:
        # Track the quantized state on the model itself instead of returning a function
        model.hqq_quantized = quantized
        model.arch_key = model.config.architectures[0]
        model.quantize_model = lambda quant_config, compute_dtype=torch.float16, device="cuda": cls.quantize_model_(
            model=model, quant_config=quant_config, compute_dtype=compute_dtype, device=device
        )
        model.save_quantized = lambda save_dir: cls.save_quantized_(model=model, save_dir=save_dir)
        model.base_class = cls._get_hqq_class(model)

        def _quantized_method(method_name):
            # Make .cuda()/.to()/.float()/.half() no-ops once the model is quantized,
            # otherwise fall back to the regular implementation
            def _wrapper(self, *args, **kwargs):
                if self.hqq_quantized:
                    return self
                else:
                    return getattr(super(type(self), self), method_name)(*args, **kwargs)
            return _wrapper

        model.cuda = _quantized_method('cuda').__get__(model)
        model.to = _quantized_method('to').__get__(model)
        model.float = _quantized_method('float').__get__(model)
        model.half = _quantized_method('half').__get__(model)

mobicham commented 7 months ago

Thanks for the tip! That HQQModelForCausalLM class is a bit of a mess and needs some refactoring; better to use https://github.com/mobiusml/hqq/?tab=readme-ov-file#auto-mode-1 for Hugging Face models. The reason it exists is to support different engines (HF, timm, vLLM), but now that everyone is using just HF, and vLLM would need a separate branch anyway, it's just redundant, messy code.

Sneakr commented 7 months ago

I'm sorry for taking your time, but I have one final question. I have managed to quantize Mixtral to 2-bit with a 4-bit attention head, and I got slightly better results than your quantized Mixtral because I started from a fine-tuned one.

I'm trying to save the quant as a safetensors file, but it does not seem possible? The output is a single file of only around 500 MB, and it does not contain the weights. Is it not possible to save it as a safetensors file if it's quantized?

I looked at AWQ and tried to implement their method into your package:


import os
import json
import logging

import torch
import torch.nn as nn
from safetensors.torch import save_file, load_file
from transformers.modeling_utils import shard_checkpoint

# MixtralPatch / BaseHQQHFModel come from hqq's Mixtral model definitions

class MixtralHQQ(MixtralPatch, BaseHQQHFModel):
    def save_quantized(self, save_dir: str, safetensors: bool = True, shard_size: str = "5GB"):
        save_dir = save_dir[:-1] if save_dir[-1] == "/" else save_dir

        # Save model
        class EmptyModule(nn.Module):
            def __init__(self):
                super(EmptyModule, self).__init__()

            def forward(self, x):
                return x

        # Save model and config files with empty state dict
        self.config.save_pretrained(save_dir, state_dict=EmptyModule().state_dict())

        # Remove empty state dict
        default_paths = [
            f"{save_dir}/model.safetensors",
            f"{save_dir}/pytorch_model.bin",
        ]
        for path in default_paths:
            if os.path.exists(path):
                os.remove(path)

        # model_name has no extension, add it when saving state_dict
        model_name = "model.safetensors" if safetensors else "pytorch_model.bin"

        logging.info(f"Total model size before saving: {sum([param.nelement() for param in self.parameters()])} elements")

        # shard checkpoint into chunks (10GB default)
        shards, index = shard_checkpoint(
            self.state_dict(), max_shard_size=shard_size, weights_name=model_name
        )
        print(f"Saving model to {save_dir} in {len(shards)} shards")

        logging.info(f"Saving model to {save_dir} in {len(shards)} shards")
        for shard_file, shard in shards.items():
            if safetensors:
                original_shard_size = {k: v.nelement() for k, v in shard.items()}
                # safetensors must be in the same memory, so we duplicate and use contiguous memory
                shard = {k: v.clone().contiguous() for k, v in shard.items()}
                save_file(
                    shard, os.path.join(save_dir, shard_file), metadata={"format": "pt"}
                )
                print(f"Saved {shard_file} with safetensors")
                reloaded_shard = load_file(os.path.join(save_dir, shard_file))  # Use appropriate SafeTensors load function
                reloaded_shard_size = {k: v.nelement() for k, v in reloaded_shard.items()}
                assert original_shard_size == reloaded_shard_size, "Mismatch in shard sizes before and after saving with SafeTensors"
            else:
                torch.save(shard, os.path.join(save_dir, shard_file))
                print(f"Saved {shard_file}")

        # save shard index
        if index is not None:
            with open(f"{save_dir}/{model_name}.index.json", "w+") as file:
                file.write(json.dumps(index, indent=4))
    pass

This did not make any difference; I still see only one file at the same size, and it does not contain the weights, not even for the regular Mistral model. Do you have any pointer to direct me in the right direction? I'm feeling so close now :) Cheers!

mobicham commented 7 months ago

Hi! You can't use safetensors with HQQ-quantized models, because safetensors doesn't support storing certain types that are needed for HQQ's meta-data. There's also a lot that happens when we load the quantized model, so it's better to use the functions provided in the documentation.

If the architecture is exactly like the original Mixtral, you can do this:

#Save the quantized model
model.save_quantized(model, save_dir=save_dir)

#Load from local directory or Hugging Face Hub on a specific device
model = HQQModelForCausalLM.from_quantized(save_dir_or_hfhub, device='cuda')

Otherwise, you can use Auto-Mode: https://github.com/mobiusml/hqq/?tab=readme-ov-file#auto-mode-1
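
For reference, the Auto-Mode path looks roughly like this (check the linked README section for the exact, current signatures; the 2-bit config below is just an example):

from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

model        = AutoModelForCausalLM.from_pretrained(model_id)
quant_config = BaseQuantizeConfig(nbits=2, group_size=16)

AutoHQQHFModel.quantize_model(model, quant_config=quant_config)  # quantize in place
AutoHQQHFModel.save_quantized(model, save_dir)                   # write the quantized weights + config
model = AutoHQQHFModel.from_quantized(save_dir)                  # reload later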

Sneakr commented 7 months ago

@mobicham Thanks for the reply! Oh, I see now, it's about the meta-data. What would it take for me to work around this and add support for safetensors? Is it even possible with HQQ if I refactor things, or is it completely hopeless to try?

Where could I start? Any pointers in the right direction would be nice. Thanks for your time!

Btw, I saw that Intel has taken your HQQ code into their GitHub compressor package, good work and congrats! :)

Sneakr commented 7 months ago

Update:

I managed to refactor and create something new based on your HQQ. I made my first ever GGUF, only 1.9 GB in file size, from a Mistral 7B:

"mistralai/Mistral-7B-Instruct-v0.2"

I ran winogrande on your 2-bit quant using HQQ (0-shot for both model evals):

q2_config    = BaseQuantizeConfig(nbits=2, group_size=8)
qA_config    = BaseQuantizeConfig(nbits=4, group_size=64)
linear_tags  = HQQModelForCausalLM.get_linear_tags(model)
quant_config = {k: q2_config for k in linear_tags}
quant_config['self_attn.v_proj'] = qA_config

wandb: Run summary:
wandb: winogrande/acc        0.71823
wandb: winogrande/acc_stderr 0.01264
wandb: winogrande/alias      winogrande

Here are my results from the same Mistral model quantized to 2-bit, evaluating the GGUF directly: (screenshot of the winogrande results attached)

Great work! =)

mobicham commented 7 months ago

@Sneakr that's great, really appreciate your work here!

I think it would be great to have a util tool to port HQQ models to GGUF. Maybe we can add it in https://github.com/mobiusml/hqq/blob/master/hqq/models/base.py

Regarding safetensors, it's a bit tricky but possible; there are different ways of doing it that wouldn't break other things. I can tell you more in detail on Discord: https://discord.com/invite/VJcFz5TR
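
To illustrate the constraint (a generic example, not HQQ's actual serialization code): torch.save can pickle nested Python objects, while safetensors only accepts a flat {str: Tensor} dict plus string metadata, so the quantization meta-data would have to be flattened into tensors or strings and rebuilt at load time.

import torch
from safetensors.torch import save_file

# torch.save can pickle arbitrary Python objects next to the packed weights:
state = {'W_q' : torch.randint(0, 255, (4, 4), dtype=torch.uint8),
         'meta': {'nbits': 2, 'group_size': 16, 'shape': (4, 4)}}   # nested dict with ints/tuples
torch.save(state, 'qlayer.pt')

# safetensors only takes a flat {str: Tensor} mapping, so the same meta-data
# would need to be encoded as tensors (or strings) first:
save_file({'W_q'            : state['W_q'],
           'meta.nbits'     : torch.tensor([2]),
           'meta.group_size': torch.tensor([16]),
           'meta.shape'     : torch.tensor([4, 4])},
          'qlayer.safetensors')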