Hm, I used my custom code for training, so I don't know what the differences are vs. peft. Here are the parameters I used (as far as I remember):
compute_dtype=bfloat16
group_size = 8
lora-size 32 for ["q_proj","k_proj","v_proj", "o_proj"], lora size 8 for the rest.
n_epochs: 1
learning rate should probably be higher, maybe ~2e-4 and linearly decreased
axis=0 was actually used in the hqq config
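For reference, those settings written out with hqq's own config objects might look like this (a sketch only, training hyper-parameters omitted; `BaseQuantizeConfig` and the `lora_params` dict format are the same APIs used later in this thread, and the module names follow the standard Llama layer naming):

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig

compute_dtype = torch.bfloat16
quant_config  = BaseQuantizeConfig(nbits=1, group_size=8, axis=0)

# r=32 for the attention projections, r=8 for the MLP projections
atten_lora = {'lora_type': 'default', 'r': 32, 'lora_alpha': 32, 'dropout': 0.05, 'train_dtype': torch.float32}
mlp_lora   = {'lora_type': 'default', 'r': 8,  'lora_alpha': 8,  'dropout': 0.05, 'train_dtype': torch.float32}
lora_params = {
    'self_attn.q_proj': atten_lora, 'self_attn.k_proj': atten_lora,
    'self_attn.v_proj': atten_lora, 'self_attn.o_proj': atten_lora,
    'mlp.gate_proj': mlp_lora, 'mlp.up_proj': mlp_lora, 'mlp.down_proj': mlp_lora,
}
```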
Thank you! I will try it. And how do you evaluate the model after fine-tuning?
https://github.com/mobiusml/hqq/blob/master/examples/llama2_benchmark/eval_model.py#L12
By the way, make sure that the whole model is frozen and only the LoRA layers are trainable
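A quick sanity check for that (a sketch using plain PyTorch, assuming `model` is the quantized base model before and after the LoRA layers are attached):

```python
# Freeze the base model first, then confirm that only LoRA tensors remain trainable
# once the LoRA layers have been added.
for p in model.parameters():
    p.requires_grad = False

# ... add the LoRA layers here (PeftUtils.add_lora / get_peft_model) ...

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(len(trainable), trainable[:4])  # should list only lora_A / lora_B (and LoRA bias) tensors
```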
I don't know how to set 'lora size 32 for ["q_proj","k_proj","v_proj","o_proj"], lora size 8 for the rest', so I set them all to 32. Another question: why set the group size to 8? Does it influence the result?
> Another question: why set the group size to 8? Does it influence the result?
Yes, a lot actually, for 1-bit you need a lower group-size since you lose a lot of accuracy at higher group-sizes.
On a separate note, why do you want to do 1-bit? It's totally useless for modern GPUs, you'd be better off using 2-bit which is also supported for fast inference via the BitBlas backend
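For comparison, a 2-bit setup along those lines might look like this (a sketch using only the APIs shown later in this thread; the axis choice for the fast inference backends is an assumption and worth double-checking against the hqq README):

```python
# 2-bit HQQ config; per the comment above, a 2-bit model can then be served with the
# fast BitBlas backend (the backend setup call itself is not shown in this thread).
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

quant_config = BaseQuantizeConfig(nbits=2, group_size=64, quant_scale=False, quant_zero=False, axis=1)  # axis=1: assumption for the fast backends
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=torch.bfloat16, device='cuda:0')
```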
> Yes, a lot actually, for 1-bit you need a lower group-size since you lose a lot of accuracy at higher group-sizes.
I see. And I find that the model with group-size 8 is 3.2GB and the one with group-size 64 is 6GB.
> On a separate note, why do you want to do 1-bit? It's totally useless for modern GPUs, you'd be better off using 2-bit which is also supported for fast inference via the BitBlas backend.
1-bit models are popular and I want to do some research on low-bit models. Anyway, thanks for your advice 🤗
> I see. And I find that the model with group-size 8 is 3.2GB and the one with group-size 64 is 6GB.
That can't be right: group-size 8 should be larger than group-size 64 in terms of GB, since the meta-data will be larger.
> 1-bit models are popular and I want to do some research on low-bit models. Anyway, thanks for your advice 🤗
I think you are referring to 1.58-bit, not 1-bit. 1.58-bit is basically 2-bit in terms of the inference kernel. 1-bit with HQQ is binary [0, 1], not ternary [-1, 0, 1]. So 1-bit with hqq will be much worse, and 2-bit will be better than ternary, for the same inference speed as 1.58-bit on GPUs.
> That can't be right: group-size 8 should be larger than group-size 64 in terms of GB, since the meta-data will be larger.
I made a mistake. In fact, the model with group-size 8 is 6GB and the one with group-size 64 is 3.2GB.
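Those two numbers line up with a rough back-of-the-envelope estimate (a sketch; the split between ~7B quantizable linear weights and ~1B 16-bit embedding/lm_head weights is an approximation for Llama-3-8B, and scales/zeros are assumed to be stored as one fp16 value each per group):

```python
def approx_size_gb(nbits, group_size, n_quant=7.0e9, n_fp16=1.05e9):
    # each group of `group_size` weights carries one fp16 scale + one fp16 zero -> 32 extra bits
    bits_per_weight = nbits + 32.0 / group_size
    return (n_quant * bits_per_weight / 8 + n_fp16 * 2) / 1e9

print(approx_size_gb(nbits=1, group_size=8))   # ~6.5 GB -> matches the ~6 GB observed
print(approx_size_gb(nbits=1, group_size=64))  # ~3.4 GB -> close to the ~3.2 GB observed
```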
> I think you are referring to 1.58-bit, not 1-bit. 1.58-bit is basically 2-bit in terms of the inference kernel. 1-bit with HQQ is binary [0, 1], not ternary [-1, 0, 1]. So 1-bit with hqq will be much worse, and 2-bit will be better than ternary, for the same inference speed as 1.58-bit on GPUs.
I chose HQQ first because it achieves true 1-bit, [0, 1]. I know about 1.58-bit and appreciate their good work, but unfortunately their paper doesn't provide a code link. So I am first reproducing other results that are easier to obtain, such as llama.cpp at 2-bit. 1.58-bit is also in my plan, and I will compare its results with the others at the end.
Note that 1.58 trains the whole model, while HQQ+ only fine-tunes a fraction of the weights via LoRA, so it's not a direct comparison but I understand your motivation :+1: !
Yes, another repository (https://github.com/xuyuzhuang11/OneBit) also trains the whole model, using DeepSpeed, and is binary [0, 1]. But I only have 16GB x 4, so I get OOM 😭 when training llama3-8b. That is another reason why I use HQQ+.
Sorry to bother you again. I did not use all the datasets mentioned in the blog for fine-tuning, but I still have some questions. 1. What is the difference between AutoHQQHFModel and HQQModelForCausalLM if I want to load the example model? I load the example model successfully with HQQModelForCausalLM but encounter an error:
import torch
from transformers import AutoTokenizer
from hqq.engine.hf import HQQModelForCausalLM

device = torch.device("cuda:0")
model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_1bitgs8_hqq'
tokenizer = AutoTokenizer.from_pretrained(model_id)  # tokenizer load not shown in the original snippet
model = HQQModelForCausalLM.from_quantized(model_id, adapter='adapter_v0.1.lora').to(device)
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)#cuda:0
model.eval()
model.generate(input_ids, max_length=200,do_sample=True)
Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
and loading with AutoHQQHFModel gives a bad result:
model = AutoHQQHFModel.from_quantized(model_id, adapter='adapter_v0.1.lora',device= device)
Model output: Once upon a time- isalt... mejorntilombvaicus,ichistrelysing spring eprintist stalig leche fl mh Chartavel whateverms Out elliht point Dcman uniquem in onewh of fufme fixale Searchandmpyes @end RavoomhHecпоre off Ranptaebahcapa ...
2. I mentioned that the llama3-8b model with config HqqConfig(nbits=1, group_size=8, axis=0, quant_zero=False, quant_scale=False) occupies 6GB of memory, but a 4-bit quantized model occupies approximately 5GB using AWQ and llama.cpp. How can I get a low-bit model which has high accuracy and takes up little memory?
Thank you in advance.🫡
I load the example model successfully with HQQModelForCausalLM but encounter an error: the error happens when generating text.
@mobicham
Hi! Please use AutoHQQHFModel to both save and load, instead of HQQModelForCausalLM, if you are using the newer version.
Regarding the 2 other questions: I think I know what the problem is. In the latest versions of hqq (>v0.1.8) we dropped support for quantized scales/zeros and cpu-offloading. That specific model you are trying to load needs quantized scales/zeros since the group-size is very small. Additionally, you need cpu-offloading for the scales/zeros. While the model on disk looks like 3GB, when you actually run it, it will only use 1.76GB, so I would recommend you check the VRAM usage, not the model size. So if you want to run it, please use
pip install hqq==0.1.8
pip install transformers==4.40
and copy/paste the code from the HF page, I just updated the page to mention that.
Let me know if there's any issue! Below is the output, I just ran it:
In [2]: outputs = chat_processor("What is the solution to x^2 - 1 = 0", max_new_tokens=1000, do_sample=False)
User: What is the solution to x^2 - 1 = 0
Assistant:
The equation $x^2 - 1 = 0$ can be factored as $(x-1)(x+1) = 0$.
You want to find a value of $x$ that makes this true for all values of $x$. This means that either $x=1$ or $-1$, or $x=-1$. So, there are two solutions: $x=\boxed{1}$ and $x=\boxed{-1}$. The answer is: 1
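As a side note on checking VRAM rather than file size (mentioned above), a minimal check is just:

```python
import torch
# GPU memory actually in use after loading/running the model, in GB; this is the number
# to compare against the ~1.76GB figure, not the size of the files on disk.
print(torch.cuda.memory_allocated() / 1e9, torch.cuda.max_memory_allocated() / 1e9)
```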
I see! The version of hqq in my environment is 0.2.0. And I want to know: what is the purpose of quantized scales/zeros?
> That specific model you are trying to load needs quantized scales/zeros since the group-size is very small.
The group-size of the model I am fine-tuning is also very small. Do I need to roll back the hqq version to enable quantized scales/zeros? Actually, after fine-tuning on the wikitext-2-raw-v1 dataset, the perplexity of my model is a bit high (in the thousands). I don't know if that's normal. Thanks for your great work and help. Looking forward to your reply 🫡
Yeah, if you want to use quantized zeros/scales, you need hqq==0.1.8 at most, since after that you can't use quantized scales/zeros. The reason why we dropped support for it in the newer versions is that it made model serialization too complicated, and our priority is to make the models fully compatible with transformers serialization.
But you can train without the quantized scales/zeros, then quantize them later when you save the model; I think that is how I did it, as far as I remember.
Can you share your training code? I can run it locally and try to see what's going on.
Sure. Here is my code. I load the model and dataset from my disk, so you will need to modify some of the code. I hope this doesn't bother you too much.
The model is meta-llama/Meta-Llama-3-8B
transformers == 4.44.0
hqq == 0.2.0
import torch
from alpaca_lora.utils.prompter import Prompter
from datasets import load_from_disk,load_dataset
from hqq.engine.hf import HQQModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.peft import PeftUtils
from hqq.core.quantize import *
from datasets import load_dataset
from transformers import HqqConfig
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
DataCollatorForLanguageModeling,
Trainer,
TrainingArguments,
)
from peft import PeftModel,PeftModelForCausalLM
from peft.utils import SAFETENSORS_WEIGHTS_NAME
from peft import (
LoraConfig,
get_peft_model,
prepare_model_for_kbit_training,
)
model_id="../models/llama3"
adapter_dir = "llama3_hqq_peft/adapters"
model_save_dir = "llama3_hqq_peft/model"
compute_dtype = torch.bfloat16
quant_config = HqqConfig(nbits=1, group_size=8 ,axis=0,quant_zero=False, quant_scale=False)
max_memory={0: "16GiB", 1: "16GiB", 2: "16GiB", 3: "16GiB"}
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",#device
torch_dtype=compute_dtype,
quantization_config=quant_config,
max_memory = max_memory,
)
for param in model.parameters():
    param.requires_grad = False
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
use_dora = False
config = LoraConfig(
r=32,
lora_alpha=32,
target_modules=["q_proj","k_proj","v_proj", "o_proj", "gate_proj","up_proj","down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
use_dora=use_dora,
)
model = prepare_model_for_kbit_training(model) #first train
model = get_peft_model(model, config) #first train
data = load_from_disk("../dataset/wikitext-2-raw-v1")['train']
data = data.shuffle(seed=42)
data = data.map(lambda samples: tokenizer(samples["text"], padding=True), batched=True)
trainer = Trainer(
model=model,
train_dataset=data,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
num_train_epochs=1,
learning_rate=2e-4,
logging_steps=20,#1
output_dir=adapter_dir,
bf16 = True,
save_steps=50,
save_total_limit=5,
resume_from_checkpoint=True,
neftune_noise_alpha=0.1,
),
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.is_parallelizable = False
trainer.is_model_parallel = False
trainer.place_model_on_device = False
model.config.use_cache = False
trainer.train()
model.save_pretrained(adapter_dir)
Something like this, feel free to play with the parameters
#Settings
#pip install hqq==0.1.8
#pip install trl==
#pip install transformers==4.40.0
#OMP_NUM_THREADS=16 CUDA_VISIBLE_DEVICES=0 ipython3
######################################################################################
import torch
cache_path = ''
model_id = "meta-llama/Llama-2-7b-hf"
compute_dtype = torch.bfloat16
device = 'cuda:0'
#HQQ Quantize
######################################################################################
from transformers import AutoModelForCausalLM, AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import *
model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir=cache_path, torch_dtype=compute_dtype, attn_implementation="sdpa")
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_path)
#Quantize the model
from hqq.core.quantize import *
quant_config = BaseQuantizeConfig(nbits=2, group_size=8, quant_scale=False, quant_zero=False, axis=0)
AutoHQQHFModel.setup_model(model)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)
#Add Peft
######################################################################################
from hqq.core.peft import PeftUtils
train_dtype = torch.float32
atten_lora_params = {'lora_type':'default', 'r':32, 'lora_alpha':32, 'dropout':0.05, 'train_dtype':train_dtype, 'train_bias':True}
mlp_lora_params = {'lora_type':'default', 'r':8, 'lora_alpha':8, 'dropout':0.05, 'train_dtype':train_dtype, 'train_bias':True}
lora_params = {'self_attn.q_proj': atten_lora_params,
'self_attn.k_proj': atten_lora_params,
'self_attn.v_proj': atten_lora_params,
'self_attn.o_proj': atten_lora_params,
'mlp.gate_proj' : mlp_lora_params,
'mlp.up_proj' : mlp_lora_params,
'mlp.down_proj' : mlp_lora_params}
#Apply LoRA
PeftUtils.add_lora(model, lora_params)
HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)
model.config.use_cache = False
#Dataset
######################################################################################
from datasets import load_dataset, Dataset
from tqdm import tqdm
import transformers
import numpy as np
import random
tokenizer.pad_token = tokenizer.unk_token #tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.add_bos_token = False
tokenizer.add_eos_token = False
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
#####################################################################################
#Train
from trl import SFTTrainer
grad_acc = 1
logging_st = 1
max_steps = -1
lr = 1e-5 #1e-5 cosine x 2: 5.5009
batch_size = 1
n_epochs = 2
max_tokens = 1024
training_args = transformers.TrainingArguments(
output_dir='.',
per_device_train_batch_size=batch_size,
gradient_accumulation_steps=grad_acc,
learning_rate=lr,
logging_steps=logging_st,
num_train_epochs=n_epochs,
max_steps=max_steps,
remove_unused_columns=False,
bf16=True,
max_grad_norm=1.0,
save_steps=10000000,
lr_scheduler_type= "cosine",
)
#Wrap model to avoid accelerate issues
class WrappedModel(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, *args, **kwargs):
        return self.model.forward(*args, **kwargs)

    def train(self):
        self.model.train()

    def eval(self):
        self.model.eval()

    def parameters(self):
        return self.model.parameters()
trainer = SFTTrainer(
model=WrappedModel(model),
tokenizer=tokenizer,
max_seq_length=max_tokens,
train_dataset=dataset,
eval_dataset=None,
peft_config=None,
args=training_args,
dataset_text_field="text",
packing=True,
)
model.is_parallelizable = False
trainer.is_model_parallel = False
trainer.place_model_on_device = False
model.train()
trainer.train()
# #Prediction/Eval
# ######################################################################################
from datasets import load_dataset
import torch, time
import numpy as np
from tqdm import tqdm
import gc
tokenizer.add_bos_token = True
tokenizer.add_eos_token = False
PeftUtils.cast_lora_weights(model, dtype=compute_dtype)
model.eval()
#Save lora weights
#PeftUtils.save_lora_weights(model, filename)
def cleanup():
    torch.cuda.empty_cache()
    gc.collect()
#Adapted from https://huggingface.co/transformers/v4.2.2/perplexity.html
def eval_wikitext2(model, tokenizer, max_length=1024, stride=512, verbose=True):
    model.eval()
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    tokenizer.add_eos_token = False

    dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    encodings = tokenizer('\n\n'.join(dataset['text']), return_tensors='pt')
    encodings['input_ids'] = encodings['input_ids'].to('cuda')

    lls, t = [], []
    for i in tqdm(range(0, encodings['input_ids'].size(1), stride), disable=not verbose):
        begin_loc = max(i + stride - max_length, 0)
        end_loc = min(i + stride, encodings['input_ids'].size(1))
        trg_len = end_loc - i
        input_ids = encodings['input_ids'][:, begin_loc:end_loc]
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  #ignore context

        t1 = time.time()
        with torch.no_grad():
            log_likelihood = model(input_ids, labels=target_ids).loss * trg_len
        torch.cuda.synchronize()
        t2 = time.time()
        t.append(t2 - t1)

        lls.append(log_likelihood)
        del input_ids, target_ids

    ppl = np.round(float(torch.exp(torch.stack(lls).sum() / end_loc)), 4)
    pred_time = np.round(np.mean(t), 3)
    if verbose:
        print('perplexity', ppl)
        print('time', str(pred_time) + ' sec')

    del encodings
    cleanup()
    return {'perplexity': ppl, 'prediction_time': pred_time}
print('perplexity',eval_wikitext2(model, tokenizer, max_length=1024, stride=512, verbose=True))
Thank you. I will try it🤗
Hi @mobicham, I successfully fine-tuned the llama3-8b model and achieved great results with my code. Here is the result:
perplexity 23.0261
time 1.063 sec
{'perplexity': 23.0261, 'prediction_time': 1.063}
And then I found that using different methods to load the model results in different perplexities. The first method is:
quant_config = HqqConfig(nbits=1, group_size=8 ,axis=0,quant_zero=False, quant_scale=False)
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",#device
torch_dtype=compute_dtype,
quantization_config=quant_config,
)
model = PeftModel.from_pretrained(model, adapter_1,adapter_name= "adapter_1")
model.set_adapter("adapter_1")
eval_wikitext2(model,tokenizer)
{'perplexity': 23.0261, 'prediction_time': 1.063}
The second method is:
device = "cuda:0"
quant_config = HqqConfig(nbits=1, group_size=8 ,axis=0,quant_zero=False, quant_scale=False)
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
torch_dtype=compute_dtype,
quantization_config=quant_config,
)
AutoHQQHFModel.save_quantized(model,model_save_dir)
del model
new_model = AutoHQQHFModel.from_quantized(model_save_dir,device = device)
new_model = PeftModel.from_pretrained(new_model, adapter_1,adapter_name= "adapter_1")
new_model.set_adapter("adapter_1")
eval_wikitext2(new_model,tokenizer,data)
{'perplexity': 9823.5264, 'prediction_time': 0.97}
Two questions:
1. Is it because AutoHQQHFModel cannot be used to load a model quantized via AutoModelForCausalLM when loading the LoRA adapter? And I found the reason why the perplexity was high before: every time I evaluated perplexity, I used AutoHQQHFModel to load the quantized model instead of using AutoModelForCausalLM to quantize the original model.
2. How do I correctly save the model quantized via AutoModelForCausalLM and load the model with the adapter? Or I can just roll back the version and use your code. It seems that the transformers and hqq libraries do not work well together.
I found your great work in this comment: https://github.com/mobiusml/hqq/pull/93#issuecomment-2230466271 Can it be used to save/load an HQQ-quantized model with HF transformers now? By the way, maintaining a repo is really hard work, I think.
I would recommend you use hqq's PeftUtils (https://github.com/mobiusml/hqq/tree/master?tab=readme-ov-file#peft-training) instead of transformers peft.
1- After training, use PeftUtils.save_lora_weights to save the adapters.
2- You can then either load the quantized model via AutoHQQHFModel.from_quantized or quantize on the fly.
3- After the quantized model is loaded, you can use PeftUtils.load_lora_weights.
If you do it this way, it should work without any problem (see the sketch below). That's how I have been doing it for all the models.
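Put together, that flow might look like this (a sketch based on the function names above; `lora_params`, `model_save_dir` and the weights filename are placeholders, and re-adding the LoRA modules before loading their weights is an assumption):

```python
import torch
from hqq.core.peft import PeftUtils
from hqq.models.hf.base import AutoHQQHFModel

# --- training ---
PeftUtils.add_lora(model, lora_params)                  # attach LoRA to the quantized model
# ... run the trainer ...
PeftUtils.save_lora_weights(model, 'lora_weights.pt')   # step 1

# --- inference ---
model = AutoHQQHFModel.from_quantized(model_save_dir, device='cuda:0')  # step 2
PeftUtils.add_lora(model, lora_params)                  # re-create the LoRA modules
PeftUtils.load_lora_weights(model, 'lora_weights.pt')   # step 3
PeftUtils.cast_lora_weights(model, dtype=torch.bfloat16)
model.eval()
```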
Regarding the save/load, we actually have a working PR that fully supports saving/loading; it's not merged yet: https://github.com/huggingface/transformers/pull/33141
Thank you very much for your advice 🫡. I am browsing that PR now. It will be convenient when transformers fully supports saving/loading. Hope your work goes well, and I will follow your code 👌.
> pip install trl==
One last question: which trl version are you using? Mine is trl==0.9.6, hqq==0.1.8, transformers==4.40.0, which may lead to some errors.
tokenizer.pad_token = tokenizer.unk_token #tokenizer.eos_token
reports:
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
and tokenizer.pad_token = tokenizer.eos_token
reports:
TypeError: WrappedModel.train() takes 1 positional argument but 2 were given
I solved that problem by changing the WrappedModel code:
class WrappedModel(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, *args, **kwargs):
        return self.model.forward(*args, **kwargs)

    def train(self, *args):  #here I change
        self.model.train(*args)

    def eval(self):
        self.model.eval()

    def parameters(self):
        return self.model.parameters()
But sadly, my device's memory is not big enough, so I get OOM because the model is only loaded on one GPU. I hope you can complete the save/load of hqq-quantized models in transformers 👍. Distributed running is very helpful for consumer-grade GPUs 🫡.
You need ~16GB for 1024 tokens. You can reduce the VRAM usage by using an 8-bit AdamW optimizer, for example, or by reducing the tokens from 1024 to 512.
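With the HF TrainingArguments used in the scripts above, both of those changes are one-liners (a sketch; `adamw_bnb_8bit` requires bitsandbytes to be installed, and `paged_adamw_8bit` is an alternative):

```python
import transformers

training_args = transformers.TrainingArguments(
    output_dir='.',
    per_device_train_batch_size=1,
    optim="adamw_bnb_8bit",   # 8-bit AdamW instead of the default fp32 AdamW
    bf16=True,
    # ... the rest of the arguments as in the script above ...
)
# and in SFTTrainer: max_seq_length=512 instead of 1024
```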
Great!
😭 When saving the checkpoint, trainer.train() reported an error: HQQLinearLoRA.state_dict() got an unexpected keyword argument 'destination'.
So I modified the HQQLinearLoRA.state_dict() function:
def state_dict(self, destination, prefix, keep_vars, *args):  #(self):
    return {
        "lora_A": self.lora_A.data,
        "lora_B": self.lora_B.data,
        "scaling": self.scaling,
        "bias": self.bias,
    }
The trainer can now save/load the checkpoints correctly. But I don't know if this has any impact on the fine-tuning results.
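For reference, a signature-compatible variant of that override would accept the keyword arguments torch passes in and fill `destination` under `prefix` (a sketch, not the library's code, so treat it as an illustration only):

```python
def state_dict(self, *args, destination=None, prefix='', keep_vars=False):
    # Keep nn.Module's calling convention: write entries into `destination` under `prefix`.
    if destination is None:
        destination = {}
    destination[prefix + 'lora_A'] = self.lora_A.data
    destination[prefix + 'lora_B'] = self.lora_B.data
    destination[prefix + 'scaling'] = self.scaling
    destination[prefix + 'bias'] = self.bias
    return destination
```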
Oh, thanks for mentioning that :thinking:, which version of transformers are you using?
transformers == 4.44.2, trl == 0.9.6, hqq == 0.1.8. I have tried transformers == 4.40.0 before and I got the same error.
Alright, will take a look at it later :thinking:
Did you get this error when running it? I think it may be a problem with the trl version. I think the checkpoint result is a little strange after modifying state_dict(). The model.safetensors file appears in the checkpoint folder. And we know that the hqq quantized model cannot be saved in safetensor format for the time being. So I think there is a high possibility of error. The error occurs in site-packages/torch/nn/modules/module.py line 1939.
for name, module in self._modules.items():
    if module is not None:
        module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)  # <- error here
And my PyTorch version is 2.4.0. Thanks for your help 👍!
So I just tried the example I shared with you, https://github.com/mobiusml/hqq/blob/master/examples/lora/hqq_plus.py, and saving/loading works just fine with PeftUtils. If it's a problem with trl, the best is to open an issue in their repo, since I don't really know what's going on with it. If you use PeftUtils from hqq as recommended, it should work without any issue:
Training
1- Load/quantize the model
2- Add lora: PeftUtils.add_lora(model, lora_params)
3- Train
4- Save the lora weights PeftUtils.save_lora_weights(model, filename)
Inference
1- Load/quantize the model
2- load the lora weights PeftUtils.load_lora_weights(model, filename)
Or you can also load the quantized model directly with the saved lora weights via AutoHQQHFModel.from_quantized: https://github.com/mobiusml/hqq/blob/master/hqq/models/base.py#L464
OK, I see. I'm so sorry for taking up so much of your time and energy. I mean, this is not a problem with the hqq library; I think it is a problem with an incompatible trl version. Can you tell me your trl library version? 🫣
Thank you very much! You are really helpful!
I just did pip install trl and used the latest version!
Oh, there must be something wrong with my environment. I'll find a solution by myself. At least I can fine-tune successfully using the Trainer. Thank you anyway 🫡!
1. The loss doesn't decrease but varies up and down. 2. The text generated by the model is very poor. I think the reason is that the parameters are not set well enough. Here is my code. Could you give me some suggestions? Quantization code:
add adapter code:
and training code:
The performance of the model is much better than before fine-tuning, but not good enough. Is it because I didn’t train enough?
Originally posted by @zxbjushuai in https://github.com/mobiusml/hqq/issues/107#issuecomment-2330509939