Closed by KKcorps 1 year ago
Add this function to the code:
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from peft.tuners.lora import LoraLayer
from transformers import AutoModelForCausalLM, BitsAndBytesConfig


def get_accelerate_model(base_model: str = '',
                         lora_r: int = 8,
                         lora_alpha: int = 16,
                         lora_dropout: float = 0.05,
                         device_map: str = 'auto'):
    # Give every visible GPU the same memory budget so the model can be sharded.
    n_gpus = torch.cuda.device_count()
    max_memory = {i: '80000MB' for i in range(n_gpus)}

    print(f'loading base model {base_model}...')
    compute_dtype = torch.bfloat16
    # 4-bit loading is configured through quantization_config only;
    # recent transformers versions reject a separate load_in_4bit kwarg.
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        device_map=device_map,
        max_memory=max_memory,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=compute_dtype,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4',
        ),
        torch_dtype=torch.bfloat16,
        trust_remote_code=False,
    )

    # Mark the model as already split across devices.
    model.model_parallel = True
    model.is_parallelizable = True

    # Attach LoRA to the MLP and attention projections.
    modules = [
        "gate_proj",
        "down_proj",
        "up_proj",
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ]

    model.config.torch_dtype = torch.bfloat16
    model = prepare_model_for_kbit_training(model)

    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=modules,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
    )
    print('adding LoRA modules...')
    model = get_peft_model(model, config)

    # Fix dtypes: LoRA layers in bf16, norms in fp32, head/embeddings in bf16.
    for name, module in model.named_modules():
        if isinstance(module, LoraLayer):
            module.to(torch.bfloat16)
        if 'norm' in name:
            module.to(torch.float32)
        if 'lm_head' in name or 'embed_tokens' in name:
            if hasattr(module, 'weight'):
                if module.weight.dtype == torch.float32:
                    module.to(torch.bfloat16)
    return model
Replace this portion:

    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=bnb_config,
        device_map=device_map,
    )
    model = prepare_model_for_kbit_training(model)
    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=lora_target_modules,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)

with:

    model = get_accelerate_model(
        base_model=base_model,
        lora_alpha=lora_alpha,
        lora_r=lora_r,
        lora_dropout=lora_dropout,
        device_map=device_map
    )
It works for me.
@anyili Have you tested the LoRA generated using this? How well does it perform?
The training works, but I haven't tested the output. Doing it now.
I used the cleaned Alpaca GPT-4 data to fine-tune, and it works great. One catch: do not use paged_adamw_8bit, because the loss keeps increasing with it. Use paged_adamw_32bit instead.
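For anyone wanting to apply the optimizer choice above, here is a minimal sketch. It assumes the training script builds a `transformers.TrainingArguments`, whose `optim` parameter accepts both names; `output_dir` and `bf16` are placeholder values, not settings reported in this thread.

```python
# Keyword arguments intended for transformers.TrainingArguments(**training_kwargs).
# Only `optim` comes from the comment above; the other entries are assumptions.
training_kwargs = {
    "output_dir": "./qlora-out",   # hypothetical output path
    "optim": "paged_adamw_32bit",  # reported to train stably in this setup
    "bf16": True,                  # assumption: matches the bfloat16 compute dtype
}

# The optimizer that was reported to make the loss keep increasing.
AVOID_OPTIM = "paged_adamw_8bit"
assert training_kwargs["optim"] != AVOID_OPTIM
```

Passing the dictionary as `TrainingArguments(**training_kwargs)` is one way to wire it in; the script in this repo may expose the optimizer differently.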
Awesome @anyili. If you don't mind, can you share the training params here? I am interested in knowing how many epochs you trained for, what the base model was, and how many LoRA params you used.
It's understandable if you don't want to share them.
batch_size: 128
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 1024
val_set_size: 1500
lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['gate_proj', 'down_proj', 'up_proj', 'q_proj', 'k_proj', 'v_proj', 'o_proj']
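As a sanity check on the batch settings listed above, here is a small sketch of how they usually relate. It assumes the common convention that the effective batch size equals micro batch size times gradient accumulation steps times the number of data-parallel processes. The thread does not say how many GPUs were used, so the process count below is inferred, not reported.

```python
def implied_world_size(batch_size: int, micro_batch_size: int,
                       grad_accum_steps: int) -> float:
    """Data-parallel process count implied by the three batch settings,
    under the assumed convention batch = micro * accum * world_size."""
    return batch_size / (micro_batch_size * grad_accum_steps)


# Values taken from the parameter list above.
world_size = implied_world_size(batch_size=128, micro_batch_size=1,
                                grad_accum_steps=8)
print(world_size)
```

Under that assumption the three numbers are consistent only with 16 processes. On a single process, the effective batch size would be 8 rather than 128.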
My base model is 65B. I trained for 1 epoch.
@KKcorps Somehow the memory usage for this repo is quite different from the original repo, which prints a much smaller memory figure. I tried the 65B model with almost identical input: this repo uses almost 70 GB, but the original one uses 30 GB. Something is not quite right...
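A back-of-envelope estimate may help put those two figures in context. The sketch below computes the memory needed for the weights alone at different precisions. It ignores activations, LoRA parameters, optimizer state, and quantization constants, so it is a rough lower bound and not a diagnosis of the gap.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 65e9  # 65B-parameter base model, as stated in the thread
for bits in (4, 8, 16):
    print(f"{bits}-bit weights: {weight_memory_gb(n_params, bits):.1f} GB")
```

The 4-bit figure is close to the 30 GB seen in the original repo, and the 8-bit figure is close to the 70 GB seen here, so checking which quantization mode is actually active in each run could be a reasonable first step.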
@anyili I would appreciate it if you could update the code with the latest modifications and then run the fine-tuning again to evaluate the impact on performance. You don't need to run it for long; a few iterations should be enough to observe the behaviour. Here are the memory utilization graphs for the fork and the original, respectively, for a 1-epoch run with the same arguments: https://api.wandb.ai/links/vihangd/momqxt7x https://api.wandb.ai/links/vihangd/w6ve1vc0 . Based on these, the QLoRA version (this fork) uses slightly less memory.
@vihangd I will try. For the memory usage from wandb, was that trained on 7B or 65B?
@anyili It was trained on 7B. I am planning to try it out on 65B soon.
I have been trying to train llama-7b as well as redpajama-3b using the official qlora repo.
The results, however, are not great. Most of the time I see a lot of repetition or gibberish.
I trained on a small dataset of only 2000 rows, but for 10 epochs.
Both train and eval loss were decreasing the whole time.
The only difference I see between their code and yours is that you don't attach LoRA to every linear layer, as described in the paper.
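To illustrate what "attach LoRA to every linear layer" could look like in practice, here is a minimal sketch that collects the leaf names of all linear modules so they can be passed as `target_modules`. The helper name `find_linear_names` and the stand-in classes are hypothetical and exist only so the example runs without loading a model; excluding `lm_head` is also an assumption.

```python
from typing import Iterable, List, Tuple, Type


def find_linear_names(named_modules: Iterable[Tuple[str, object]],
                      linear_cls: Type,
                      skip: Tuple[str, ...] = ("lm_head",)) -> List[str]:
    """Return the sorted leaf names of every module that is a linear_cls."""
    names = set()
    for full_name, module in named_modules:
        if isinstance(module, linear_cls):
            leaf = full_name.split(".")[-1]  # e.g. "q_proj"
            if leaf not in skip:
                names.add(leaf)
    return sorted(names)


# Stand-in classes so the sketch runs without torch (hypothetical).
class FakeLinear:
    pass


class FakeNorm:
    pass


modules = [
    ("model.layers.0.self_attn.q_proj", FakeLinear()),
    ("model.layers.0.self_attn.v_proj", FakeLinear()),
    ("model.layers.0.mlp.up_proj", FakeLinear()),
    ("model.layers.0.input_layernorm", FakeNorm()),
    ("lm_head", FakeLinear()),
]
print(find_linear_names(modules, FakeLinear))
```

With a real model, one would presumably pass `model.named_modules()` and the linear class actually present after quantization, then hand the result to `LoraConfig(target_modules=...)`.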