vihangd / alpaca-qlora

Instruct-tune Open LLaMA / RedPajama / StableLM models on consumer hardware using QLoRA
Apache License 2.0
80 stars 11 forks

Can you please share the results you get with the trained models? #1

Closed KKcorps closed 1 year ago

KKcorps commented 1 year ago

I have been trying to train LLaMA-7B as well as RedPajama-3B using the official qlora repo.

The results, however, are not great. Most of the time I see a lot of repetition or gibberish.

I have trained it on a small dataset of only 2,000 rows, but for 10 epochs.

Both the train and eval loss were decreasing the whole time.

The only difference I see between their code and yours is that you don't attach LoRA to every linear layer, as the paper recommends.
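
For reference, the official qlora repo derives the target-module list by scanning the model for linear layers rather than hard-coding it. A rough sketch of that approach (not code from this fork; it assumes a bitsandbytes 4-bit quantized model is already loaded):

    import bitsandbytes as bnb

    def find_all_linear_names(model):
        # Collect the leaf name of every 4-bit linear module, e.g. 'q_proj'.
        lora_module_names = set()
        for name, module in model.named_modules():
            if isinstance(module, bnb.nn.Linear4bit):
                lora_module_names.add(name.split('.')[-1])
        # lm_head stays in higher precision and is excluded from the LoRA targets.
        lora_module_names.discard('lm_head')
        return sorted(lora_module_names)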

anyili commented 1 year ago

Add this to the code:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from peft.tuners.lora import LoraLayer


    def get_accelerate_model(base_model: str = '',
                             lora_r: int = 8,
                             lora_alpha: int = 16,
                             lora_dropout: float = 0.05,
                             device_map: str = 'auto'):
        # Cap per-GPU memory so device_map='auto' can shard large models.
        n_gpus = torch.cuda.device_count()
        max_memory = '80000MB'
        max_memory = {i: max_memory for i in range(n_gpus)}

        print(f'loading base model {base_model}...')
        compute_dtype = torch.bfloat16

        # Load the base model in 4-bit NF4 with double quantization (the QLoRA setup).
        model = AutoModelForCausalLM.from_pretrained(
            base_model,
            load_in_4bit=True,
            device_map=device_map,
            max_memory=max_memory,
            quantization_config=BitsAndBytesConfig(
                load_in_4bit=True,
                llm_int8_threshold=6.0,
                llm_int8_has_fp16_weight=False,
                bnb_4bit_compute_dtype=compute_dtype,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_quant_type='nf4'
            ),
            torch_dtype=torch.bfloat16,
            trust_remote_code=False,
        )

        setattr(model, 'model_parallel', True)
        setattr(model, 'is_parallelizable', True)

        # Attach LoRA to every linear projection, as the QLoRA paper recommends.
        modules = [
            "gate_proj",
            "down_proj",
            "up_proj",
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj"
        ]

        model.config.torch_dtype = torch.bfloat16

        # Casts norms to fp32, enables input grads, etc., for k-bit training.
        model = prepare_model_for_kbit_training(model)

        config = LoraConfig(
            r=lora_r,
            lora_alpha=lora_alpha,
            target_modules=modules,
            lora_dropout=lora_dropout,
            bias="none",
            task_type="CAUSAL_LM",
        )

        print('adding LoRA modules...')
        model = get_peft_model(model, config)

        # Dtype handling as in the official qlora repo: LoRA layers in bf16,
        # norms in fp32, embeddings/lm_head in bf16.
        for name, module in model.named_modules():
            if isinstance(module, LoraLayer):
                module = module.to(torch.bfloat16)
            if 'norm' in name:
                module = module.to(torch.float32)
            if 'lm_head' in name or 'embed_tokens' in name:
                if hasattr(module, 'weight'):
                    if module.weight.dtype == torch.float32:
                        module = module.to(torch.bfloat16)
        return model

Then replace this portion of the existing code:

    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=bnb_config,
        device_map=device_map,
    )

    model = prepare_model_for_kbit_training(model)

    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=lora_target_modules,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)

with

    model = get_accelerate_model(
        base_model=base_model,
        lora_alpha=lora_alpha,
        lora_r=lora_r,
        lora_dropout=lora_dropout,
        device_map=device_map,
    )

It works for me.

KKcorps commented 1 year ago

@anyili Have you tested the LoRA adapter generated with this? How well does it perform?

anyili commented 1 year ago

The training works, but I haven't tested the output. Doing it now.
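
A quick way to smoke-test an adapter: load the 4-bit base model, attach the saved adapter, and generate from one prompt. This is only a sketch; the base model name, adapter path, and prompt template below are illustrative, not taken from this repo.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import PeftModel

    base_model = "openlm-research/open_llama_7b"   # assumed base; use whatever you trained on
    adapter_dir = "out/qlora-adapter"              # hypothetical adapter directory

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4",
        ),
        device_map="auto",
    )
    model = PeftModel.from_pretrained(model, adapter_dir)

    # Alpaca-style prompt; adjust to whatever template you trained with.
    prompt = "### Instruction:\nName three uses of LoRA.\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))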

anyili commented 1 year ago

I used the cleaned Alpaca GPT-4 data to fine-tune, and it works great. One catch: don't use paged_adamw_8bit or the loss will keep increasing; use paged_adamw_32bit instead.
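
For anyone reproducing this with the Hugging Face Trainer, the optimizer is selected via the optim field of TrainingArguments; a minimal sketch (the other values here are illustrative, not anyili's exact settings):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./qlora-out",           # hypothetical output path
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=3e-4,
        bf16=True,
        optim="paged_adamw_32bit",          # not "paged_adamw_8bit", per the note above
    )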

KKcorps commented 1 year ago

Awesome, @anyili. If you don't mind, can you share the training params here? I am interested in how many epochs you trained for, what the base model was, and how many LoRA params you used.

It's also understandable if you'd rather not share them.

anyili commented 1 year ago

batch_size: 128
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 1024
val_set_size: 1500
lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['gate_proj', 'down_proj', 'up_proj', 'q_proj', 'k_proj', 'v_proj', 'o_proj']

My base model is 65B. I trained for 1 epoch.

anyili commented 1 year ago

@KKcorps Somehow the memory usage for this repo is quite different from the original repo; the original has a much smaller memory footprint. I tried the 65B model with almost identical input: this repo gives me almost 70 GB of usage, while the original one is about 30 GB. Something is not quite right...
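
One way to compare the two repos on equal footing (not code from either repo): reset PyTorch's peak-memory counters, run the same handful of training steps in each, and print the per-GPU peaks.

    import torch

    for i in range(torch.cuda.device_count()):
        torch.cuda.reset_peak_memory_stats(i)

    # ... run a few identical training steps here ...

    for i in range(torch.cuda.device_count()):
        peak_gib = torch.cuda.max_memory_allocated(i) / 1024**3
        print(f"cuda:{i} peak allocated: {peak_gib:.1f} GiB")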

vihangd commented 1 year ago

@anyili I would appreciate it if you could update the code with the latest modifications and then run the fine-tuning again to evaluate the impact on performance. You don't need to run it for long; just a few iterations should be sufficient to observe the behaviour. Here are the memory utilization graphs for the fork and the original, respectively, for a 1-epoch run with the same arguments: https://api.wandb.ai/links/vihangd/momqxt7x and https://api.wandb.ai/links/vihangd/w6ve1vc0. Based on these, this QLoRA version (this fork) uses slightly less memory.

anyili commented 1 year ago

@vihangd I will try. For the memory usage from wandb, was that trained on 7B or 65B?

vihangd commented 1 year ago

@anyili It was trained on 7B. I am planning to try out 65B soon.