Closed: junzhang-zj closed this issue 1 month ago.
Very hard to say from this information alone. I assume you target the same layers for both, so `print_trainable_parameters` should give you (almost) the same values? Perhaps Llama 3 works better with different hyperparameters, but I haven't tested it myself.
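For reference, a minimal sketch of how that check can look with a standard transformers + PEFT setup (the model IDs and LoRA settings below are placeholders, not values taken from this issue):

```python
# Sketch: build the same LoRA adapter on both base models and compare the
# trainable-parameter counts. Model IDs and LoRA settings are placeholders.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

for model_id in ["meta-llama/Llama-2-70b-hf", "meta-llama/Llama-3.1-70B"]:
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    peft_model = get_peft_model(model, lora_config)
    print(model_id)
    peft_model.print_trainable_parameters()  # counts should be (almost) identical
```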
Thanks for your help. The target layers are the same; I will try other hyperparameters.
The problem was the pre-saved dataset, which had been processed with the LLaMA-2 tokenizer.
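For anyone hitting the same issue, a minimal sketch of tokenizing the raw data with the tokenizer that matches the base model, instead of reusing a dataset pre-tokenized with the LLaMA-2 tokenizer (the data file, text column, and max length are placeholders):

```python
# Sketch: tokenize the raw dataset with the tokenizer of the model being
# fine-tuned, rather than reusing token IDs saved with the LLaMA-2 tokenizer.
# The data file, text column, and max length are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-70B"  # or the LLaMA-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

raw_dataset = load_dataset("json", data_files="train.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw_dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized.save_to_disk("tokenized-llama-3.1")  # keep one cache per tokenizer
```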
When I try to fine-tune both LLaMA-2-70B and LLaMA-3.1-70B with LoRA using the same code, LLaMA-3.1 seems to have an unusual loss landscape. Is there anything I should be aware of?
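For context, a shared fine-tuning setup of this kind might look roughly like the sketch below, assuming a standard transformers Trainer pipeline; this is not the code from this issue, and the model IDs, LoRA settings, hyperparameters, and dataset object are all placeholders:

```python
# Sketch of a LoRA fine-tuning function shared between the two base models.
# Not the reporter's code: model IDs, LoRA settings, hyperparameters, and the
# dataset object are placeholders.
import torch
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

def finetune(model_id, tokenized_dataset, output_dir):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)

    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-4,
        num_train_epochs=1,
        logging_steps=10,
        bf16=True,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()  # compare the logged loss curves between the two runs
```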