tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware
Apache License 2.0

Fine-tune argument resume_from_checkpoint starts from scratch instead of from checkpoint #585

Open prpercival opened 9 months ago

prpercival commented 9 months ago

In finetune.py there is the following section to support resuming from a checkpoint, but note that resume_from_checkpoint is set to False whenever pytorch_model.bin does not exist, even though the code also supports a checkpoint_name of adapter_model.bin. This causes finetune.py to start training from scratch even when a seemingly valid resume_from_checkpoint argument is supplied.

If I move line 200 into the if/else on lines 204 and 208, the script resumes my fine-tune from adapter_model.bin as expected. Is there something I'm missing here? Should I not be resuming from certain checkpoints?

if resume_from_checkpoint:
    # Check the available weights and load them
    checkpoint_name = os.path.join(
        resume_from_checkpoint, "pytorch_model.bin"
    )  # Full checkpoint
    if not os.path.exists(checkpoint_name):
        checkpoint_name = os.path.join(
            resume_from_checkpoint, "adapter_model.bin"
        )  # only LoRA model - LoRA config above has to fit
        resume_from_checkpoint = (
            False  # So the trainer won't try loading its state
        )
    # The two files above have a different name depending on how they were saved, but are actually the same.
    if os.path.exists(checkpoint_name):
        print(f"Restarting from {checkpoint_name}")
        adapters_weights = torch.load(checkpoint_name)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        print(f"Checkpoint {checkpoint_name} not found")

to

if resume_from_checkpoint:
    # Check the available weights and load them
    checkpoint_name = os.path.join(
        resume_from_checkpoint, "pytorch_model.bin"
    )  # Full checkpoint
    if not os.path.exists(checkpoint_name):
        checkpoint_name = os.path.join(
            resume_from_checkpoint, "adapter_model.bin"
        )  # only LoRA model - LoRA config above has to fit
    # The two files above have a different name depending on how they were saved, but are actually the same.
    if os.path.exists(checkpoint_name):
        print(f"Restarting from {checkpoint_name}")
        adapters_weights = torch.load(checkpoint_name)
        set_peft_model_state_dict(model, adapters_weights)
        resume_from_checkpoint = True
    else:
        print(f"Checkpoint {checkpoint_name} not found")
        resume_from_checkpoint = (
            False  # So the trainer won't try loading its state
        )
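
For context, whatever value is left in resume_from_checkpoint is what finetune.py later forwards to the trainer; paraphrased (not a verbatim excerpt), the downstream call is roughly:

# Paraphrased from later in finetune.py, not a verbatim excerpt: the value
# left in resume_from_checkpoint is handed to the Hugging Face Trainer.
# False/None means "do not load trainer state"; True or a checkpoint path
# makes the Trainer also look for optimizer/scheduler state, which an
# adapter_model.bin-only checkpoint does not provide.
trainer.train(resume_from_checkpoint=resume_from_checkpoint)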

amina-mardiyyah commented 5 months ago

I am also quite curious about this; if someone could provide insight, I would appreciate it. I am training on a GPU cluster that runs my code for a limited amount of time. Afterward, I have to resubmit the job to get a node again and resume training. It is not efficient if training restarts from scratch every time I am disconnected. Is there a better way to handle this?
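
For what it's worth, the generic Hugging Face Trainer pattern for surviving job time limits is to write periodic checkpoints and pick up the newest one on relaunch. Below is a minimal sketch, independent of alpaca-lora's finetune.py; the output_dir, the step counts, and the model/train_data objects are placeholders, not the script's actual configuration.

import os

from transformers import Trainer, TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

training_args = TrainingArguments(
    output_dir="lora-alpaca",  # checkpoints are written to lora-alpaca/checkpoint-*
    save_strategy="steps",
    save_steps=200,            # write a checkpoint every 200 optimizer steps
    save_total_limit=3,        # keep only the three newest checkpoints
)

# model and train_data are assumed to be built the same way the script already does.
trainer = Trainer(model=model, args=training_args, train_dataset=train_data)

# On the first submission there is no checkpoint yet, so this returns None and
# training starts from scratch; after a requeue it returns the newest
# checkpoint-* directory, and the Trainer restores model, optimizer, and
# scheduler state from it.
last_checkpoint = (
    get_last_checkpoint(training_args.output_dir)
    if os.path.isdir(training_args.output_dir)
    else None
)
trainer.train(resume_from_checkpoint=last_checkpoint)

With this pattern, a job killed by the cluster's time limit continues from the last saved step on the next submission instead of restarting from zero.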