ohmeow opened this issue 1 year ago
NOTE: This only occurs if I'm using the DeepSpeed `accelerate` config and set `num_processes > 1`.
So I think the solution is to add `accelerator.wait_for_everyone()` before you instantiate the `DPOTrainer`.

If someone can confirm that, feel free to close this out. If not, lmk :)
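Roughly the ordering I have in mind (a sketch only, not the actual script):

```python
from accelerate import Accelerator

accelerator = Accelerator()

# ... model loading / adapter merging happens here on each rank ...

# Barrier: every rank blocks until all ranks reach this point, so no process
# starts constructing the trainer while another is still merging weights.
accelerator.wait_for_everyone()

# trainer = DPOTrainer(...)  # instantiate the DPOTrainer only after the barrier
```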
I think the problem might be related to using DeepSpeed on my local DL rig with 2x3090s. Just switched to the `multi-gpu.yaml` file and the script ran with no problem.
Hi @ohmeow, as discussed here, I think the issue is indeed when trying to do the following: use `zero.init()` to shard the base model weights directly on the GPU via this flag in the `accelerate` config. I don't think we saw this issue in the original release of the code because we made a goof on the `device_map` for LoRA training that was later fixed in #51.

If you have enough vRAM, then one should be able to work around this by setting `zero3_init_flag: False` in the `accelerate` config.

I'm discussing this with the `peft` team and hopefully we can find a more stable solution!
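For reference, the relevant portion of a ZeRO-3 `accelerate` config with that flag disabled would look roughly like this (a sketch only; values such as `num_processes` are illustrative for a 2-GPU machine):

```yaml
# Sketch of a DeepSpeed ZeRO-3 accelerate config with zero.init() disabled.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3
  zero3_init_flag: false        # do not shard the base weights at load time
  zero3_save_16bit_model: true
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: bf16
num_machines: 1
num_processes: 2                # one process per GPU
```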
The only way I was able to get training to proceed was by adding `device_map=get_kbit_device_map()` to the `model_kwargs` when loading an adapter model:
```python
if is_adapter_model(model, model_args.model_revision):
    # load the model, merge the adapter weights and unload the adapter
    # Note: to run QLoRA, you will need to merge the base model separately as the merged model in 16bit
    logger.info(f"Merging peft adapters for {model_args.model_name_or_path=}")
    peft_config = PeftConfig.from_pretrained(model_args.model_name_or_path, revision=model_args.model_revision)
    model_kwargs = dict(
        revision=model_args.base_model_revision,
        trust_remote_code=model_args.trust_remote_code,
        use_flash_attention_2=model_args.use_flash_attention_2,
        torch_dtype=torch_dtype,
        use_cache=False if training_args.gradient_checkpointing else True,
        device_map=get_kbit_device_map(),
    )
    base_model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path, **model_kwargs)
    model = PeftModel.from_pretrained(base_model, model_args.model_name_or_path, revision=model_args.model_revision)
    model.eval()
    model = model.merge_and_unload()
    model_kwargs = None

if model_args.use_peft is True:
    ref_model = None
    ref_model_kwargs = None
else:
    ref_model = model
    ref_model_kwargs = model_kwargs

accelerator.wait_for_everyone()
```
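As far as I can tell, that helper just pins the entire model to the GPU owned by the local rank, so each process loads its own full (quantized) copy instead of letting ZeRO-3 shard the weights. Roughly this behaviour (an approximation, not the actual source):

```python
from typing import Dict, Optional

import torch
from accelerate import Accelerator


def kbit_device_map_sketch() -> Optional[Dict[str, int]]:
    # "" means "the whole model": place every module on this rank's local GPU.
    return {"": Accelerator().local_process_index} if torch.cuda.is_available() else None
```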
With this I can get everything running on my 2x3090s using the `multi-gpu.yaml` config. GPU utilization looks even across both cards.

The DeepSpeed config works as well, but for some reason it fails when pushing the model to the Hub. I imagine this has something to do with my machine and/or with using 3090s.
Can confirm that setting `zero3_init_flag: False` helps.
> I think the problem might be related to using DeepSpeed on my local DL rig with 2x3090s. Just switched to the `multi-gpu.yaml` file and the script ran with no problem.
Having the same issue here, but weirdly the DPO script cannot run even with `multi-gpu.yaml` on my machine. Could you please share your `multi-gpu.yaml` file? In my understanding, `multi-gpu.yaml` is for data parallelism, so it should not have a problem with merging the QLoRA adapter.
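For reference, my understanding is that a plain multi-GPU `accelerate` config is just standard DDP, roughly along these lines (a sketch, not the handbook's exact file):

```yaml
# Sketch of a plain data-parallel (DDP) accelerate config for 2 GPUs.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
gpu_ids: all
mixed_precision: bf16
num_machines: 1
num_processes: 2   # one process per GPU
```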
So I'm attempting to run the DPO LoRA script and I'm getting this error:

...

when the `model.merge_and_unload()` call runs here. Any ideas?