ohmeow opened this issue 1 year ago
NOTE: This only occurs if I'm using the DeepSpeed `accelerate` config and set `num_processes > 1`.
So I think the solution is to add `accelerator.wait_for_everyone()` before you instantiate the `DPOTrainer`.

If someone can confirm that, feel free to close this out. If not, lmk :)
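Roughly the ordering I have in mind (a sketch only, not the actual script):

```python
from accelerate import Accelerator

accelerator = Accelerator()

# ... model loading / adapter merging happens here on each rank ...

# Barrier: every rank blocks until all ranks reach this point, so no process
# starts constructing the trainer while another is still merging weights.
accelerator.wait_for_everyone()

# trainer = DPOTrainer(...)  # instantiate the DPOTrainer only after the barrier
```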
I think the problem might be related to using DeepSpeed on my local DL rig with 2x3090s. Just switched to the `multi-gpu.yaml` file and the script ran with no problem.
Hi @ohmeow, as discussed here, I think the issue is indeed when trying to do the following: use `zero.init()` to shard the base model weights directly on the GPU via this flag in the `accelerate` config. I don't think we saw this issue in the original release of the code because we made a goof on the `device_map` for LoRA training that was later fixed in #51.

If you have enough vRAM, then one should be able to work around this by setting `zero3_init_flag: False` in the `accelerate` config.

I'm discussing this with the `peft` team and hopefully we can find a more stable solution!
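For reference, the relevant portion of a ZeRO-3 `accelerate` config with that flag disabled would look roughly like this (a sketch only; values such as `num_processes` are illustrative for a 2-GPU machine):

```yaml
# Sketch of a DeepSpeed ZeRO-3 accelerate config with zero.init() disabled.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3
  zero3_init_flag: false        # do not shard the base weights at load time
  zero3_save_16bit_model: true
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: bf16
num_machines: 1
num_processes: 2                # one process per GPU
```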
The only way I was able to get training to proceed was by adding `device_map=get_kbit_device_map()` to the `model_kwargs` when loading an adapter model:
```python
if is_adapter_model(model, model_args.model_revision):
    # load the model, merge the adapter weights and unload the adapter
    # Note: to run QLoRA, you will need to merge the base model separately as the merged model in 16bit
    logger.info(f"Merging peft adapters for {model_args.model_name_or_path=}")
    peft_config = PeftConfig.from_pretrained(model_args.model_name_or_path, revision=model_args.model_revision)
    model_kwargs = dict(
        revision=model_args.base_model_revision,
        trust_remote_code=model_args.trust_remote_code,
        use_flash_attention_2=model_args.use_flash_attention_2,
        torch_dtype=torch_dtype,
        use_cache=False if training_args.gradient_checkpointing else True,
        device_map=get_kbit_device_map(),
    )
    base_model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path, **model_kwargs)
    model = PeftModel.from_pretrained(base_model, model_args.model_name_or_path, revision=model_args.model_revision)
    model.eval()
    model = model.merge_and_unload()
    model_kwargs = None

if model_args.use_peft is True:
    ref_model = None
    ref_model_kwargs = None
else:
    ref_model = model
    ref_model_kwargs = model_kwargs

accelerator.wait_for_everyone()
```
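As far as I can tell, that helper just pins the entire model to the GPU owned by the local rank, so each process loads its own full (quantized) copy instead of letting ZeRO-3 shard the weights. Roughly this behaviour (an approximation, not the actual source):

```python
from typing import Dict, Optional

import torch
from accelerate import Accelerator


def kbit_device_map_sketch() -> Optional[Dict[str, int]]:
    # "" means "the whole model": place every module on this rank's local GPU.
    return {"": Accelerator().local_process_index} if torch.cuda.is_available() else None
```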
With this I can get everything running on my 2x3090s using the `multi-gpu.yaml` config. GPU utilization looks even across both cards.

The DeepSpeed config works as well, but for some reason it fails when pushing the model to the Hub. I imagine this has something to do with my machine and/or with using 3090s.
Can confirm that setting `zero3_init_flag: False` helps.
> I think the problem might be related to using DeepSpeed on my local DL rig with 2x3090s. Just switched to the `multi-gpu.yaml` file and the script ran with no problem.
Having the same issue here, but weirdly the DPO script cannot run even with `multi-gpu.yaml` on my machine. Could you please share your `multi-gpu.yaml` file? In my understanding, `multi-gpu.yaml` is for data parallelism, so it should not have a problem with merging the QLoRA adapter.
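For reference, my understanding is that a plain multi-GPU `accelerate` config is just standard DDP, roughly along these lines (a sketch, not the handbook's exact file):

```yaml
# Sketch of a plain data-parallel (DDP) accelerate config for 2 GPUs.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
gpu_ids: all
mixed_precision: bf16
num_machines: 1
num_processes: 2   # one process per GPU
```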
So I'm attempting to run the DPO LoRA script and I'm getting this error:

...

when the `model.merge_and_unload()` call runs here. Any ideas?