shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains a medical large language model, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

Question about merging model weights #117

Closed richard880502 closed 1 year ago

richard880502 commented 1 year ago

merge_peft_adapter.py

When merging baichuan13B, I got the following error:

Traceback (most recent call last):
  File "merge_peft_adapter.py", line 110, in <module>
    main()
  File "merge_peft_adapter.py", line 93, in main
    lora_model = PeftModel.from_pretrained(
  File "/home/largitdata/miniconda3/envs/chatglm/lib/python3.8/site-packages/peft/peft_model.py", line 271, in from_pretrained
    model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
  File "/home/largitdata/miniconda3/envs/chatglm/lib/python3.8/site-packages/peft/peft_model.py", line 581, in load_adapter
    max_memory = get_balanced_memory(
  File "/home/largitdata/miniconda3/envs/chatglm/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 753, in get_balanced_memory
    per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices)
ZeroDivisionError: integer division or modulo by zero
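
For context, the division that fails is `module_sizes[""] // num_devices`, so `num_devices` must have come out as 0, i.e. accelerate did not count any usable CUDA device when building the balanced device map (for example because no GPU is visible to the process, or because the visible GPUs report no free memory). A minimal diagnostic sketch, not from the repo, that only assumes PyTorch is installed:

```python
# Quick check of what PyTorch (and therefore accelerate) can see on this machine.
import torch

print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # returns (free_bytes, total_bytes)
    print(f"GPU {i}: free {free / 1e9:.1f} GB / total {total / 1e9:.1f} GB")
```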

After some searching, I tried modifying the code under the accelerate/utils folder in site-packages, replacing the original 0 with 1 so that the line became:

max_memory = {i: torch.cuda.mem_get_info(i)[1] for i in range(torch.cuda.device_count())}

When I ran the merge again, it failed with an out-of-memory error. I'd like to know whether, at the merge stage, there is any option other than switching to bigger hardware; unlike the SFT code, the merge step has no QLoRA-style option for saving memory. I'm using an RTX 3090 with 24 GB. Thanks!
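
A rough estimate shows why 24 GB is tight for this merge: the full set of base weights has to sit in memory at once. A back-of-the-envelope sketch (the parameter count and dtype are assumptions, not measured values):

```python
# Back-of-the-envelope estimate; 13e9 parameters and fp16 are assumptions, not measurements.
params = 13e9          # ~13B parameters for a Baichuan-13B-class model
bytes_per_param = 2    # fp16 / bf16
weights_gb = params * bytes_per_param / 1e9
print(f"Base weights alone: ~{weights_gb:.0f} GB")  # ~26 GB, already above a 24 GB RTX 3090
```

Since the base weights alone already exceed a single 3090, the usual workaround is to run the merge in system RAM on the CPU rather than on the GPU; see the sketch after the follow-up comment below.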

richard880502 commented 1 year ago

Later, referring to this, I made a few modifications to merge_peft_adapter.py and the merge succeeded!
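
The exact modification isn't shown in the comment, but a common way to avoid both the ZeroDivisionError and the GPU out-of-memory error is to keep the entire merge on CPU, so that accelerate never has to build a GPU device map and no VRAM is needed. A minimal standalone sketch, not the repo's merge_peft_adapter.py; the model ID, adapter directory, and output path below are placeholders:

```python
# CPU-only LoRA merge sketch; paths and model IDs are placeholders, not taken from the issue.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_path = "baichuan-inc/Baichuan-13B-Chat"  # placeholder base model
adapter_path = "./outputs-sft-baichuan"             # placeholder LoRA adapter directory
output_path = "./merged-baichuan-13b"

# Load the base model in fp16 on CPU. No device_map is passed, so nothing is
# dispatched across GPUs and accelerate's get_balanced_memory is never invoked.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    low_cpu_mem_usage=True,  # stream weights in to keep peak system RAM down
)

# Attach the LoRA adapter on the same (CPU) device.
lora_model = PeftModel.from_pretrained(base_model, adapter_path)

# Fold the LoRA deltas into the base weights and drop the adapter wrappers.
merged = lora_model.merge_and_unload()
merged.save_pretrained(output_path)

# Save the tokenizer alongside so the merged folder is self-contained.
AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True).save_pretrained(output_path)
```

The merge itself is just a weight addition, so doing it in system RAM is slow but safe; the merged fp16 folder can then be loaded onto the GPU (optionally quantized) for inference.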