Hi @Skylight-Lark
Thank you for your interest in our work. I noticed another person faced the same error, but I am not yet sure what is causing it. Could you please try using zero3.json or zero2.json instead of zero3_offload.json and see if that solves the issue? Thank you.
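For reference, here is a minimal sketch (not from this thread; the paths and argument values are hypothetical) of how a different DeepSpeed config file can be selected through the HF Trainer integration:

```python
# Hypothetical sketch: swapping the DeepSpeed config passed to the HF Trainer.
# The config file must exist at the given path for this to run.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints/llava-finetune",  # hypothetical output path
    per_device_train_batch_size=1,
    bf16=True,
    # Point the Trainer at zero3.json (or zero2.json) instead of
    # zero3_offload.json; the path is resolved from the working directory.
    deepspeed="scripts/zero3.json",
)
```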
Hi @mmaaz60 Thank you for your quick response. When I use zero3.json, it has the same problem, and when using zero2.json, it goes OOM.
Hi @Skylight-Lark
May I know which DeepSpeed version you are using? Try version 0.13.1; it may help. Additionally, in llava_arch.py, moving the mm_projector initialization code outside the for loop may help as well. (https://github.com/haotian-liu/LLaVA/blob/c121f0432da27facab705978f83c4ada465e46fd/llava/model/llava_arch.py#L36)
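A minimal sketch of the suggested change is below. It assumes, hypothetically, that the LLaVA-pp variant of `LlavaMetaModel.__init__` rebuilt the projector inside a loop; the builder imports follow the upstream LLaVA layout.

```python
# Hedged sketch of "move mm_projector initialization outside the for loop".
# The loop shape is an assumption; only the before/after pattern matters.
import torch.nn as nn
from llava.model.multimodal_encoder.builder import build_vision_tower
from llava.model.multimodal_projector.builder import build_vision_projector


class LlavaMetaModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        if hasattr(config, "mm_vision_tower"):
            # Before (hypothetical): the projector was re-created on every
            # iteration, which can leave ZeRO-3's partitioning of its weights
            # in an inconsistent state:
            #
            # for _ in range(getattr(config, "num_vision_towers", 1)):
            #     self.vision_tower = build_vision_tower(config, delay_load=True)
            #     self.mm_projector = build_vision_projector(config)

            # After: build the vision tower as before, but initialize the
            # projector exactly once, outside the loop.
            self.vision_tower = build_vision_tower(config, delay_load=True)
            self.mm_projector = build_vision_projector(config)
```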
Hi @mmaaz60
It works after moving the mm_projector initialization code outside the for loop. Thank you for your patience in resolving the issue.
Hi, I encountered the same issue while fine-tuning with LoRA. Could you please share your solution if possible?
Issue Title: use the finetune script but meet error
Environment
Issue Description
When I used the LLaVA-pp codebase and the finetune script to fine-tune our model, the following error appeared:

However, when I used the official LLaVA codebase and script, I did not hit that error, which is weird! It seems that DeepSpeed ZeRO-3 has a bug and can't gather the sharded parameters.

I've tried various approaches from the DeepSpeed and Transformers issue trackers to fix this, but haven't been successful. Any help would be greatly appreciated!
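For context, DeepSpeed's public API for this gathering step is `deepspeed.zero.GatheredParameters`. Below is a minimal sketch (not from the issue, and assuming an already-initialized ZeRO-3 distributed run) of what gathering sharded parameters normally looks like:

```python
# Hedged sketch: gathering ZeRO-3 sharded parameters with DeepSpeed.
# Assumes this runs inside a ZeRO-3 job where distributed init has happened.
import deepspeed
import torch
import torch.nn as nn

linear = nn.Linear(1024, 1024)  # stand-in for an mm_projector layer

# Under ZeRO-3 each rank holds only a shard of linear.weight (locally it can
# even appear as an empty tensor). GatheredParameters temporarily
# materializes the full tensor so it can be read or modified on modifier_rank.
with deepspeed.zero.GatheredParameters(list(linear.parameters()), modifier_rank=0):
    if torch.distributed.get_rank() == 0:
        nn.init.xavier_uniform_(linear.weight)  # full weight is present here
```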
Full Script
Full Stack Trace