Open ldh127 opened 6 months ago
@ldh127, does the following help? https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html#zero-checkpoint-fp32-weights-recovery
Yes, but I think that code is for converting the model parameters (like ds_to_universal), not for merging multiple optimizer files into one. Can it merge DeepSpeed's multi-GPU optimizer files into a single PyTorch optim.pt file?
Hi @ldh127 - can you please be more specific and share more about what you are trying to do and what errors you are hitting?
Yes. I use the transformers Trainer to call DeepSpeed, and it saves a DeepSpeed checkpoint that contains per-GPU model and optimizer files. I want a single optim.pt file (I need it for selecting SFT data); my code can only load one global optim.pt, but the DeepSpeed checkpoint produces multiple optimizer and model shards. How can I merge the multiple optimizer files into one global file?
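To make the request concrete, here is a minimal sketch of what "merging" per-rank optimizer shards could mean. It assumes each shard follows torch.optim's usual layout (`{"state": {param_id: ...}, "param_groups": [...]}`); the actual structure inside DeepSpeed's `zero_pp_rank_*_optim_states.pt` files differs by ZeRO stage, so treat this as an illustration, not DeepSpeed's API.

```python
def merge_optim_shards(shards):
    """Combine per-rank optimizer state dicts into one dict.

    Assumption: each rank owns a disjoint partition of the parameters,
    so the per-shard param ids do not collide.
    """
    merged = {"state": {}, "param_groups": None}
    for shard in shards:
        merged["state"].update(shard["state"])
        # Hyperparameters (lr, betas, ...) are replicated across ranks,
        # so keeping the first copy is enough.
        if merged["param_groups"] is None:
            merged["param_groups"] = shard["param_groups"]
    return merged
```

With real checkpoints you would `torch.load(...)` each `zero_pp_rank_*_optim_states.pt` file first. Note that under ZeRO stages 2/3 the per-rank tensors are flattened partitions, so a true merge also has to reassemble them into per-parameter shapes, which this sketch does not attempt.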
@ldh127, why do you say the link is related to ds2universal? Did you try it? Can you clarify how your scenario is different from the use case below? Thanks!
Yes, I tried that code, and I do end up with a single .pth file, but here are the details.

(screenshot: my DeepSpeed checkpoint folder) I ran your code against this folder, and it merged everything and saved a single file. (screenshot: the code I used, and the resulting file) When I print the state_dict keys, they look like:

base_model.model.model.layers.38.self_attn.q_proj.lora_A.default.weight
base_model.model.model.layers.38.self_attn.q_proj.lora_B.default.weight
base_model.model.model.layers.38.self_attn.k_proj.lora_A.default.weight
base_model.model.model.layers.38.self_attn.k_proj.lora_B.default.weight

But these look like model weight names, not optimizer state names?
As the picture above shows: if the final file, demo_state_dict.pth, contained optimizer parameters, how would I get the optimizer state_dict out of it? If it were the merged optimizer file, I would expect to access it with something like state_dict["optim_state"], but there is no such key in the dict, so I don't know what is wrong in my steps.
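The missing `optim_state` key is consistent with the file holding only model weights: the fp32-recovery path in the linked docs reconstructs weights, not optimizer state. A quick way to check what a loaded file actually contains is to classify its top-level keys; the key names used below (`optimizer_state_dict`, `state`, `param_groups`) follow common torch.optim/checkpoint conventions and are assumptions, not a DeepSpeed guarantee.

```python
def describe_checkpoint_dict(sd):
    """Roughly classify a loaded checkpoint dictionary by its keys."""
    if "optimizer_state_dict" in sd:
        return "wrapped checkpoint containing optimizer state"
    if "state" in sd and "param_groups" in sd:
        return "torch.optim-style optimizer state dict"
    # Keys like '...q_proj.lora_A.default.weight' indicate plain weights.
    return "flat model state_dict (weights only)"
```

Run it on the dict you get from `torch.load("demo_state_dict.pth", map_location="cpu")`; keys like `base_model.model.model.layers.38.self_attn.q_proj.lora_A.default.weight` will classify it as weights-only.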
I also read the code at https://github.com/microsoft/DeepSpeed/blob/4c15ad9f8d51a1950842c69bbbc9d93c73afbcfc/deepspeed/utils/zero_to_fp32.py , but I don't know what code I would need to change. Can you give me more detailed help? Thanks.
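A first step toward working with the raw shards yourself is locating the optimizer files inside the checkpoint folder. This sketch separates files by name; the filename patterns (`mp_rank_*_model_states.pt`, `zero_pp_rank_*_mp_rank_*_optim_states.pt`) match commonly seen DeepSpeed ZeRO checkpoints but are an assumption, not a stable contract across versions.

```python
def split_checkpoint_files(filenames):
    """Separate a ZeRO checkpoint folder's files into model-state and
    optimizer-state shards, judging purely by filename suffix."""
    model = sorted(f for f in filenames if f.endswith("_model_states.pt"))
    optim = sorted(f for f in filenames if f.endswith("_optim_states.pt"))
    return model, optim
```

In practice you would pass `os.listdir(checkpoint_dir)` and then `torch.load` each optimizer shard on CPU to inspect its contents.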