I managed to solve this issue by updating the ZeRO-3 optimizer's internal state after switching the trainable parameters each time. This solution is more time-efficient compared to re-running `deepspeed.initialize` each time. Interested readers may refer to this repo, where I implement the main logic in `_switch_trainable_params_zero3`.
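In outline, the pattern looks like this (a minimal sketch; `switch_block` is a hypothetical stand-in for the real logic in `_switch_trainable_params_zero3`):

```python
# Minimal sketch of the switching pattern; switch_block is a hypothetical
# stand-in, and the real state update lives in _switch_trainable_params_zero3.
def switch_block(engine, blocks, active_idx):
    # 1) Toggle requires_grad on the underlying module's parameters.
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = (i == active_idx)
    # 2) Refresh the wrapped DeepSpeedZeroOptimizer_Stage3 so its partitioned
    #    groups and gradient hooks track the new trainable set, instead of
    #    re-running deepspeed.initialize.
    ...
```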
Thanks for adding your solution @Ledzy
Thank you for your great contribution!
Problem I want to solve
I would like to know how to safely switch the trainable parameters during the ZeRO-3 stage. Consider a network with 2 layers: my objective is to train the first layer for a fixed number of steps, then switch to the second layer and train it for the same number of iterations, then switch back to the first layer and repeat the procedure.
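In plain PyTorch (without ZeRO-3) this schedule is straightforward; a minimal sketch, with illustrative layer sizes and step budget:

```python
import torch
import torch.nn as nn

# Alternate the trainable layer every K steps.
model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
K = 10  # steps per layer before switching

for step in range(4 * K):
    active = (step // K) % 2  # 0 -> 1st layer, 1 -> 2nd layer
    for i, layer in enumerate(model):
        for p in layer.parameters():
            p.requires_grad = (i == active)
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()  # sets .grad to None for frozen parameters
    loss.backward()
    optimizer.step()  # Adam skips parameters whose .grad is None
```

The question is how to achieve the same behavior once the model is wrapped by a ZeRO-3 engine.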
Encountered issues and observations
A straightforward way is to change the `requires_grad` attribute of the active layer. However, if I set the 1st layer's `requires_grad=True` and the 2nd layer's `requires_grad=False` at the beginning, then only the 1st layer will be updated, even when the 2nd layer's `requires_grad` becomes True in a later training phase.
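A minimal sketch of the failing pattern (model sizes and config path are illustrative):

```python
import torch
import torch.nn as nn
import deepspeed

model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 1))
# Freeze the 2nd layer before building the engine.
for p in model[1].parameters():
    p.requires_grad = False

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config_badam.json",  # ZeRO-3 config (illustrative path)
)

# ... train the 1st layer for a fixed number of steps ...

# Switch: freeze the 1st layer, unfreeze the 2nd.
for p in model[0].parameters():
    p.requires_grad = False
for p in model[1].parameters():
    p.requires_grad = True

x = torch.randn(8, 16).to(engine.device)
loss = engine(x).pow(2).mean()
engine.backward(loss)
engine.step()  # the 2nd layer still receives no update
```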
In particular, the `averaged_gradients` of the `DeepSpeedZeroOptimizer_Stage3` is an all-zero vector when training the 2nd layer. I found that the corresponding `param.requires_grad` is False in this line, even though I have set `requires_grad=True` on the original model's 2nd layer. I suppose this issue is related to DeepSpeed's gradient synchronization mechanism: I guess `deepspeed.initialize` already determines, based on the `requires_grad` attribute, which parameters need gradients and participate in gradient synchronization, just like PyTorch DDP.

I have spent weeks trying to resolve this issue but still cannot find a clean and feasible approach other than re-running `deepspeed.initialize` each time, which is too time-consuming and is inconvenient when combining with other frameworks like Hugging Face's Trainer. Could you offer me some guidance on this problem? Switching trainable parameters may be an important feature for memory-efficient optimization of LLMs. Any help would be greatly appreciated!
For your reference, here is the code I use to test ZeRO-3. The `badam.BlockOptimizer` wraps the original optimizer to automatically switch the trainable parameters every fixed number of iterations; you may use it via `pip install badam`. Its logic is rather straightforward; see the source code here.

Here is the configuration file "ds_config_badam.json" that I used:
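A minimal sketch of a ZeRO-3 config of this kind (placeholder values, not my exact settings):

```json
{
  "train_batch_size": 8,
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```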
Please let me know if you need any additional information. Thank you so much for your time and effort! cc @tjruwase