siddharth9820 opened this issue 5 months ago
@siddharth9820, thanks for reporting this error. I am curious whether this is a recent regression from the PR below, which changed the balancing algorithm: https://github.com/microsoft/DeepSpeed/pull/4312
Can you please try earlier DS versions (v. 0.13.0 or 0.12.6) or revert the PR?
@tjruwase I am able to reproduce the error outside of Megatron-DeepSpeed as well.
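A minimal sketch of one way to reproduce this in isolation (illustrative only, not the original snippet; it assumes the partitioning entry point is `partition_balanced(weights, num_parts)` from `deepspeed.runtime.utils` and that it returns `num_parts + 1` boundary indices):

```python
# Illustrative repro sketch: partition uniform unit weights across pipeline
# stages and flag any stage that ends up with zero layers.
# Assumes deepspeed.runtime.utils.partition_balanced(weights, num_parts)
# returns num_parts + 1 boundary indices into the weights list.
from deepspeed.runtime.utils import partition_balanced

num_layers = 42   # values matching the configuration reported in this thread
num_stages = 16

parts = partition_balanced(weights=[1] * num_layers, num_parts=num_stages)
sizes = [parts[i + 1] - parts[i] for i in range(num_stages)]
print("layers per stage:", sizes)

empty = [i for i, s in enumerate(sizes) if s == 0]
if empty:
    print("stages assigned zero layers:", empty)
```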
I'll try the other versions too. Thanks for the pointer.
About potential fixes - could you first assign one layer to each rank and then run this function on the remaining n - m layers and m ranks? But that wouldn't be an ideal fix if the weights aren't uniform.
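A rough sketch of that idea (purely illustrative; the helper name is made up, and it assumes `partition_balanced(weights, num_parts)` from `deepspeed.runtime.utils` returns `num_parts + 1` boundary indices):

```python
# Hypothetical workaround sketch: reserve one layer per stage up front, then
# balance only the remaining n - m layers across the m stages, so no stage
# can end up empty. Not balanced for non-uniform weights, as noted above.
from deepspeed.runtime.utils import partition_balanced

def partition_with_min_one(weights, num_parts):
    n, m = len(weights), num_parts
    assert n >= m, "need at least one layer per pipeline stage"
    if n == m:
        return list(range(m + 1))          # exactly one layer per stage
    inner = partition_balanced(weights[:n - m], m)
    # Shift boundary i right by i so that stage i also absorbs one of the
    # m reserved layers; every stage size becomes (inner gap) + 1 >= 1.
    return [inner[i] + i for i in range(m + 1)]
```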
@siddharth9820, thanks for the update. This seems like an implementation bug as I find it hard to believe both the new and old algorithms fail these seemingly practical cases.
About potential fixes - could you first assign one layer to each rank and then run this function on the remaining n - m layers and m ranks? But that wouldn't be an ideal fix if the weights aren't uniform.
Yes, it does not seem like this approach would be balanced. I think it will only increase the minimum from zero to one. Right?
Yes, it won't be balanced, but at least it will "run" with Megatron-DeepSpeed. With the current approach, I was getting "empty parameter" errors during optimizer initialization. I believe this was happening on the second-to-last pipeline rank, since it became parameterless.
Running Megatron-DeepSpeed with pipelining seems to call PipelineModule with the type:transformer partitioning method, which leads to this line of code: https://github.com/microsoft/DeepSpeed/blob/2eafe41be7049721b77c2f2b0ee702fea1702239/deepspeed/runtime/pipe/module.py#L391
I tried running this with a model with 42 layers, tensor parallelism = 4, and pipeline parallelism = 16. Pipe ranks 15 and 16 were assigned 0 layers. Something needs to change to ensure that every rank is assigned a non-zero number of layers.
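For context, the type: partitioning path appears to reduce to something like the sketch below: assign weight 1 to every layer whose class name matches the requested type, weight 0 to everything else, and hand those binary weights to the balancer. This is a paraphrase under those assumptions, not the actual DeepSpeed source, and the layer class names are illustrative.

```python
# Paraphrased sketch of the "type:transformer" partitioning path (assumed
# behavior, not the actual DeepSpeed source). Layers matching the requested
# type get weight 1, everything else weight 0, and the binary weights go to
# the balancer. Nothing in this path forces every stage to be non-empty.
from deepspeed.runtime.utils import partition_balanced

def partition_by_type(layer_class_names, layer_type, num_stages):
    binary_weights = [1 if layer_type.lower() in name.lower() else 0
                      for name in layer_class_names]
    return partition_balanced(weights=binary_weights, num_parts=num_stages)

# Roughly the reported configuration: 42 transformer layers plus a few
# non-transformer layers, split across 16 pipeline stages.
names = ["Embedding"] + ["ParallelTransformerLayer"] * 42 + ["LayerNorm", "Head"]
parts = partition_by_type(names, "transformer", num_stages=16)
sizes = [parts[i + 1] - parts[i] for i in range(16)]
print("layers per stage:", sizes)   # some stages may come out empty
```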