microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] 'type:transformer' partitioning doesn't ensure non-zero parameters on each pipeline rank. #5078

Open siddharth9820 opened 5 months ago

siddharth9820 commented 5 months ago

Running Megatron-DeepSpeed with pipelining seems to construct the PipeModule with the type:transformer partitioning method, which leads to this line of code: https://github.com/microsoft/DeepSpeed/blob/2eafe41be7049721b77c2f2b0ee702fea1702239/deepspeed/runtime/pipe/module.py#L391

I tried running this with a model with 42 layers, tensor parallel=4, and pipeline=16. Pipe ranks 15 and 16 were assigned 0 layers. Something needs to change to ensure that every rank is assigned a non-zero number of layers.
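
For reference, the symptom can be probed outside Megatron-DeepSpeed by calling the partitioner directly. A minimal sketch, assuming the routine reached from module.py is `deepspeed.runtime.utils.partition_balanced(weights, num_parts)` and that it returns a list of `num_parts + 1` stage boundaries (both assumptions; names and signatures may differ across DeepSpeed versions):

```python
# Minimal probe sketch (see assumptions above): feed the partitioner 42 unit-weight
# transformer layers and 16 pipeline stages, then count layers per stage.
from deepspeed.runtime import utils as ds_utils

num_layers, num_stages = 42, 16
weights = [1] * num_layers  # 'type:transformer' effectively weights each transformer layer equally

# Assumed entry point; the exact name/signature may differ across DeepSpeed versions.
parts = ds_utils.partition_balanced(weights=weights, num_parts=num_stages)
layers_per_stage = [parts[i + 1] - parts[i] for i in range(num_stages)]

print("layers per stage:", layers_per_stage)
print("stages with zero layers:", [i for i, n in enumerate(layers_per_stage) if n == 0])
```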

tjruwase commented 5 months ago

@siddharth9820, thanks for reporting this error. I am curious whether this is a recent regression from the PR below, which changed the balancing algorithm: https://github.com/microsoft/DeepSpeed/pull/4312

Can you please try earlier DS versions (v. 0.13.0 or 0.12.6) or revert the PR?

siddharth9820 commented 5 months ago

@tjruwase I am able to reproduce the error outside of Megatron-DeepSpeed as well:

[screenshot attached]

I'll try the other versions too. Thanks for the pointer.

About potential fixes: could you first assign 1 layer to each rank and then run this function on the remaining n-m layers and m ranks? That wouldn't be an ideal fix if the weights aren't uniform, though.
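
For concreteness, a rough sketch of that idea (hypothetical code, not DeepSpeed's; the wrapper just delegates to whichever balanced partitioner is in use and assumes it takes `(weights, num_parts)` and returns `num_parts + 1` boundaries):

```python
# Hypothetical sketch of the workaround described above (not DeepSpeed code):
# reserve one layer per pipeline stage, balance only the remaining n - m layers,
# then shift the stage boundaries so every stage gets at least one layer.
def partition_with_min_one(weights, num_parts, partition_balanced):
    n = len(weights)
    assert n >= num_parts, "need at least one layer per pipeline stage"
    # Balance only the n - num_parts "extra" layers across the num_parts stages.
    extra = partition_balanced(weights[num_parts:], num_parts)
    # Shifting boundary i by i hands each stage one reserved layer on top of its
    # share, so no stage is empty -- though, as noted below, the result is not
    # guaranteed to be balanced when layer weights are non-uniform.
    return [extra[i] + i for i in range(num_parts + 1)]
```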

tjruwase commented 5 months ago

@siddharth9820, thanks for the update. This seems like an implementation bug, as I find it hard to believe that both the new and old algorithms fail such seemingly practical cases:

  1. Old algorithm - Fast Optimal Load Balancing Algorithms for 1D Partitioning
  2. New algorithm - https://www8.cs.umu.se/kurser/TDBAfl/VT06/algorithms/BOOK/BOOK2/NODE45.HTM
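
One way both algorithms could produce this result without an implementation bug: both are bottleneck (maximum per-stage load) minimizers, and that objective alone does not rule out empty stages. A self-contained illustration with the reported shape (not DeepSpeed's code):

```python
# Self-contained illustration (not DeepSpeed's implementation): with 42 unit-weight
# layers and 16 stages the optimal bottleneck is ceil(42/16) = 3, and 14 stages of
# 3 layers already cover all 42 layers, so a bottleneck-optimal partition is free
# to leave the last two stages empty -- matching the reported symptom.
import math

num_layers, num_stages = 42, 16
bottleneck = math.ceil(num_layers / num_stages)   # 3 layers per stage at most

loads, remaining = [], num_layers
for _ in range(num_stages):
    take = min(bottleneck, remaining)             # greedily fill each stage to the bottleneck
    loads.append(take)
    remaining -= take

print(loads)                    # [3]*14 + [0, 0]: optimal max load, but two empty stages
print(max(loads), sum(loads))   # 3 42
```
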
tjruwase commented 5 months ago

> About potential fixes: could you first assign 1 layer to each rank and then run this function on the remaining n-m layers and m ranks? That wouldn't be an ideal fix if the weights aren't uniform, though.

Yes, it does not seem like this approach would be balanced. I think it will only increase the minimum from zero to one. Right?

siddharth9820 commented 5 months ago

Yes, it won't be balanced, but at least it will "run" with Megatron-DeepSpeed. With the current approach, I was getting "empty parameter" errors during optimizer initialization. I believe this was happening on the second-to-last PP rank, since it became parameterless.
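
Until the partitioner guarantees non-empty stages, a cheap check on the returned boundaries would at least surface the problem before optimizer initialization with a clearer message than the "empty parameter" failure. A sketch under the same boundary-list assumption as above:

```python
# Hedged sketch: validate stage boundaries before building the optimizer.
# `parts` is assumed to be the num_stages + 1 boundary list produced by the
# partitioning step; the helper itself is hypothetical, not DeepSpeed API.
def check_nonempty_stages(parts):
    empty = [i for i in range(len(parts) - 1) if parts[i + 1] == parts[i]]
    if empty:
        raise ValueError(
            f"pipeline stages {empty} were assigned zero layers; "
            "reduce the pipeline-parallel degree or adjust the partitioning method"
        )
```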