sail-sg / zero-bubble-pipeline-parallelism

Zero Bubble Pipeline Parallelism

[QUESTION] I used the zero-bubble commit 7ad9c81d for my experiments, and found that the memory usage of this zb-v model exceeds that of the previous zb1 model. What could be the issue? The specific configuration and results are shown in the image. #36

Open lbk-sys opened 1 month ago

lbk-sys commented 1 month ago

[Image: 1721284001438-jws (1)]

lbk-sys commented 1 month ago

I turned on the following features in zb-1 and zb-v:

zb-v:
    --enable-zero-bubble \
    --zero-bubble-v-schedule \
    --allow-padding-num-layers \
    --zero-bubble-max-pending-backward $((1 * $PP)) \
    --enable-optimizer-post-validation \

zb-1:
    --enable-zero-bubble \
    --allow-padding-num-layers \
    --zero-bubble-max-pending-backward $((1 * $PP)) \
    --enable-optimizer-post-validation \

ufotalent commented 1 month ago

Hi, thanks for your interest in our work. Theoretically, ZBV has the same activation memory as 1F1B and zbh1, but ZBV also changes the placement of layers. One likely cause of the difference is the lm-head and the input embedding: under 1F1B they sit on different stages, while under ZBV they sit on the same stage.

A quick calculation: for 1F1B with pp_stage=8, the peak memory is on rank 0, with 8x the activation of 4 layers + the parameters of 4 layers + the parameters of the input embedding. For ZBV, the peak memory is also on rank 0, with 8x the activation of 4 layers + the parameters of 4 layers + the parameters of the input embedding + the parameters of the lm-head. By a rough estimate, the parameter memory of the lm-head for Llama 7B is 16 (the combined parameter + grad + optimizer state factor) * h * vocabulary = 2G, which is close to the difference between ZBV and the baseline.
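The estimate above can be reproduced with a few lines of Python. This is only a sketch: the factor of 16 comes from the reply, while `hidden_size = 4096` and `vocab_size = 32000` are assumed here as the usual Llama 7B values, and the function name is made up for illustration.

```python
# Rough lm-head memory estimate, following the formula in the reply:
#   16 * h * vocabulary
# where 16 is the combined factor for parameter + grad + optimizer state.

def lm_head_memory_bytes(hidden_size: int, vocab_size: int, factor: int = 16) -> int:
    """Return the estimated memory (in bytes) held for the lm-head weights."""
    return factor * hidden_size * vocab_size

# Assumed Llama 7B dimensions (not stated in the thread).
HIDDEN_SIZE = 4096
VOCAB_SIZE = 32000

total = lm_head_memory_bytes(HIDDEN_SIZE, VOCAB_SIZE)
print(f"{total} bytes = {total / 1e9:.2f} GB = {total / 2**30:.2f} GiB")
```

With these assumed dimensions the result is about 2.1 GB (1.95 GiB), matching the "2G" figure quoted above.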

To verify this, you can enlarge the mbs (for example, double it); the memory difference between ZBV and the baseline should stay constant, because in this case only the activation memory doubles.
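The reasoning behind this check can be sketched with a toy memory model. It assumes, as the reply does, that both schedules hold the same activation memory and differ only in static memory by the lm-head term; the activation size per sample and the baseline static memory below are hypothetical placeholders, not measured values.

```python
# Toy model: peak memory = (activation memory, linear in mbs) + (static memory).
# If the two schedules differ only in static memory, their gap does not depend on mbs.

GIB = 2**30

def peak_bytes(mbs: int, act_bytes_per_sample: int, in_flight: int, static_bytes: int) -> int:
    """Peak memory for a rank holding `in_flight` microbatches of activations."""
    return mbs * act_bytes_per_sample * in_flight + static_bytes

ACT_PER_SAMPLE = GIB // 4          # hypothetical activation size of rank 0's layers per sample
IN_FLIGHT = 8                      # the "8x activation" from the reply (pp_stage=8)
STATIC_1F1B = 6 * GIB              # hypothetical: parameters of 4 layers + input embedding
LM_HEAD = 16 * 4096 * 32000        # lm-head estimate from the reply, assumed Llama 7B sizes
STATIC_ZBV = STATIC_1F1B + LM_HEAD # ZBV additionally holds the lm-head on rank 0

gaps = []
for mbs in (1, 2, 4):
    base = peak_bytes(mbs, ACT_PER_SAMPLE, IN_FLIGHT, STATIC_1F1B)
    zbv = peak_bytes(mbs, ACT_PER_SAMPLE, IN_FLIGHT, STATIC_ZBV)
    gaps.append(zbv - base)
    print(f"mbs={mbs}: 1F1B={base / GIB:.2f} GiB, ZBV={zbv / GIB:.2f} GiB, gap={(zbv - base) / GIB:.2f} GiB")
```

If the measured gap grows with mbs instead of staying flat, the difference is coming from activation memory rather than from the lm-head placement.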

Thanks!

ufotalent commented 1 month ago

BTW, the acceleration ratio looks lower than we expected. Could you share the logs so we can investigate a bit? Thanks

Edenzzzz commented 2 weeks ago

I think for 1F1B the activation is not 8x but 4x. ZBV has memory requirements similar to the interleaved schedule, which are higher than those of 1F1B because of the additional warm-up microbatches.