Open · LSC527 opened this issue 1 year ago
DeepSpeed Chat uses tensor parallelism via the hybrid engine to generate sequences in stage-3 training. I wonder whether just using ZeRO-3 inference for generation would be OK, so that we don't need to transform model parameters between train and eval mode. Any explanation of the design of stage-3 training would be appreciated. Thanks.

Sorry to bother you. I reviewed the doc and I see that this transform is used to speed up the generation process. But why not use tensor parallelism for both training and generation?
TP alone would be insufficient to fit large models for training, and so we need ZeRO-3 and, in some cases, offloading. TP could be combined with ZeRO-3 for training, but that requires TP support in the models, which does not exist for the HF models used in our examples. Our current (automatic) TP design used in the hybrid engine is not yet general enough to support training.
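For readers landing here with the same question, the following is a rough sketch of the kind of stage-3 configuration this answer describes: ZeRO-3 with optional CPU offloading for training, plus a hybrid engine block that drives tensor-parallel generation. It is only an illustration; the key names (e.g. `inference_tp_size`, `release_inference_cache`) are taken from memory of the DeepSpeed-Chat helper scripts and may differ across versions.

```python
# Illustrative stage-3 (RLHF) DeepSpeed config: ZeRO-3 + offload for training,
# hybrid engine for fast generation. Key names follow DeepSpeed-Chat's helpers
# and may vary between releases -- treat this as a sketch, not a reference.
ds_config = {
    "train_batch_size": 32,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition params/grads/optimizer states
        "offload_param": {"device": "cpu"},      # optional: offload parameters to CPU
        "offload_optimizer": {"device": "cpu"},  # optional: offload optimizer states
        "stage3_param_persistence_threshold": 1e4,
    },
    "hybrid_engine": {
        "enabled": True,           # switch between train mode and inference-optimized mode
        "inference_tp_size": 8,    # tensor-parallel degree used only for generation
        "release_inference_cache": False,
        "pin_parameters": True,
        "tp_gather_partition_size": 8,
    },
}
```

In the DeepSpeed-Chat stage-3 training script this corresponds (if I remember the flag names correctly) to options like `--enable_hybrid_engine` and `--inference_tp_size`; leaving the hybrid engine disabled makes generation fall back to plain ZeRO-3 inference, which is exactly the alternative the original question asks about.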
BIG thanks for your reply @tjruwase! BTW, why isn't the hybrid engine supported for llama2 70b yet? Is it because of the grouped-query attention used in llama2 70b?
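For completeness, the alternative raised in the original question, generating directly from the ZeRO-3 training engine without the hybrid engine, looks roughly like the sketch below. Names such as `actor_engine` and `prompt_ids` are placeholders, not DeepSpeed-Chat identifiers. The point is that ZeRO-3 can serve generation, but every decoding step has to re-gather the partitioned weights, which is the slowdown the hybrid engine's train/eval transform is designed to avoid.

```python
# Sketch: generation straight from a ZeRO-3 engine (no hybrid engine).
# `actor_engine` stands in for the deepspeed.initialize(...) result wrapping
# the actor model; `prompt_ids` is a batch of tokenized prompts.
import torch

actor_engine.eval()  # no parameter transform; weights stay ZeRO-3 partitioned
with torch.no_grad():
    # This works under ZeRO-3: DeepSpeed gathers each layer's partitioned
    # parameters on the fly for every forward pass, which is what makes
    # token-by-token decoding slow compared to the hybrid engine path.
    sequences = actor_engine.module.generate(
        prompt_ids,
        max_new_tokens=256,
        do_sample=True,
        synced_gpus=True,  # keep all ranks stepping together under ZeRO-3
    )
```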