Open · LSC527 opened this issue 1 year ago
DeepSpeed Chat uses tensor parallelism via the hybrid engine to generate sequences in stage-3 training. I wonder whether just using ZeRO-3 inference for generation would be OK, so that we don't need to transform model parameters between train and eval mode. Any explanation of the design of stage-3 training would be appreciated. Thanks.

Sorry to bother you. I reviewed the doc and I see that this transform is used to speed up the generation process. But why not use tensor parallelism for both training and generation?
TP alone would be insufficient to fit large models for training, and so we need ZeRO-3 and, in some cases, offloading. TP could be combined with ZeRO-3 for training, but that requires TP support in the models, which does not exist for the HF models used in our examples. Our current (automatic) TP design used in the hybrid engine is not yet general enough to support training.
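For readers landing here with the same question, the following is a rough sketch of the kind of stage-3 configuration this answer describes: ZeRO-3 with optional CPU offloading for training, plus a hybrid engine block that drives tensor-parallel generation. It is only an illustration; the key names (e.g. `inference_tp_size`, `release_inference_cache`) are taken from memory of the DeepSpeed-Chat helper scripts and may differ across versions.

```python
# Illustrative stage-3 (RLHF) DeepSpeed config: ZeRO-3 + offload for training,
# hybrid engine for fast generation. Key names follow DeepSpeed-Chat's helpers
# and may vary between releases -- treat this as a sketch, not a reference.
ds_config = {
    "train_batch_size": 32,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition params/grads/optimizer states
        "offload_param": {"device": "cpu"},      # optional: offload parameters to CPU
        "offload_optimizer": {"device": "cpu"},  # optional: offload optimizer states
        "stage3_param_persistence_threshold": 1e4,
    },
    "hybrid_engine": {
        "enabled": True,           # switch between train mode and inference-optimized mode
        "inference_tp_size": 8,    # tensor-parallel degree used only for generation
        "release_inference_cache": False,
        "pin_parameters": True,
        "tp_gather_partition_size": 8,
    },
}
```

In the DeepSpeed-Chat stage-3 training script this corresponds (if I remember the flag names correctly) to options like `--enable_hybrid_engine` and `--inference_tp_size`; leaving the hybrid engine disabled makes generation fall back to plain ZeRO-3 inference, which is exactly the alternative the original question asks about.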
BIG thanks for your reply @tjruwase! BTW, why isn't the hybrid engine supported for llama2 70b yet? Is it because of the grouped-query attention used in llama2 70b?
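For completeness, the alternative raised in the original question, generating directly from the ZeRO-3 training engine without the hybrid engine, looks roughly like the sketch below. Names such as `actor_engine` and `prompt_ids` are placeholders, not DeepSpeed-Chat identifiers. The point is that ZeRO-3 can serve generation, but every decoding step has to re-gather the partitioned weights, which is the slowdown the hybrid engine's train/eval transform is designed to avoid.

```python
# Sketch: generation straight from a ZeRO-3 engine (no hybrid engine).
# `actor_engine` stands in for the deepspeed.initialize(...) result wrapping
# the actor model; `prompt_ids` is a batch of tokenized prompts.
import torch

actor_engine.eval()  # no parameter transform; weights stay ZeRO-3 partitioned
with torch.no_grad():
    # This works under ZeRO-3: DeepSpeed gathers each layer's partitioned
    # parameters on the fly for every forward pass, which is what makes
    # token-by-token decoding slow compared to the hybrid engine path.
    sequences = actor_engine.module.generate(
        prompt_ids,
        max_new_tokens=256,
        do_sample=True,
        synced_gpus=True,  # keep all ranks stepping together under ZeRO-3
    )
```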