HuipengXu opened this issue 1 year ago

Why is it said that only ds_zero is currently doing world_size streams on world_size gpus, while accelerate and ds-inference should be doing the same as well, since they also use multiprocessing?
Hey, ds-inference is also doing world_size streams. However, accelerate is only doing 1 stream, since we are just using the naive pipeline parallelism capability from accelerate. A more efficient approach to pipeline parallelism would be overlapping microbatches in the forward pass (no backward pass is needed for inference); see the sketch below.
For example, check this image from the Megatron-LM paper. This would be more efficient when serving. I think implementing this would require multiple processes, but you might still get better throughput using DS-inference.
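To make the scheduling difference concrete, here is a minimal back-of-the-envelope sketch, assuming equal per-stage compute time and ignoring communication cost (the function names are mine, not accelerate's or Megatron-LM's actual code):

```python
def naive_pipeline_latency(num_stages: int, batch_stage_time: float = 1.0) -> float:
    # Naive PP: the full batch moves through stage 1, then stage 2, ...
    # so exactly one GPU is busy at any moment.
    return num_stages * batch_stage_time

def overlapped_pipeline_latency(num_stages: int, num_microbatches: int,
                                batch_stage_time: float = 1.0) -> float:
    # Overlapped PP: split the batch into microbatches; after a
    # (num_stages - 1)-step fill phase, every stage works on a
    # different microbatch at the same time.
    per_microbatch = batch_stage_time / num_microbatches
    return (num_stages + num_microbatches - 1) * per_microbatch

print(naive_pipeline_latency(4))          # 4.0 stage-times, one GPU busy at a time
print(overlapped_pipeline_latency(4, 8))  # 1.375 stage-times, stages overlap
```

With 4 stages and 8 microbatches the overlapped schedule takes 1.375 stage-times instead of 4; only the fill/drain bubble of num_stages - 1 microbatch-steps keeps it from a perfect num_stages speedup.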
Also, if you are really interested in exploring serving models, I would suggest using text-gen-inference. This does dynamic batching and is much more efficient.
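For intuition, here is a toy sketch of the dynamic (continuous) batching idea: requests are admitted into the running batch between decode steps instead of waiting for the whole batch to drain. Request, DynamicBatcher, and the fake one-token decode step are illustrative assumptions, not text-gen-inference's real API.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    tokens_left: int  # toy stand-in for remaining generation length
    output: list = field(default_factory=list)

class DynamicBatcher:
    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting = deque()
        self.active = []

    def submit(self, req: Request):
        self.waiting.append(req)

    def step(self):
        # Admit waiting requests between decode steps; static batching
        # would leave these slots idle until the whole batch finished.
        while self.waiting and len(self.active) < self.max_batch_size:
            self.active.append(self.waiting.popleft())
        # One decode step for the whole batch (stand-in for a forward pass).
        for req in self.active:
            req.output.append("<tok>")
            req.tokens_left -= 1
        done = [r for r in self.active if r.tokens_left == 0]
        self.active = [r for r in self.active if r.tokens_left > 0]
        return done

batcher = DynamicBatcher(max_batch_size=2)
for prompt, length in [("a", 3), ("b", 1), ("c", 2)]:
    batcher.submit(Request(prompt, length))
while batcher.active or batcher.waiting:
    for req in batcher.step():
        print(f"finished {req.prompt} after {len(req.output)} tokens")
```

Note how request "c" starts decoding as soon as "b" finishes, rather than after "a" does; that slot reuse is where the serving throughput gain comes from.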