huggingface / transformers-bloom-inference

Fast Inference Solutions for BLOOM
Apache License 2.0

How to understand this note: "note: Since Deepspeed-ZeRO can process multiple generate streams in parallel its throughput can be further divided by 8 or 16 ..." #99

Open HuipengXu opened 1 year ago

HuipengXu commented 1 year ago

Why is it said that only ds_zero is currently doing world_size streams on world_size GPUs? Shouldn't accelerate and ds-inference be doing the same, since they also use multiprocessing?

mayank31398 commented 1 year ago

Hey, ds-inference is also doing world_size streams. However, accelerate is only doing 1 stream, since we are just using the naive pipeline parallelism capability from accelerate. A more efficient approach for pipeline parallelism would be to overlap microbatches in the forward pass (no backward pass is needed).
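To make "world_size streams" concrete, here is a minimal sketch of per-rank generation under DeepSpeed-ZeRO, loosely following the pattern of this repo's ZeRO inference script. The model name, prompts, and config values are placeholders, and the exact loading code depends on your DeepSpeed/transformers versions:

```python
# Minimal sketch (not the repo's exact script): each rank runs generate() on a
# *different* slice of the prompts, so with 8 GPUs you get 8 parallel streams.
# Assumed launch: `deepspeed --num_gpus 8 this_script.py`.
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig  # transformers.integrations in newer versions

model_name = "bigscience/bloom"  # any BLOOM checkpoint
rank = int(os.getenv("RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
local_rank = int(os.getenv("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

# ZeRO-3 shards the parameters across all GPUs (values are illustrative).
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
}
hf_ds_cfg = HfDeepSpeedConfig(ds_config)  # must be created before from_pretrained

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()

# 16 prompts split round-robin: with world_size=8 every rank generates for 2 of
# them, i.e. the generate streams run in parallel instead of one after another.
all_prompts = ["DeepSpeed is", "BLOOM was trained on", "The capital of France is",
               "Pipeline parallelism means"] * 4
my_prompts = all_prompts[rank::world_size]

inputs = tokenizer(my_prompts, return_tensors="pt", padding=True).to(local_rank)
with torch.no_grad():
    # synced_gpus keeps the ranks in lockstep, which ZeRO-3 generation needs
    outputs = engine.module.generate(**inputs, max_new_tokens=20, synced_gpus=True)
print(f"rank {rank}:", tokenizer.batch_decode(outputs, skip_special_tokens=True))
```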

As an example of overlapping microbatches, check this figure from the Megatron-LM paper. This would be more efficient when serving. I think implementing it would require multiple processes, but you might still get better throughput using DS-inference.

[Figure: pipeline-parallel schedule with overlapping microbatches, from the Megatron-LM paper]
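A rough way to see why overlapping helps, independent of any framework: a toy two-stage pipeline where each stage is a thread pulling microbatches from a queue. While stage 0 works on microbatch i+1, stage 1 is already processing microbatch i, so both stages stay busy. The sleep call stands in for a chunk of transformer layers; none of this is Megatron-LM or accelerate code:

```python
# Toy forward-only pipeline with overlapping microbatches.
import threading
import queue
import time

def stage(name, fn, in_q, out_q):
    """Repeatedly pull a microbatch, run this stage's layers, pass it on."""
    while True:
        item = in_q.get()
        if item is None:          # sentinel: shut the stage down
            out_q.put(None)
            break
        idx, x = item
        y = fn(x)                 # stand-in for this stage's share of the layers
        print(f"{name} finished microbatch {idx}")
        out_q.put((idx, y))

def fake_layers(x):
    time.sleep(0.1)               # pretend this is a chunk of transformer layers
    return x + 1

q01, q12, q_out = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=stage, args=("stage-0", fake_layers, q01, q12)),
    threading.Thread(target=stage, args=("stage-1", fake_layers, q12, q_out)),
]
for t in threads:
    t.start()

# Feed 4 microbatches. With naive pipeline parallelism (a single stream) only
# one stage would be active at any time; here the stages overlap.
for i in range(4):
    q01.put((i, i))
q01.put(None)

results = []
while True:
    item = q_out.get()
    if item is None:
        break
    results.append(item)
for t in threads:
    t.join()
print(sorted(results))
```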

Also, if you are really interested in serving models, I would suggest using text-generation-inference. It does dynamic batching and is much more efficient.
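For reference, a minimal sketch of querying a text-generation-inference server from Python, assuming a server is already running locally and serving a BLOOM checkpoint; the URL, port, and generation parameters below are placeholders (see the TGI README for the actual launch command):

```python
# Minimal sketch, assuming `pip install text-generation` and a
# text-generation-inference server already running at this address.
from text_generation import Client

client = Client("http://127.0.0.1:8080")

# A single request; the server batches concurrent requests dynamically,
# so many such clients can be served at once without a fixed static batch.
response = client.generate("DeepSpeed is", max_new_tokens=20)
print(response.generated_text)
```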