HuipengXu opened this issue 1 year ago

Why is it said that only ds_zero is currently doing world_size streams on world_size gpus, while accelerate and ds-inference should be doing the same as well, since they also use multiprocessing?
Hey, ds-inference is also doing world_size streams. However, accelerate is only doing 1 stream, since we are just using the naive pipeline parallelism capability from accelerate. A more efficient approach to pipeline parallelism would be overlapping microbatches in the forward pass (no backward pass is needed for inference); see the sketch below.
For example, check this image from the Megatron-LM paper. This would be more efficient when serving. I think implementing this would require multiple processes, but you might still get better throughput using DS-inference.
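To make the scheduling difference concrete, here is a minimal back-of-the-envelope sketch, assuming equal per-stage compute time and ignoring communication cost (the function names are mine, not accelerate's or Megatron-LM's actual code):

```python
def naive_pipeline_latency(num_stages: int, batch_stage_time: float = 1.0) -> float:
    # Naive PP: the full batch moves through stage 1, then stage 2, ...
    # so exactly one GPU is busy at any moment.
    return num_stages * batch_stage_time

def overlapped_pipeline_latency(num_stages: int, num_microbatches: int,
                                batch_stage_time: float = 1.0) -> float:
    # Overlapped PP: split the batch into microbatches; after a
    # (num_stages - 1)-step fill phase, every stage works on a
    # different microbatch at the same time.
    per_microbatch = batch_stage_time / num_microbatches
    return (num_stages + num_microbatches - 1) * per_microbatch

print(naive_pipeline_latency(4))          # 4.0 stage-times, one GPU busy at a time
print(overlapped_pipeline_latency(4, 8))  # 1.375 stage-times, stages overlap
```

With 4 stages and 8 microbatches the overlapped schedule takes 1.375 stage-times instead of 4; only the fill/drain bubble of num_stages - 1 microbatch-steps keeps it from a perfect num_stages speedup.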
Also, if you are really interested in exploring serving models, I would suggest using text-gen-inference. This does dynamic batching and is much more efficient.
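For intuition, here is a toy sketch of the dynamic (continuous) batching idea: requests are admitted into the running batch between decode steps instead of waiting for the whole batch to drain. Request, DynamicBatcher, and the fake one-token decode step are illustrative assumptions, not text-gen-inference's real API.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    tokens_left: int  # toy stand-in for remaining generation length
    output: list = field(default_factory=list)

class DynamicBatcher:
    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting = deque()
        self.active = []

    def submit(self, req: Request):
        self.waiting.append(req)

    def step(self):
        # Admit waiting requests between decode steps; static batching
        # would leave these slots idle until the whole batch finished.
        while self.waiting and len(self.active) < self.max_batch_size:
            self.active.append(self.waiting.popleft())
        # One decode step for the whole batch (stand-in for a forward pass).
        for req in self.active:
            req.output.append("<tok>")
            req.tokens_left -= 1
        done = [r for r in self.active if r.tokens_left == 0]
        self.active = [r for r in self.active if r.tokens_left > 0]
        return done

batcher = DynamicBatcher(max_batch_size=2)
for prompt, length in [("a", 3), ("b", 1), ("c", 2)]:
    batcher.submit(Request(prompt, length))
while batcher.active or batcher.waiting:
    for req in batcher.step():
        print(f"finished {req.prompt} after {len(req.output)} tokens")
```

Note how request "c" starts decoding as soon as "b" finishes, rather than after "a" does; that slot reuse is where the serving throughput gain comes from.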