UncleFB opened this issue 1 year ago
`enable_zero` turns on ZeRO-Inference (and uses the config in `ds_config`), while `enable_deepspeed` uses DeepSpeed-Inference. Using DeepSpeed-Inference is the default behavior in MII. More on the differences between the two below:
ZeRO-Inference (latest release here) primarily targets cases where we want to run inference with very large models on very limited GPU memory. It takes advantage of ZeRO offloading capabilities to move most of the model weights to CPU memory (or even NVMe storage). Because there is overhead associated with offloading weights, it is typically not well suited for use cases where low-latency inference is a priority.
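As a rough illustration, here is a minimal `ds_config` sketch for ZeRO-Inference with CPU offload. The specific values (dtype, batch size, pinning) are assumptions for the example and should be adapted to your model and hardware:

```python
# Minimal ZeRO-Inference config sketch: ZeRO stage 3 with parameter
# offload to CPU. Swapping "cpu" for "nvme" (plus an nvme_path) would
# offload weights to NVMe storage instead.
ds_config = {
    "fp16": {"enabled": True},            # assumes the model runs in fp16
    "zero_optimization": {
        "stage": 3,                        # stage 3 shards/offloads parameters
        "offload_param": {
            "device": "cpu",               # keep most weights in CPU memory
            "pin_memory": True,            # pinned host memory speeds up transfers
        },
    },
    "train_micro_batch_size_per_gpu": 1,   # commonly included even for inference
}
```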
DeepSpeed-Inference is a separate engine that introduces many optimizations for running inference. For example, we support custom kernel injection on tens of thousands of models, which can significantly improve latency and throughput. This will likely be your best bet for achieving the lowest latency when doing inference, but at the cost of needing much more GPU memory.
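To make the switch concrete, here is a sketch of how the two modes are selected with the legacy `mii.deploy` API. The model and deployment names are placeholders, parameter details may differ across MII versions, and the two calls show the two mutually exclusive configurations (you would use one or the other):

```python
import mii

# Default path: DeepSpeed-Inference with kernel injection
# (lowest latency, but the model must fit in GPU memory).
mii.deploy(
    task="text-generation",
    model="bigscience/bloom-560m",         # placeholder model
    deployment_name="bloom_ds_inference",
    enable_deepspeed=True,                  # the default in MII
)

# Alternative path: ZeRO-Inference, with weights offloaded per ds_config
# (fits very large models in limited GPU memory, at a latency cost).
mii.deploy(
    task="text-generation",
    model="bigscience/bloom-560m",         # placeholder model
    deployment_name="bloom_zero_inference",
    enable_deepspeed=False,                 # the two switches are mutually exclusive
    enable_zero=True,
    ds_config=ds_config,                    # e.g. the ZeRO offload config sketched above
)
```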
Thank you for the patient answer.
I noticed that the `enable_deepspeed` and `enable_zero` switches are mutually exclusive. What is the difference between enabling ZeRO and configuring it through `ds_config`, versus using DeepSpeed-Inference directly?