Open sleepwalker2017 opened 10 months ago
We do not currently support pipeline parallelism with MII.
Thank you. I saw this tutorial (https://www.deepspeed.ai/tutorials/pipeline/) for DeepSpeed. What is needed to manually implement pipeline parallelism in DeepSpeed?
The tutorial you linked provides an example of pipeline parallelism. However, the pipeline parallelism implemented in DeepSpeed is intended for training rather than inference. For inference, we do model parallelism via tensor parallelism. Here is an example of how to do this with MII:
import mii
client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=2)
Is there a specific reason you want to implement pipeline parallelism for inference?
Thank you for the explanation. We may want to run LLM on multiple nodes each with multiple GPUs, the best solution may be PP between nodes and TP within nodes. Seems this feature is not widely supported by inference frameworks.
How large are the models you want to run? An alternative approach that you can try right now with MII is to run multiple model replicas, each with tensor parallelism. This is similar to data parallelism + tensor parallelism. But the proper setup to get maximum performance from your hardware will likely depend on the model size.
import mii
client = mii.serve(model_name, replica_num=4, tensor_parallel=2)
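To make the resource math above concrete, here is a small sketch of the GPU accounting implied by these settings. The numbers (4 replicas, tensor-parallel degree 2, 2 GPUs per node) are taken from the setup discussed in this thread; the variable names are illustrative, not MII API:

```python
# Each replica is sharded across tensor_parallel GPUs, so the total GPU
# requirement is replica_num * tensor_parallel. With 2 GPUs per node,
# replica_num=4 and tensor_parallel=2 fills a 4-node cluster exactly.
replica_num = 4
tensor_parallel = 2
gpus_per_node = 2

total_gpus = replica_num * tensor_parallel   # 8 GPUs in total
nodes_needed = total_gpus // gpus_per_node   # 4 nodes
```

With this layout each replica fits entirely on one node, so tensor-parallel all-reduce traffic stays within a node.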
Thank you. It's a Llama 13B model. Limited by our machine configuration, we only have two GPUs on each node. To support larger batch sizes, we are trying this solution. By setting replica_num=4, communication between nodes is avoided, right? Anyway, we will try to equip more GPUs on a single node.
If you do replica_num=4, tensor_parallel=2 on a 4-node setup (each with 2 GPUs), there will still be some communication between nodes. The load balancer does simple round-robin scheduling (e.g., request_1 will be sent to replica_1, request_2 will be sent to replica_2, etc.).
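The round-robin behavior described above can be sketched in a few lines of plain Python. This is only an illustration of the scheduling policy, not MII's actual load-balancer code; the replica names mirror the example in the comment:

```python
from itertools import cycle

# Illustrative round-robin scheduler: requests are assigned to replicas
# in order, wrapping back to the first replica after the last one.
replicas = ["replica_1", "replica_2", "replica_3", "replica_4"]
scheduler = cycle(replicas)

def route(request_id):
    # The request id is unused: routing depends only on arrival order.
    return next(scheduler)

assignments = [route(f"request_{i}") for i in range(1, 6)]
# request_1 -> replica_1, request_2 -> replica_2, ...,
# request_5 wraps around to replica_1
```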
We don't have the multi-node scenario well documented. I'm currently working on building out our docs in #321, so look for updates soon! In the meantime, to get multi-node working, you will need to define a hostfile (the default location is /job/hostfile, but you can specify one by passing hostfile=/path/to/hostfile to mii.serve()) that looks something like this:
node0 slots=2
node1 slots=2
node2 slots=2
node3 slots=2
Here the first value is the name of the node, which must be accessible via passwordless SSH (e.g., ssh node1), and slots=2 indicates how many GPUs are available on that node.
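As a quick sanity check on the hostfile format above, here is a small sketch that parses the "name slots=N" lines and tallies the GPUs. The parsing logic is a hypothetical illustration, not the DeepSpeed launcher's actual implementation:

```python
# Parse a DeepSpeed-style hostfile: one "node_name slots=N" entry per line.
def parse_hostfile(text):
    nodes = {}
    for line in text.strip().splitlines():
        name, slots_field = line.split()
        # "slots=2" -> 2 GPUs available on that node
        nodes[name] = int(slots_field.split("=")[1])
    return nodes

hostfile = """\
node0 slots=2
node1 slots=2
node2 slots=2
node3 slots=2
"""

nodes = parse_hostfile(hostfile)
total_gpus = sum(nodes.values())  # 8 GPUs across 4 nodes
```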
We use the DeepSpeed launcher for multi-node, so take a look at the documentation here: https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node
And please reach out if you run into any problems. We're still actively developing features and are happy to help!
Thank you! I read the document and I'll try that if we decide to run inference on multiple nodes.
@mrwyattii Any update for #321? I have a similar scenario: two nodes with a single A10 GPU on each node, and we want to serve Llama-13B with parallelism. Is this supported?
The tutorial you linked provides an example of pipeline parallelism. However, the pipeline parallelism implemented in DeepSpeed is intended for training rather than inference.
@mrwyattii Just curious, what is the implementation difference between training PP and inference PP? Why can the PP implementation in training not be used for inference?
@mrwyattii Hello, sorry to bother you. Have you implemented multi-node inference yet?
I didn't see any documentation that mentions that.