microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Apache License 2.0

Is pipeline parallelism supported? #329

Open sleepwalker2017 opened 10 months ago

sleepwalker2017 commented 10 months ago

I didn't see any documentation that mentions that.

mrwyattii commented 10 months ago

We do not currently support pipeline parallelism with MII.

sleepwalker2017 commented 10 months ago

> We do not currently support pipeline parallelism with MII.

Thank you. I see this manual (https://www.deepspeed.ai/tutorials/pipeline/) for DeepSpeed. What is needed to manually implement pipeline parallelism in DeepSpeed?

mrwyattii commented 10 months ago

The tutorial you linked provides an example of pipeline parallelism. However, the pipeline parallelism implemented in DeepSpeed is intended for training rather than inference. For inference we do model parallelism with tensor parallelism. Here is an example of how to do this with MII:

```python
import mii
client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=2)
```
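
As a follow-up sketch (the prompt and generation parameters below are illustrative, not from this thread), the client returned by mii.serve() can then be used to send requests and to shut the deployment down:

```python
# Query the persistent deployment started by mii.serve() above.
response = client.generate(["DeepSpeed is"], max_new_tokens=128)
print(response)

# Tear the deployment down when finished.
client.terminate_server()
```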

Is there a specific reason you want to implement pipeline parallelism for inference?

sleepwalker2017 commented 10 months ago

> The tutorial you linked provides an example of pipeline parallelism. However, the pipeline parallelism implemented in DeepSpeed is intended for training rather than inference. For inference we do model parallelism with tensor parallelism. Here is an example of how to do this with MII:
>
> ```python
> import mii
> client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=2)
> ```
>
> Is there a specific reason you want to implement pipeline parallelism for inference?

Thank you for the explanation. We may want to run an LLM on multiple nodes, each with multiple GPUs; the best solution may be pipeline parallelism (PP) between nodes and tensor parallelism (TP) within each node. It seems this feature is not widely supported by inference frameworks.

mrwyattii commented 10 months ago

How large are the models you want to run? An alternative approach which you can try right now with MII is to have multiple model replicas with tensor parallelism. This would be similar to data parallelism + tensor parallelism. But the proper setup to get maximum performance from your hardware will likely depend on the model size.

```python
import mii
client = mii.serve(model_name, replica_num=4, tensor_parallel=2)
```
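
For illustration, here is a hedged sketch of that setup (the model name and prompts are placeholders, not from this thread; separate processes can attach to the running deployment with mii.client()):

```python
import mii

# Hypothetical example: 4 replicas, each sharded across 2 GPUs (8 GPUs total).
model_name = "meta-llama/Llama-2-13b-hf"  # placeholder
client = mii.serve(model_name, replica_num=4, tensor_parallel=2)

# From a separate process, attach to the same deployment by name:
# client = mii.client(model_name)

responses = client.generate(["DeepSpeed is", "Seattle is"], max_new_tokens=64)
print(responses)
```
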
sleepwalker2017 commented 10 months ago

> How large are the models you want to run? An alternative approach which you can try right now with MII is to have multiple model replicas with tensor parallelism. This would be similar to data parallelism + tensor parallelism. But the proper setup to get maximum performance from your hardware will likely depend on the model size.
>
> ```python
> import mii
> client = mii.serve(model_name, replica_num=4, tensor_parallel=2)
> ```

Thank you. It's a Llama 13B model. Limited by the machine configuration, we only have two GPUs on each node. To support larger batch sizes, we are trying this solution. By setting replica_num=4, communication between nodes is avoided, right? Anyway, we will try to fit more GPUs on a single node.
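
As a rough back-of-envelope check (my own estimate, not from the thread), the fp16 weights of a 13B-parameter model occupy about 26 GB, so tensor_parallel=2 leaves roughly 13 GB of weights per GPU before the KV cache and activations:

```python
# Rough fp16 memory estimate for the weights of a 13B-parameter model.
params = 13e9                 # parameter count
bytes_per_param = 2           # fp16
total_gb = params * bytes_per_param / 1e9
per_gpu_gb = total_gb / 2     # tensor_parallel=2 shards the weights across 2 GPUs
print(f"weights: ~{total_gb:.0f} GB total, ~{per_gpu_gb:.0f} GB per GPU")
```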

mrwyattii commented 10 months ago

If you do replica_num=4, tensor_parallel=2 on a 4-node setup (each node with 2 GPUs), there will still be some communication between nodes. The load balancer does simple round-robin scheduling (e.g., request_1 will be sent to replica_1, request_2 will be sent to replica_2, etc.).

We don't have the multi-node scenario well documented. I'm currently working on building out our docs in #321, so look for updates soon! In the meantime, to get multi-node working, you will need to define a hostfile (the default location is /job/hostfile, but you can specify a different path by passing hostfile=/path/to/hostfile to mii.serve()) that looks something like this:

```
node0 slots=2
node1 slots=2
node2 slots=2
node3 slots=2
```

The first value is the name of a node that can be accessed via passwordless ssh (e.g., ssh node1), and slots=2 indicates how many GPUs are available on that node.
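
Putting the pieces together, a multi-node launch might look roughly like the sketch below (the model name is a placeholder; hostfile= is passed as described above):

```python
import mii

# One replica per node, each replica sharded across that node's 2 GPUs.
client = mii.serve(
    "meta-llama/Llama-2-13b-hf",  # placeholder model
    hostfile="/job/hostfile",     # default path; pass your own if different
    replica_num=4,
    tensor_parallel=2,
)
```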

We use the DeepSpeed launcher for multi-node, so take a look at the documentation here: https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node

And please reach out if you run into any problems. We're still actively developing features and are happy to help!

sleepwalker2017 commented 10 months ago

> If you do replica_num=4, tensor_parallel=2 on a 4-node setup (each node with 2 GPUs), there will still be some communication between nodes. The load balancer does simple round-robin scheduling (e.g., request_1 will be sent to replica_1, request_2 will be sent to replica_2, etc.).
>
> We don't have the multi-node scenario well documented. I'm currently working on building out our docs in #321, so look for updates soon! In the meantime, to get multi-node working, you will need to define a hostfile (the default location is /job/hostfile, but you can specify a different path by passing hostfile=/path/to/hostfile to mii.serve()) that looks something like this:
>
> ```
> node0 slots=2
> node1 slots=2
> node2 slots=2
> node3 slots=2
> ```
>
> The first value is the name of a node that can be accessed via passwordless ssh (e.g., ssh node1), and slots=2 indicates how many GPUs are available on that node.
>
> We use the DeepSpeed launcher for multi-node, so take a look at the documentation here: https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node
>
> And please reach out if you run into any problems. We're still actively developing features and are happy to help!

Thank you! I read the document and I'll try that if we decide to run inference on multiple nodes.

cermeng commented 8 months ago

@mrwyattii Any update on #321? I have a similar scenario: two nodes, each with a single A10 GPU, and we want to serve Llama-13B with parallelism. Is this supported?

Jeffwan commented 6 months ago

> The tutorial you linked provides an example of pipeline parallelism. However, the pipeline parallelism implemented in DeepSpeed is intended for training rather than inference.

@mrwyattii Just curious: what is the implementation difference between training PP and inference PP? Why can the PP implementation used for training not be used for inference?

JKYtydt commented 3 months ago

@mrwyattii Hello, sorry to bother you. I was wondering whether you have implemented multi-node inference yet?