[Open] YuWang916 opened this issue 4 months ago
@YuWang916 did you figure this out? Working on the same thing. cc: @simon-mo
Check out the Ray Data example here, which does the sharding automatically for you.
https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_distributed.py
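For reference, here is a minimal sketch along the lines of that example: Ray Data fans the prompts out to several single-GPU vLLM replicas, so each replica processes its own shard of the data. The model name, input file, replica count, and batch size below are illustrative assumptions, not values taken from the example.

```python
import ray
from vllm import LLM, SamplingParams


class LLMPredictor:
    def __init__(self):
        # Each Ray actor loads its own full copy of the model on one GPU.
        self.llm = LLM(model="meta-llama/Llama-2-7b-hf")  # assumed model
        self.sampling = SamplingParams(temperature=0.0, max_tokens=128)

    def __call__(self, batch):
        outputs = self.llm.generate(batch["text"].tolist(), self.sampling)
        batch["generated_text"] = [o.outputs[0].text for o in outputs]
        return batch


# Assumed input: one prompt per line; read_text produces a "text" column.
ds = ray.data.read_text("prompts.txt")

ds = ds.map_batches(
    LLMPredictor,
    concurrency=4,   # number of replicas = data-parallel degree (assumed)
    num_gpus=1,      # one GPU per replica
    batch_size=32,
)

ds.write_json("output")  # one JSON file per output block
```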
Thanks @simon-mo - I did see that example. What I'm curious about is whether you can do both tensor parallelism and data parallelism together. For example: if I'm loading a 70B-parameter model onto 4x 40GB GPUs, can I both use tensor parallelism to split the model across the GPUs and have 4 vLLM instantiations, one per GPU? Intuitively it seems like the answer should be no, but you may know better.
Once TP is enabled, the model is sharded across the GPUs and no single rank holds the full weights. Curious to hear the reasoning behind this setup?
I want to maximize throughput for batch use cases. I figured that if vLLM sits on one GPU and has access to the weights across all four (in the case above), it could theoretically be parallelized (probably with something less than a 4x gain because of clock cycles). Am I crazy, or is something like this possible?
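For intuition, here is a rough, back-of-the-envelope memory check for the 70B-on-4x40GB scenario discussed above (it ignores activations and KV-cache overhead, so it is only a sketch):

```python
# Rough weight-memory arithmetic for a 70B-parameter model in fp16/bf16.
params = 70e9
bytes_per_param = 2

weight_gib = params * bytes_per_param / 2**30
print(f"total weights:        {weight_gib:.0f} GiB")      # ~130 GiB
print(f"per GPU with TP=4:    {weight_gib / 4:.0f} GiB")  # ~33 GiB, fits in 40 GB

# => the full model cannot fit on a single 40 GB GPU, so "one vLLM instance
#    per GPU" (pure data parallelism) is not possible here. TP=4 is required,
#    which means data parallelism has to happen across *groups* of 4 GPUs
#    (or across nodes), not across individual GPUs.
```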
Your current environment
How would you like to use vllm
I want to run offline inference of model mistralai/Mixtral-8x22B-v0.1. I have multiple nodes and within each node, there are 8 A100 GPUs. My data is pretty big, and I would like to run offline inference using multiple nodes, so that
data parallel size = # of nodes
tensor parallel size = 8 within each node (tensor_parallel_size=8)
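A minimal sketch of one way to set this up, assuming each node independently runs the script below on its own shard of the prompts. The NODE_RANK/NUM_NODES environment variables, file names, and sampling settings are illustrative assumptions set by whatever launcher you use; vLLM itself does not coordinate the nodes here.

```python
import json
import os

from vllm import LLM, SamplingParams

node_rank = int(os.environ["NODE_RANK"])   # 0 .. NUM_NODES-1, set by your launcher (assumed)
num_nodes = int(os.environ["NUM_NODES"])

with open("prompts.txt") as f:             # assumed input: one prompt per line
    prompts = [line.strip() for line in f if line.strip()]

# Data parallelism across nodes: node `node_rank` handles every `num_nodes`-th prompt.
shard = prompts[node_rank::num_nodes]

# Tensor parallelism within the node: shard Mixtral-8x22B across the 8 A100s.
llm = LLM(model="mistralai/Mixtral-8x22B-v0.1", tensor_parallel_size=8)
sampling = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(shard, sampling)    # results come back in input order
with open(f"outputs_rank{node_rank}.jsonl", "w") as f:
    for prompt, out in zip(shard, outputs):
        f.write(json.dumps({"prompt": prompt, "output": out.outputs[0].text}) + "\n")
```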