I think "tensor_parallel_size" means that the model itself is split across multiple GPUs. Not that you are running "tensor_parallel_size" number of models in parallel. As I understand it, this can introduce a lot of throughput issues because you now need to send data between the GPUs. NVLink is supposed to help, but you it's probably still slower than running the model on a single GPU. As i understand it, people use tensor parallelism only when they need to, for example if the model is too big to fit in a single GPU.
@tweeter0830 that's the Amazon SageMaker / Megatron terminology; unsure if it applies here.
I think the terminology is the same based on documentation and the code here: https://github.com/vllm-project/vllm/blob/d189170b6c5a143e493c3f5cb7e8c976e8da62c7/vllm/model_executor/parallel_utils/parallel_state.py#L19
That sounds like model parallelism, not data parallelism.
So can vLLM help us do data parallelism rather than model parallelism, since my model is only 1B?
Tagging @WoosukKwon for a more elegant solution. I am trying to create a cluster to match ChatGPT levels of throughput (i.e. roughly 240k tokens/min). We ran some tests on the Phind/Phind-CodeLlama-34B-v2 model using the throughput benchmark script. Here are the results:
GPU Configuration | Requests per Second | Tokens per Second | Tokens per Minute |
---|---|---|---|
1xA100-40GB | OOM | OOM | OOM |
2xA100-40GB | 2.75 | 1315.86 | 78,000 |
4xA100-40GB | 4.76 | 2277.93 | 136,000 |
8xA100-40GB | 4.31 | 2061.69 | 124,000 |
1xA100-80GB | 2.33 | 1115.13 | 67,000 |
2xA100-80GB | 4.05 | 1936.34 | 116,000 |
4xA100-80GB | 4.91 | 2346.69 | 140,000 |
8xA100-80GB | 4.40 | 2102.29 | 126,000 |
The 8xA100 performance is worse than 4xA100, which is understandable if we are doing model sharding. What would be the best way to increase throughput in this case?
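For anyone reproducing these numbers, here is a rough sketch of how tokens/s can be measured through the Python API. This is an approximation rather than the exact benchmark script we used, and the prompts are placeholders:

```python
import time
from vllm import LLM, SamplingParams

# Placeholder prompts; the real benchmark samples prompts from a dataset.
prompts = ["Write a function that reverses a string."] * 256
params = SamplingParams(temperature=0.8, max_tokens=256)

llm = LLM(model="Phind/Phind-CodeLlama-34B-v2", tensor_parallel_size=4)

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

# Count prompt + generated tokens, matching a tokens-per-second style metric.
total_tokens = sum(
    len(o.prompt_token_ids) + len(o.outputs[0].token_ids) for o in outputs
)
print(f"{len(prompts) / elapsed:.2f} req/s, {total_tokens / elapsed:.2f} tokens/s")
```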
@nidhishs @SinclairCoder @cassanof
If you're looking to maximize LLM throughput, LiteLLM now has a router to load balance requests (I'd love feedback if people on this thread are trying to do this).
Here's the quick start doc: https://docs.litellm.ai/docs/simple_proxy#model-alias
model_list:
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8002
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8003
litellm --config /path/to/config.yaml
curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model": "zephyr-beta",
  "messages": [
    {
      "role": "user",
      "content": "what llm are you"
    }
  ]
}'
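The `api_base` entries in the config above assume one vLLM OpenAI-compatible server per port. A minimal sketch of starting those backends with one GPU each; the entrypoint and flags reflect my understanding of vLLM's OpenAI server, and the ports and model are just the values from the example:

```python
import os
import subprocess

MODEL = "HuggingFaceH4/zephyr-7b-beta"  # same model as in the LiteLLM config

# Start one single-GPU vLLM server per port; LiteLLM load balances across them.
procs = []
for gpu, port in enumerate([8001, 8002, 8003]):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server",
         "--model", MODEL, "--port", str(port)],
        env=env,
    ))

for p in procs:
    p.wait()
```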
@ishaan-jaff
Doesn't work for me. I need something low-level.
Hey @ishaan-jaff, I believe your demo might work for me. Would it be possible to get in touch to clarify some of the questions? I sent you a connection request on LinkedIn.
> I am trying to create a cluster to match ChatGPT levels of throughput (i.e. roughly 240k tokens/min). […] What would be the best way to increase throughput in this case?
I think tensor parallelism is not the solution to increasing throughput currently because it does not scale well. On a 7B model, it did not help in my testing.
I would say KubeRay should be used. You should be able to design how your GPUs are used with KubeRay, i.e. you can specify how many replicas and how many GPUs each replica should have. That way, you can optimize throughput.
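For what it's worth, a rough sketch of that replica layout using Ray Serve (my own untested sketch, not an official KubeRay recipe; the deployment name, model, and sampling options are placeholders):

```python
from ray import serve
from vllm import LLM, SamplingParams

# One replica per GPU: each holds a full copy of the model (data parallelism),
# instead of sharding one copy across GPUs (tensor parallelism).
@serve.deployment(num_replicas=8, ray_actor_options={"num_gpus": 1})
class CodeLlamaReplica:
    def __init__(self):
        # Placeholder model; swap in whatever fits on a single GPU.
        self.llm = LLM(model="codellama/CodeLlama-7b-hf")
        self.params = SamplingParams(temperature=0.8, max_tokens=256)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        outputs = self.llm.generate([prompt], self.params)
        return {"text": outputs[0].outputs[0].text}

app = CodeLlamaReplica.bind()
serve.run(app)  # keep the driver process alive so the replicas stay up
```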
> We ran some tests on the Phind/Phind-CodeLlama-34B-v2 model using the throughput benchmark script. […]
Hi, I wonder how you got ~1k TPS on CodeLlama-34B. In my case, I get ~100 TPS on 2xA100-80GB. Are there any special options needed to run the vLLM server?
Using 8 A100s (80GB), I find that setting this to 1 or 8 doesn't change performance much, even when using large batches (1000+). Is there a bottleneck somewhere that I am not aware of? The current workaround I'm using is to run 8 different processes with a single GPU each, as you would with `accelerate`, but it's a pretty ugly solution.

More context: this is using the `LLM` class, and StarCoderBase (tried other models too).