vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

How does `tensor_parallel_size` increase throughput? #1435

Closed cassanof closed 7 months ago

cassanof commented 1 year ago

Using 8 A100s (80GB), I find that setting this to 1 or 8 doesn't change performance much, even when using large batches (1000+). Is there a bottleneck somewhere that I am not aware of? The current workaround I'm using is to run 8 different processes with a single GPU each, as you would with accelerate, but it's a pretty ugly solution.

More context: This is using the LLM class, and StarCoderBase (tried other models too).
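For anyone following along, here is a rough sketch of that workaround: one vLLM process pinned to each GPU, with the prompts sharded across the processes. The prompt list, sampling settings, and GPU count are placeholders; only the LLM class usage comes from this thread.

import os
from multiprocessing import get_context

MODEL = "bigcode/starcoderbase"  # placeholder; any model that fits on one GPU
NUM_GPUS = 8


def worker(gpu_id, prompts):
    # Pin this process to one GPU before vLLM/torch touch CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from vllm import LLM, SamplingParams

    llm = LLM(model=MODEL, tensor_parallel_size=1)
    outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
    for out in outputs:
        print(gpu_id, out.outputs[0].text[:60].replace("\n", " "))


if __name__ == "__main__":
    all_prompts = ["def fibonacci(n):"] * 1024  # placeholder workload
    shards = [all_prompts[i::NUM_GPUS] for i in range(NUM_GPUS)]
    ctx = get_context("spawn")  # spawn avoids CUDA-with-fork issues
    procs = [ctx.Process(target=worker, args=(i, shards[i])) for i in range(NUM_GPUS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()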

tweeter0830 commented 1 year ago

I think "tensor_parallel_size" means that the model itself is split across multiple GPUs. Not that you are running "tensor_parallel_size" number of models in parallel. As I understand it, this can introduce a lot of throughput issues because you now need to send data between the GPUs. NVLink is supposed to help, but you it's probably still slower than running the model on a single GPU. As i understand it, people use tensor parallelism only when they need to, for example if the model is too big to fit in a single GPU.

cassanof commented 1 year ago

@tweeter0830 that's the Amazon SageMaker / Megatron terminology; I'm unsure if it applies here.

tweeter0830 commented 1 year ago

I think the terminology is the same based on documentation and the code here: https://github.com/vllm-project/vllm/blob/d189170b6c5a143e493c3f5cb7e8c976e8da62c7/vllm/model_executor/parallel_utils/parallel_state.py#L19

That sounds like model parallelism, not data parallelism.

SinclairCoder commented 1 year ago

So can vLLM help us do data parallelism rather than model parallelism, since my model is only 1B?

nidhishs commented 1 year ago

Tagging @WoosukKwon for a more elegant solution. I am trying to create a cluster to match ChatGPT levels of throughput (i.e. roughly 240k tokens/min). We ran some tests on the Phind/Phind-CodeLlama-34B-v2 model using the throughput benchmark script. Here are the results:

| GPU Configuration | Requests per Second | Tokens per Second | Tokens per Minute |
| --- | --- | --- | --- |
| 1xA100-40GB | OOM | OOM | OOM |
| 2xA100-40GB | 2.75 | 1315.86 | 78,000 |
| 4xA100-40GB | 4.76 | 2277.93 | 136,000 |
| 8xA100-40GB | 4.31 | 2061.69 | 124,000 |
| 1xA100-80GB | 2.33 | 1115.13 | 67,000 |
| 2xA100-80GB | 4.05 | 1936.34 | 116,000 |
| 4xA100-80GB | 4.91 | 2346.69 | 140,000 |
| 8xA100-80GB | 4.40 | 2102.29 | 126,000 |

The 8xA100 performance is worse than 4xA100, which is understandable if we are doing model sharding. What would be the best way to increase throughput in this case?
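For context, these numbers come from vLLM's benchmarks/benchmark_throughput.py. The invocation would be roughly along these lines; flag names and defaults vary between versions, so treat this as a sketch and check --help:

python benchmarks/benchmark_throughput.py \
    --model Phind/Phind-CodeLlama-34B-v2 \
    --tensor-parallel-size 4 \
    --num-prompts 1000 \
    --input-len 512 \
    --output-len 256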

ishaan-jaff commented 11 months ago

@nidhishs @SinclairCoder @cassanof

If you're looking to maximize LLM throughput, LiteLLM now has a router to load-balance requests (I'd love feedback if people on this thread are trying to do this).

Here's the quick start doc: https://docs.litellm.ai/docs/simple_proxy#model-alias

Step 1: Create a config.yaml

model_list:
  - model_name: zephyr-beta
    litellm_params:
        model: huggingface/HuggingFaceH4/zephyr-7b-beta
        api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
        model: huggingface/HuggingFaceH4/zephyr-7b-beta
        api_base: http://0.0.0.0:8002
  - model_name: zephyr-beta
    litellm_params:
        model: huggingface/HuggingFaceH4/zephyr-7b-beta
        api_base: http://0.0.0.0:8003

Step 2: Start the litellm proxy:

litellm --config /path/to/config.yaml

Step 3: Make a request to the LiteLLM proxy:

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "zephyr-beta",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }
'
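In a vLLM setup, the three api_base ports in the config above would presumably each be backed by a separate vLLM OpenAI-compatible server, one per GPU; a rough sketch is below. Note that the exact LiteLLM model prefix for pointing at OpenAI-compatible backends may differ from the huggingface/ prefix shown above, so check the LiteLLM docs.

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceH4/zephyr-7b-beta --port 8001 &
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceH4/zephyr-7b-beta --port 8002 &
CUDA_VISIBLE_DEVICES=2 python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceH4/zephyr-7b-beta --port 8003 &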
cassanof commented 11 months ago

@ishaan-jaff

This doesn't work for me. I need something low-level.

nidhishs commented 11 months ago

Hey @ishaan-jaff, I believe your demo might work for me. Would it be possible to get in touch to clarify some of the questions? I sent you a connection request on LinkedIn.

casper-hansen commented 11 months ago

> The 8xA100 performance is worse than 4xA100, which is understandable if we are doing model sharding. What would be the best way to increase throughput in this case?

I think tensor parallelism is currently not the way to increase throughput because it does not scale well; on a 7B model, it did not help in my testing.

I would say KubeRay should be used. With KubeRay you can design how your GPUs are used, i.e. specify how many replicas to run and how many GPUs each replica gets. That way, you can optimize throughput.

https://github.com/ray-project/kuberay
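To make the replica idea concrete: KubeRay runs Ray clusters on Kubernetes, and the replica layout itself would be expressed with Ray Serve. A rough sketch of independent replicas, each owning one GPU and its own full copy of the model (deployment name, model, and counts are illustrative, not from this thread):

from ray import serve
from vllm import LLM, SamplingParams


@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 1})
class Generator:
    def __init__(self):
        # Each replica loads its own copy of the model on its single GPU
        # (this particular model needs an 80GB card at fp16).
        self.llm = LLM(model="Phind/Phind-CodeLlama-34B-v2", tensor_parallel_size=1)

    def __call__(self, prompt: str) -> str:
        out = self.llm.generate([prompt], SamplingParams(max_tokens=256))
        return out[0].outputs[0].text


app = Generator.bind()
# serve.run(app) starts the deployment; requests sent through the Serve handle
# or HTTP ingress are load-balanced across the replicas.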

matrixssy commented 11 months ago

> The 8xA100 performance is worse than 4xA100, which is understandable if we are doing model sharding. What would be the best way to increase throughput in this case?

Hi, I wonder how you got ~1k TPS with CodeLlama-34B. In my case, I get ~100 TPS on 2xA100-80GB. Are there any special options needed to run the vLLM server?