lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Slower inference with vLLM worker on 4 A100 #2746

Open tacacs1101-debug opened 7 months ago

tacacs1101-debug commented 7 months ago

I deployed WizardLM-70B, a fine-tuned variant of Llama-2-70B, on 4x A100 (80 GB) using the vLLM worker. I noticed much slower responses (more than a minute even for a simple prompt like "Hi") at a throughput of 0.2 tok/sec. Tensor parallelism was set to 4 in this case.

When I deployed the same model on 2x A100 (80 GB), I saw much higher throughput and lower latency, around ~700 tok/sec. Why is this so? I assumed that using 4 A100s would deliver higher throughput and lower latency, since tensor parallelism is 4 in that case and there is also much more GPU KV cache available. Does anyone have an explanation, or am I doing something wrong?

surak commented 7 months ago

While that would be true for games, for LLMs it is not true that more GPUs == more performance. It turns out there is a lot of data movement between different parts of the model during inference, and this goes through PCI Express or NVLink, which is orders of magnitude slower than movement within a GPU's own memory.

Check with smaller models and try the same thing: one GPU, then two, then four. You will see a drastic performance reduction as you scale up the rig.
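
As a side note (this is a suggested check, not something from the thread): the interconnect topology and the scaling behaviour are easy to inspect. nvidia-smi topo -m shows whether GPU pairs are linked via NVLink (NV#) or only via the PCIe/host bridge, and the same vLLM worker can be launched at different tensor-parallel degrees to compare the tok/sec it reports:

# Check how the GPUs are wired together (NVLink vs. PCIe only)
nvidia-smi topo -m

# Launch the same worker at different tensor-parallel degrees, one run at a time,
# and compare the generation throughput reported in the logs
python3 -m fastchat.serve.vllm_worker --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 2
python3 -m fastchat.serve.vllm_worker --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4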

ishaan-jaff commented 7 months ago

@tacacs1101-debug @surak

I'm the maintainer of LiteLLM. We provide an open-source proxy for load balancing vLLM + Azure + OpenAI, and it can process 500+ requests/second.

From the thread it looks like you're trying to maximize throughput (I'd love feedback if that's what you're after).

Here's the quick start:

Doc: https://docs.litellm.ai/docs/simple_proxy#load-balancing---multiple-instances-of-1-model

Step 1: Create a config.yaml

model_list:
  # All three deployments share model_name "gpt-4", so the proxy load-balances
  # requests for "gpt-4" across them. api_key values are left blank here.
  - model_name: gpt-4
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
      api_version: "2023-05-15"
      api_key:
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_key:
      api_base: https://openai-gpt-4-test-v-2.openai.azure.com/
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_key:
      api_base: https://openai-gpt-4-test-v-2.openai.azure.com/

Step 2: Start the litellm proxy:

litellm --config /path/to/config.yaml

Step 3: Make a request to the LiteLLM proxy:

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "gpt-4",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }
'
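
As a small addition (assuming jq is installed; it is not part of the quick start above), the proxy returns an OpenAI-compatible chat completion, so the assistant reply can be pulled out directly:

curl -s 'http://0.0.0.0:8000/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{"model": "gpt-4", "messages": [{"role": "user", "content": "what llm are you"}]}' \
  | jq -r '.choices[0].message.content'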

surak commented 7 months ago

I am not sure this applies here. The OP is talking about local inference on a single compute node with 4 GPUs. Are we talking about the same thing?

infwinston commented 7 months ago

a throughput of 0.2 tok/sec.

@tacacs1101-debug This doesn't seem correct. Can you provide the commands to reproduce it?

tacacs1101-debug commented 7 months ago

@surak Absolutely correct, I am talking about local inference on a single compute node with 4 A100 (80 GB). Our throughput is good even on 2 A100s, but my assumption was that with 4 A100s we could increase the degree of tensor parallelism to 4, and that would translate into some reduction in latency. I have also noted that with 3 A100s there is no difference in throughput or latency, and the third GPU is almost unutilized, since in that case I have to forcibly set the degree of tensor parallelism to 2. I understand that increasing the number of GPUs doesn't translate into a proportional increase in performance because of the inter-GPU communication overhead, but it should not drop as drastically as from ~500 tok/sec to 0.2 tok/sec.
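
A possible explanation for the 3-GPU case (my reading of the vLLM constraint, not confirmed in this thread): vLLM requires the model's attention head count to be divisible by the tensor-parallel size, and Llama-2-70B has 64 attention heads, so only degrees such as 1, 2, 4, or 8 are accepted. With 3 cards the worker therefore has to run at TP=2 and the third GPU sits idle:

# Llama-2-70B has 64 attention heads, so the tensor-parallel size must divide 64.
# A degree of 3 is rejected; with 3 cards only 2 can be used for tensor parallelism.
python3 -m fastchat.serve.vllm_worker --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 2
python3 -m fastchat.serve.vllm_worker --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4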

tacacs1101-debug commented 7 months ago

@infwinston I am using a Helm chart for deployment with a custom Docker image, but the command is similar to:

python3 -m fastchat.serve.cli --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4

infwinston commented 7 months ago

This command does not use vLLM, so it will be slow.

python3 -m fastchat.serve.cli --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4

You have to use the vLLM worker for better tensor-parallel speed. See https://github.com/lm-sys/FastChat/blob/main/docs/vllm_integration.md

Or did I misunderstand?
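
For reference, a minimal sketch of the serving layout described in the linked vLLM integration doc (the usual controller + vLLM worker + OpenAI-compatible API server setup; host and port are illustrative):

# 1. Start the controller
python3 -m fastchat.serve.controller

# 2. Start the vLLM worker with tensor parallelism across 4 GPUs
python3 -m fastchat.serve.vllm_worker --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4

# 3. Expose an OpenAI-compatible API
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000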

tacacs1101-debug commented 7 months ago

@infwinston Actually I mentioned the wrong command. The corresponding command is python3 -m fastchat.serve.vllm_worker --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4

ruifengma commented 3 months ago

@infwinston Actually I mentioned the wrong command. The corresponding command is python3 -m fastchat.serve.vllm_worker --model-path WizardLM/WizardLM-70B-V1.0 --num-gpus 4

How's the inference speed?