vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: Talk about the model parallelism #8898

Open baifanxxx opened 8 hours ago

baifanxxx commented 8 hours ago

Proposal to improve performance

No response

Report of performance regression

No response

Misc discussion on performance

Hi,

Thank you for your contribution to the LLM community. I have a question about model parallelism. Suppose one part of a model is loaded with Tensor Parallelism (TP) and the other part is not. For example, TP is applied to the linear layers in the FFN, but not to other layers, such as the attention block. When the model is loaded onto the GPUs, how are the weights without TP distributed across multiple GPUs? Are they replicated on every GPU, or loaded onto a single GPU only, such as GPU:0?

I'm not sure how this process works in vLLM. I would appreciate your help in answering this question.

Best regards, BAI Fan

Your current environment (if you think it is necessary)

The output of `python collect_env.py`


noooop commented 7 hours ago

There may be some terminology that is not very accurate, but the process is roughly like this.

When executing TP:

  1. First, QKVParallelLinear (a ColumnParallelLinear) is executed. Its weight is sharded along the output dimension, so each GPU computes q, k, and v for only its own subset of attention heads.
  2. Attention then runs on each GPU over its local heads, with no synchronization needed.
  3. Finally, o_proj, a RowParallelLinear, combines the partial results from the different GPUs via an all-reduce.

So essentially every operation runs under TP and gets multi-GPU acceleration.
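To make the column-parallel then row-parallel pattern concrete, here is a minimal single-process sketch (an illustration only, not vLLM's actual code) where explicit weight slices stand in for the per-rank shards and a plain Python sum stands in for the all-reduce:

import torch

torch.manual_seed(0)
hidden, inter, tp_size = 6, 16, 2
x = torch.randn(4, hidden)        # input, replicated on every rank
w1 = torch.randn(inter, hidden)   # column-parallel weight: split along the output dim
w2 = torch.randn(hidden, inter)   # row-parallel weight: split along the input dim

# Reference: the full computation on a single device.
ref = torch.relu(x @ w1.t()) @ w2.t()

# Simulated TP: each "rank" holds one shard of w1 and the matching shard of w2.
partials = []
for rank in range(tp_size):
    shard = slice(rank * inter // tp_size, (rank + 1) * inter // tp_size)
    local = torch.relu(x @ w1[shard].t())      # local compute, no communication
    partials.append(local @ w2[:, shard].t())  # partial result on this rank

out = sum(partials)  # in real TP this sum is the all-reduce done by RowParallelLinear
print(torch.allclose(out, ref, atol=1e-5))     # True

The same structure applies to attention: the per-head attention computation plays the role of the local step between the column-parallel QKV projection and the row-parallel o_proj, so only one all-reduce per block is needed.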

baifanxxx commented 7 hours ago

Hi,

Thank you for your comments. I understand the case where both the attention and the FFN use TP. What I am asking about is the case where only the FFN uses TP and the attention does not, i.e., some weights in the model are tensor-parallel and others are not. How are the weights without TP allocated across multiple GPUs? In the vLLM framework, are layers without TP replicated on every GPU, or placed on only one GPU?

You can see the example code here,

from typing import Optional

import torch
import torch.nn.functional as F
from torch import nn
from transformers import PretrainedConfig

from vllm.model_executor.layers.activation import get_act_fn
from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.model_executor.layers.linear import (ColumnParallelLinear,
                                               RowParallelLinear)
from vllm.model_executor.layers.quantization import QuantizationConfig


class InternAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(self, config: PretrainedConfig):
        super().__init__()
        self.config = config
        self.embed_dim = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.embed_dim // self.num_heads
        if self.head_dim * self.num_heads != self.embed_dim:
            raise ValueError(
                f'embed_dim must be divisible by num_heads '
                f'(got `embed_dim`: {self.embed_dim} and `num_heads`:'
                f' {self.num_heads}).')

        self.scale = self.head_dim**-0.5
        self.qkv = nn.Linear(self.embed_dim,
                             3 * self.embed_dim,
                             bias=config.qkv_bias)

        self.qk_normalization = config.qk_normalization

        if self.qk_normalization:
            self.q_norm = RMSNorm(self.embed_dim, eps=config.layer_norm_eps)
            self.k_norm = RMSNorm(self.embed_dim, eps=config.layer_norm_eps)

        self.proj = nn.Linear(self.embed_dim, self.embed_dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads,
                                  C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv.unbind(0)

        if self.qk_normalization:
            B_, H_, N_, D_ = q.shape
            q = self.q_norm.forward_native(q.transpose(1, 2).flatten(
                -2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
            k = self.k_norm.forward_native(k.transpose(1, 2).flatten(
                -2, -1)).view(B_, N_, H_, D_).transpose(1, 2)

        x = F.scaled_dot_product_attention(q, k, v, scale=self.scale)
        x = x.transpose(1, 2).reshape(B, N, C)

        x = self.proj(x)
        return x

class InternMLP(nn.Module):

    def __init__(self,
                 config: PretrainedConfig,
                 quant_config: Optional[QuantizationConfig] = None):
        super().__init__()
        self.config = config
        self.activation_fn = get_act_fn(config.hidden_act)
        self.fc1 = ColumnParallelLinear(config.hidden_size,
                                        config.intermediate_size,
                                        bias=True,
                                        quant_config=quant_config)
        self.fc2 = RowParallelLinear(config.intermediate_size,
                                     config.hidden_size,
                                     bias=True,
                                     quant_config=quant_config)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states, _ = self.fc1(hidden_states)
        hidden_states = self.activation_fn(hidden_states)
        hidden_states, _ = self.fc2(hidden_states)

        return hidden_states

In this model, you can see that InternAttention does not use TP; only InternMLP does, via ColumnParallelLinear and RowParallelLinear. When the InternVisionEncoder (which contains both InternAttention and InternMLP) is loaded on multiple GPUs, how is the non-TP InternAttention distributed? Are its parameters copied to every GPU, or loaded on only one GPU, such as GPU:0?

noooop commented 7 hours ago

nn.Linear needs to be replaced with a ParallelLinear layer to get TP acceleration.

F.scaled_dot_product_attention needs to be replaced with vLLM's attention to get TP acceleration.

See "Optional: Implement tensor parallelism and quantization support" in the Adding a New Model docs.

You could submit a PR to optimize the performance of this part.
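For reference, here is a rough sketch (untested; the constructor arguments are assumed from the layers already used in InternMLP and the QKVParallelLinear mentioned above) of what a TP-aware replacement for InternAttention's projections could look like:

from torch import nn
from transformers import PretrainedConfig

from vllm.distributed import get_tensor_model_parallel_world_size
from vllm.model_executor.layers.linear import (QKVParallelLinear,
                                               RowParallelLinear)


class InternParallelAttention(nn.Module):
    """Hypothetical TP-aware variant of InternAttention (sketch only)."""

    def __init__(self, config: PretrainedConfig, quant_config=None):
        super().__init__()
        self.embed_dim = config.hidden_size
        self.total_num_heads = config.num_attention_heads
        self.head_dim = self.embed_dim // self.total_num_heads
        tp_size = get_tensor_model_parallel_world_size()
        # Each TP rank keeps only num_attention_heads // tp_size heads.
        self.num_heads_per_rank = self.total_num_heads // tp_size

        # Column-parallel fused QKV: the weight is sharded along the output
        # dimension, so every rank projects the full input onto its own heads.
        self.qkv = QKVParallelLinear(self.embed_dim,
                                     self.head_dim,
                                     self.total_num_heads,
                                     bias=config.qkv_bias,
                                     quant_config=quant_config)
        # Row-parallel output projection: sharded along the input dimension;
        # the partial outputs are combined with an all-reduce.
        self.proj = RowParallelLinear(self.embed_dim,
                                      self.embed_dim,
                                      quant_config=quant_config)

The forward pass would then have to reshape with the per-rank head count rather than the global one, since each rank only holds its own slice of q, k, and v.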

baifanxxx commented 6 hours ago

You misunderstood what I meant. I just want to discuss this phenomenon: in my experiments, applying TP to every layer of the model is not always best. Sometimes applying TP to only part of the layers gives better inference speed. What I would like to know is how the layers without TP are allocated across multiple GPUs. This depends on the vLLM framework, and I don't know how vLLM handles this situation.

noooop commented 1 hour ago

There is a driver_worker in the vLLM executor, which may be what you are talking about.

https://github.com/vllm-project/vllm/blob/a9b15c606fea67a072416ea0ea115261a2756058/vllm/executor/gpu_executor.py#L38C9-L38C51