vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: how to get the output embedding for a text generation model using vllm #5266

Open Apricot1225 opened 3 months ago

Apricot1225 commented 3 months ago

Your current environment

Referring to issue #5181, "The Offline Inference Embedding Example Fails": the method LLM.encode() only works for embedding models. Is there any way to get the output embedding for a text generation model ("XXXForCausalLM" in the model config) using vLLM?

How would you like to use vllm

I want to get the output embedding for a text generation model ("XXXForCausalLM" in the model config). I don't know how to integrate this with vLLM.

Elanmarkowitz commented 2 months ago

This is of interest to me as well. I would like to be able to get the last hidden state of the last token in the sequence, or, ideally, to do this while generating text (get the last hidden state before the EOS token is generated, or all last-layer hidden states).

Is this possible at the moment?

Apricot1225 commented 2 months ago

This is of interest to me as well. I would like to be able to get the last hidden state of the last token in the sequence, or, ideally, to do this while generating text (get the last hidden state before the EOS token is generated, or all last-layer hidden states).

Is this possible at the moment?

I have put some effort into this and I think it is possible. Take MiniCPM as an example.

In vllm/model_executor/models/minicpm.py, at L375 I can get the embedding of the input tokens with shape [input_token_length, hidden_size]. At L387 I can get the output embedding of each output token after 40 layers (where 40 is the number of layers in MiniCPM), with shape [1, hidden_size].

Then I packed all the output token embeddings into an [output_token_length, 1, hidden_size] tensor and added it, step by step, to the return value of the LLM.generate() function in vllm/entrypoints/llm.py, so that the embeddings come back along with the generated text. I hope this output is right and of use.
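A minimal sketch of that packing step, assuming each decode step yields a [1, hidden_size] last-layer hidden state for the newly generated token (the tensors here are random placeholders and the hidden size is only illustrative):

import torch

hidden_size = 2304        # illustrative value; use the model's actual hidden size
num_output_tokens = 8     # number of generated tokens in this example

# Placeholders standing in for the [1, hidden_size] hidden states captured in the
# model file at each decode step.
per_step_hidden = [torch.randn(1, hidden_size) for _ in range(num_output_tokens)]

# Pack them into an [output_token_length, 1, hidden_size] tensor that can be
# returned alongside the generated text.
packed = torch.stack(per_step_hidden, dim=0)
print(packed.shape)  # torch.Size([8, 1, 2304])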

HaomingX commented 2 months ago

Hello, could you share your code here or explain the details? Also, since model execution is asynchronous, is it correct to only modify the hidden_state returned in the model definition file? Thanks!

qingquansong commented 1 month ago

One way I'm doing this is to register a new model (based on the original causal model, changing nothing on the model side, only the config) as an embedding model. Using TinyLlama as an example:

1) Change the architecture name in TinyLlama-1.1B-Chat-v1.0/config.json to a new name: { "architectures": [ "TinyLlamaEmbModel" ], …… }

2) Implement a script for TinyLlama similar to the E5-Mistral embedding model:

from typing import Iterable, List, Optional, Tuple

import torch
from torch import nn
from vllm.attention import AttentionMetadata
from vllm.model_executor.layers.pooler import Pooler, PoolingType
from vllm.model_executor.model_loader.weight_utils import default_weight_loader
from vllm.model_executor.models.llama import LlamaModel
from vllm.model_executor.pooling_metadata import PoolingMetadata
from vllm.sequence import PoolerOutput

class TinyLlamaEmbeddingModel(nn.Module):
    """A model that uses Llama with additional embedding functionalities.

    This class encapsulates the LlamaModel and provides an interface for
    embedding operations and customized pooling functions.

    Attributes:
        model: An instance of LlamaModel used for forward operations.
        _pooler: An instance of Pooler used for pooling operations.
    """

    def __init__(
        self,
        **kwargs,
    ) -> None:
        super().__init__()
        self.model = LlamaModel(**kwargs)
        self._pooler = Pooler(pooling_type=PoolingType.LAST, normalize=True)

    def forward(
        self,
        input_ids: Optional[torch.Tensor],
        positions: torch.Tensor,
        kv_caches: List[torch.Tensor],
        attn_metadata: AttentionMetadata,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        return self.model.forward(
            input_ids, positions, kv_caches, attn_metadata, inputs_embeds
        )

    def pooler(
        self,
        hidden_states: torch.Tensor,
        pooling_metadata: PoolingMetadata,
    ) -> Optional[PoolerOutput]:
        return self._pooler(hidden_states, pooling_metadata)

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        stacked_params_mapping = [
            # (param_name, shard_name, shard_id)
            ("qkv_proj", "q_proj", "q"),
            ("qkv_proj", "k_proj", "k"),
            ("qkv_proj", "v_proj", "v"),
            ("gate_up_proj", "gate_proj", 0),
            ("gate_up_proj", "up_proj", 1),
        ]
        params_dict = dict(self.model.named_parameters())
        for name, loaded_weight in weights:
            name = name.replace("model.", "")
            if "rotary_emb.inv_freq" in name:
                continue
            if "rotary_emb.cos_cached" in name or "rotary_emb.sin_cached" in name:
                # Models trained using ColossalAI may include these tensors in
                # the checkpoint. Skip them.
                continue
            for param_name, weight_name, shard_id in stacked_params_mapping:
                if weight_name not in name:
                    continue
                name = name.replace(weight_name, param_name)
                # Skip loading extra bias for GPTQ models.
                if name.endswith(".bias") and name not in params_dict:
                    continue
                param = params_dict[name]
                weight_loader = param.weight_loader
                weight_loader(param, loaded_weight, shard_id)
                break
            else:
                # Skip loading extra bias for GPTQ models.
                if (
                    name.endswith(".bias")
                    and name not in params_dict
                    or name == "lm_head.weight"
                ):
                    continue
                param = params_dict[name]
                weight_loader = getattr(param, "weight_loader", default_weight_loader)
                weight_loader(param, loaded_weight)

3) Register the model: it needs to be registered in both _OOT_MODELS and _EMBEDDING_MODELS, otherwise there will be issues when selecting the model runner.

from vllm import ModelRegistry
from tiny_llama_embedding import TinyLlamaEmbeddingModel

# Register the new architecture name as an out-of-tree model.
ModelRegistry.register_model("TinyLlamaEmbModel", TinyLlamaEmbeddingModel)

# Also add it to the embedding-model table so the embedding model runner is selected.
from vllm.model_executor.models import _EMBEDDING_MODELS
_EMBEDDING_MODELS["TinyLlamaEmbModel"] = TinyLlamaEmbeddingModel
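Once the registration code above has run in the same process, a hedged usage sketch looks like this (the local model path is an assumption; llm.encode() is the embedding entry point mentioned at the top of the issue):

from vllm import LLM

# Local TinyLlama copy whose config.json now lists "TinyLlamaEmbModel"
# (path is an assumption; adjust to wherever the edited model lives).
llm = LLM(model="TinyLlama-1.1B-Chat-v1.0")

outputs = llm.encode(["Hello, world!"])
for output in outputs:
    embedding = output.outputs.embedding  # normalized last-token vector of length hidden_size
    print(len(embedding))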

cc @simon-mo @robertgshaw2-neuralmagic The model registration part seems worth extending, since currently only causal models can be registered.

arynoot commented 1 month ago

One way I'm doing this is to register a new model (based on the original causal model, changing nothing on the model side, only the config) as an embedding model. Using TinyLlama as an example:

  1. Change the architecture name in TinyLlama-1.1B-Chat-v1.0/config.json to a new name: { "architectures": [ "TinyLlamaEmbModel" ], …… }

Does this mean you need to have the Hugging Face model locally with a changed config?

qingquansong commented 1 month ago

@arynoot Yes, just a local copy for registering the new model as an embedding model (to differentiate it from the original causal model). The architecture name in config.json should match the name used in the registration:

from vllm import ModelRegistry
from tiny_llama_embedding import TinyLlamaEmbeddingModel
ModelRegistry.register_model("TinyLlamaEmbModel", TinyLlamaEmbeddingModel)

from vllm.model_executor.models import _EMBEDDING_MODELS
_EMBEDDING_MODELS["TinyLlamaEmbModel"] = TinyLlamaEmbeddingModel
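For reference, a hedged sketch of the local-config step mentioned above (the repo id, target directory, and use of huggingface_hub are assumptions; any writable local copy of the checkpoint works):

import json
from huggingface_hub import snapshot_download

# Fetch a writable local copy of the checkpoint (assumed repo id and target dir).
local_dir = snapshot_download(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    local_dir="TinyLlama-1.1B-Chat-v1.0",
)

config_path = f"{local_dir}/config.json"
with open(config_path) as f:
    config = json.load(f)

# Rename the architecture so vLLM resolves it to the registered embedding model.
config["architectures"] = ["TinyLlamaEmbModel"]

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)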