vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Is there any way to hook features inside vision-language model? #7795

Open minuenergy opened 2 months ago

minuenergy commented 2 months ago

Your current environment

from vllm import LLM, SamplingParams
import os
import PIL.Image

# Initialize the LLaVA-1.5 model
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
print(llm)

embed_last_hook = Hook(model.language_model.model.norm)  # to save the embedding

# Define the prompts and images
base_p = '../../../data/detect/coco/train2017'
img_p1 = os.path.join(base_p, '000000265292.jpg')
img_p2 = os.path.join(base_p, '000000318124.jpg')
img_p3 = os.path.join(base_p, '000000370121.jpg')

prompts = [
    {"prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
     "multi_modal_data": {"image": PIL.Image.open(img_p1)}},
    {"prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
     "multi_modal_data": {"image": PIL.Image.open(img_p2)}},
    {"prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
     "multi_modal_data": {"image": PIL.Image.open(img_p3)}},
]

# Define sampling parameters
sampling_params = SamplingParams(temperature=0.0, top_p=1.0)

# Generate outputs
outputs = llm.generate(prompts, sampling_params=sampling_params)

I want to hook some features once the LLM's forward pass has finished. How can I get at the features inside?

How would you like to use vllm

I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.

jeejeelee commented 2 months ago

Perhaps you can try nn.Module's hook
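(For reference, the generic nn.Module forward-hook pattern looks roughly like the sketch below; the toy nn.Linear and its shapes are purely illustrative, not tied to LLaVA or vLLM.)

import torch
import torch.nn as nn

layer = nn.Linear(8, 4)  # stand-in for the layer you want to observe
captured = {}

def hook_fn(module, inputs, output):
    # Called every time layer.forward runs; stash the output tensor
    captured["output"] = output.detach()

handle = layer.register_forward_hook(hook_fn)

x = torch.randn(2, 8)
_ = layer(x)
print(captured["output"].shape)  # torch.Size([2, 4])

handle.remove()  # detach the hook when it is no longer needed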

minuenergy commented 2 months ago
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

class Hook:
    def __init__(self, module):
        self.hook = module.register_forward_hook(self.hook_fn)
        self.output = None

    def hook_fn(self, module, input, output):
        # Save the module's output on every forward call
        self.output = output

    def close(self):
        self.hook.remove()

def load_captioning_model(model_id):
    quantization_config = BitsAndBytesConfig(load_in_8bit=True)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, low_cpu_mem_usage=True, quantization_config=quantization_config)
    processor = AutoProcessor.from_pretrained(model_id, pad_token="<pad>")
    return processor, model

model_id = "llava-hf/llava-1.5-7b-hf"
processor, model = load_captioning_model(model_id)
embed_last_hook = Hook(model.language_model.model.norm)  # capture the output of the final norm layer
embed_last_hook.output  # (1, 4096) after generation has run
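(The hook only has something to report after a forward pass has actually run; a hypothetical generation call along these lines is assumed before reading embed_last_hook.output — the image path and generation settings here are illustrative only.)

import PIL.Image

image = PIL.Image.open("000000265292.jpg")  # any local image
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
_ = model.generate(**inputs, max_new_tokens=32)
print(embed_last_hook.output)  # hidden states captured at the final norm layer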

When I use this code (with HuggingFace), I get a (1, 4096) embedding.

But I get a different result with vLLM (below): torch.Size([596, 4096]).

from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
# Reach into vLLM internals to hook the language model's final norm layer
D_HOOK = Hook(llm.llm_engine.model_executor.driver_worker.model_runner.model.language_model.norm)
D_HOOK.output  # (596, 4096)
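(One way to poke at that tensor, assuming D_HOOK.output holds the flattened prefill activations for a single sequence, one row per prompt token — image placeholder tokens plus text tokens. This is only a sketch; it is not necessarily the same quantity as the HuggingFace (1, 4096) result.)

flat = D_HOOK.output         # e.g. torch.Size([596, 4096]): one row per prompt token
last_token_embed = flat[-1]  # hidden state of the final prompt token, shape (4096,)
print(flat.shape, last_token_embed.shape)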

I want to get the same features as with HuggingFace. What should I do, and what is the difference between the two?

minuenergy commented 2 months ago

HuggingFace model structure (screenshot)

vLLM model structure (screenshot)