open-compass / VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks
https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Apache License 2.0

How to evaluate my own model #413

Closed · lixu6-alt closed this 2 months ago

lixu6-alt commented 2 months ago

Hi, I am wondering how to evaluate a new model that I developed myself using VLMEvalKit. The README does mention that I only need to create a function called inner_function() (maybe.. can't remember the exact name), but it does not provide any instructions on how to proceed. Can anybody help? Thanks.

BrenchCC commented 2 months ago

The simplest way is to first define a model class that initializes the model, tokenizer, and chat template of your model. Then define a `def generate_inner(self, message, dataset=None):` method to output the answer. Note that this function handles a single question.
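
A minimal sketch of what such a wrapper can look like (the class name, model path, and exact `BaseModel` import path below are placeholders, and the model/processor choices are assumptions; check the existing wrappers under vlmeval/vlm/ for the conventions in your version):

    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor
    from vlmeval.vlm.base import BaseModel  # exact import path may differ across versions


    class MyVLM(BaseModel):

        INSTALL_REQ = False
        INTERLEAVE = False

        def __init__(self, model_path='my-org/my-vlm', **kwargs):
            # Initialize the model, processor/tokenizer, and generation kwargs once.
            self.model = AutoModelForCausalLM.from_pretrained(
                model_path, torch_dtype=torch.float16, device_map='cuda')
            self.processor = AutoProcessor.from_pretrained(model_path)
            self.kwargs = dict(max_new_tokens=512, do_sample=False)
            self.kwargs.update(kwargs)

        def generate_inner(self, message, dataset=None):
            # `message` is a list of segments for ONE question, e.g.
            # [{'type': 'image', 'value': '/path/to/img.jpg'},
            #  {'type': 'text',  'value': 'Describe the image.'}]
            prompt, images = '', []
            for seg in message:
                if seg['type'] == 'image':
                    images.append(Image.open(seg['value']).convert('RGB'))
                else:
                    prompt += seg['value']
            inputs = self.processor(text=prompt, images=images if images else None,
                                    return_tensors='pt').to('cuda', torch.float16)
            output = self.model.generate(**inputs, **self.kwargs)
            return self.processor.decode(output[0], skip_special_tokens=True)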

lixu6-alt commented 2 months ago

> The simplest way is to first define a model class that initializes the model, tokenizer, and chat template of your model. Then define a `def generate_inner(self, message, dataset=None):` method to output the answer. Note that this function handles a single question.

Thanks a lot for the timely response. Your explanation makes sense to me, but I am still wondering if there is any document in the repository that explains how to implement the generate_inner() function and what the inputs and outputs are expected to be.

BrenchCC commented 2 months ago

For example, if you are using a model based on the LLaVA-NeXT architecture, you need to make sure your input message is converted into a conversation of the following form:

        conversation = [
            {
                'role': 'user',
                'content': content, 
            }
        ]
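
Here `content` is the interleaved list of text and image entries expected by the chat template. A rough sketch of building it from a VLMEvalKit `message` (assuming the Hugging Face LLaVA-NeXT processor's `{'type': 'text'/'image'}` convention):

    content = []
    for seg in message:
        if seg['type'] == 'image':
            # Image placeholder only; the actual pixel data is passed to the
            # processor separately via the `images` argument.
            content.append({'type': 'image'})
        else:
            content.append({'type': 'text', 'text': seg['value']})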

The input is passed into the model to generate the answer roughly in the following way:

        prompt = self.processor.apply_chat_template(conversation, add_generation_prompt=True)
        inputs = self.processor(prompt, images, return_tensors='pt').to('cuda', torch.float16)
        output = self.model.generate(**inputs, **self.kwargs)
        answer = self.processor.decode(output[0], skip_special_tokens=True)
        answer = self.output_process(answer)
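
The output_process step is just post-processing of the decoded string. A rough sketch of what such a helper might do (the markers below are assumptions based on a LLaVA-NeXT style chat template; the actual helper lives in the model file):

    def output_process(self, answer):
        # Strip the echoed prompt / role markers so that only the newly
        # generated answer remains; adjust the markers to your chat template.
        for marker in ['[/INST]', 'ASSISTANT:']:
            if marker in answer:
                answer = answer.split(marker)[-1]
        return answer.strip()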

Everything depends on your model architecture. You can refer to vlmeval/vlm/mantis.py, which I created; it also shows how to build prompts, remove identifier tokens, and insert image placeholders according to the number of pictures.

BrenchCC commented 2 months ago

By the way, `images` is generally a list of PIL images decoded in RGB mode.
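
For example, assuming each image segment of `message` carries a file path in its 'value' field, they can be loaded like this:

    from PIL import Image

    # Collect the image segments of the message as RGB PIL images
    images = [Image.open(seg['value']).convert('RGB')
              for seg in message if seg['type'] == 'image']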