shikiw / OPERA

[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
MIT License

can you add support for other models like QWEN-VL in the future? #9

Closed chuangzhidan closed 5 months ago

chuangzhidan commented 5 months ago

Can you add support for other models like QWEN-VL in the future? Because it supports multiple languages :)

shikiw commented 5 months ago

Hi,

Actually, you can follow the "TL;DR" section to use OPERA on Qwen-VL. You just need to install the "transformers" package we provide and follow the instructions in the "TL;DR" section.

Feel free to ask me if you have any questions :)

chuangzhidan commented 5 months ago

Thank you for your prompt reply. I read the "TL;DR" section you mentioned, and the issues section too, yet when I got to the Qwen-VL repo, I still did not know how to apply this awesome tool to my trained model. Can I kindly bother you to show us a step-by-step example of how it's done? Looking forward to your reply!

chuangzhidan commented 5 months ago

It would be much appreciated if you could reach out to help.

shikiw commented 5 months ago

Sorry for the delay in replying. Here are the steps you can follow, based on your Qwen environment:

  1. Check the transformers package in Qwen's Anaconda environment and find the file at transformers/generation/utils.py (remember to back up a copy).

  2. Replace this file with the utils.py file from our repo at transformers-4.29.2/src/transformers/generation/utils.py.

  3. Now you can add opera_decoding=True to the model.generate() call, as it is used in our "TL;DR" section. You may need to locate where this generate function is called in Qwen's codebase; if you use their model.chat function, you will find that generate is called inside model.chat (see the sketch below).
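
For illustration only, here is a rough, untested sketch of what the modified call inside Qwen's chat method could look like. The surrounding arguments are Qwen's own; the key_position dict and any other OPERA hyperparameters should be taken from the "TL;DR" section:

```python
# Untested sketch: the self.generate call inside Qwen's model.chat, after the
# patched utils.py is installed. key_position is the dict described in the
# "TL;DR" section and must be defined before this call.
outputs = self.generate(
    input_ids,
    stop_words_ids=stop_words_ids,
    return_dict_in_generate=False,
    generation_config=generation_config,
    opera_decoding=True,        # switch on OPERA decoding
    key_position=key_position,  # image/prompt token indices (see "TL;DR")
    **kwargs,
)
```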

I hope these steps can help you well :)

chuangzhidan commented 5 months ago

Thank you, I finished step 1 and step 2 as you described. And now, yes, I'm using their model.chat function :) In modeling_qwen.py, I found:
```python
class QWenLMHeadModel(QWenPreTrainedModel):
    # ...
    def chat(
        self,
        tokenizer: PreTrainedTokenizer,
        # ... (other arguments omitted)
    ):
        # ...
        input_ids = torch.tensor([context_tokens]).to(self.device)
        outputs = self.generate(
            input_ids,
            stop_words_ids=stop_words_ids,
            return_dict_in_generate=False,
            generation_config=generation_config,
            **kwargs,
        )

        response = decode_tokens(
            outputs[0],
            tokenizer,
            raw_text_len=len(raw_text),
            context_length=len(context_tokens),
            chat_format=generation_config.chat_format,
            verbose=False,
            errors='replace'
        )

        if append_history:
            history.append((query, response))

        return response, history
```

This is where the generate function is used, right?

So, the final problem is: how and where do I insert the following? If you are kind and patient enough to help ^_^

```python
START_INDEX_of_IMAGE_TOKENS = <the start index of the image tokens>
END_INDEX_of_IMAGE_TOKENS = <the end index of the image tokens>
NUM_of_TOKENS_IN_THE_PROMPT = <the total number of tokens in the user prompt (including image tokens)>

key_position = {
    "image_start": START_INDEX_of_IMAGE_TOKENS,
    "image_end": END_INDEX_of_IMAGE_TOKENS,
    "response_start": NUM_of_TOKENS_IN_THE_PROMPT,
}
```

chuangzhidan commented 5 months ago

Maybe the tricky question is: when I provide the model with an image and a query, how should I calculate and pass these three parameters to the chat function?

shikiw commented 5 months ago

These three parameters should be defined before calling the self.generate function.

Usually, you can obtain these three parameters from input_ids. You can check it, along with the other hyperparameters, in Qwen-VL; i.e., you can obtain them inside the chat function rather than calculating them outside and passing them in.
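
For example, here is a rough, untested sketch of how this might look for Qwen-VL, assuming the image span is marked by the <img>/</img> token ids stored at config.visual['image_start_id'] and image_start_id + 1 (please double-check these against your own checkpoint):

```python
# Untested sketch: build key_position from input_ids inside Qwen's chat,
# right before calling self.generate. The image-marker token ids below are
# an assumption about Qwen-VL and should be verified for your checkpoint.
image_start_id = self.config.visual['image_start_id']            # id of <img>
img_start = torch.where(input_ids[0] == image_start_id)[0]       # positions of <img>
img_end = torch.where(input_ids[0] == image_start_id + 1)[0]     # positions of </img>

key_position = {
    "image_start": int(img_start[0]),
    "image_end": int(img_end[0]),
    "response_start": input_ids.shape[1],  # the response starts right after the prompt
}
```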

chuangzhidan commented 5 months ago

I tried, but sadly the output doesn't seem to improve much, and I don't know what's wrong. I appreciate your patience; you are most kind.

shikiw commented 5 months ago

Thanks for your valuable feedback! We will try our best to fix it in the next version.

zhongtao93 commented 4 months ago

@chuangzhidan Hi, can you show some code for how you calculate key_position? I am using code like this, but I get a runtime bug:

```python
bos_pos = torch.where(input_ids == model.config.visual['image_start_id'])
eos_pos = torch.where(input_ids == model.config.visual['image_start_id'] + 1)
stop_pos = torch.where(input_ids == 151644)
bos_pos, eos_pos, stop_pos = int(bos_pos[1][0]), int(eos_pos[1][0]), int(stop_pos[1][-1])

key_position = {
    "image_start": bos_pos,
    "image_end": eos_pos,
    "response_start": stop_pos,
}
```