pipilurj / bootstrapped-preference-optimization-BPO

code for "Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization"
Apache License 2.0

Questions about negative data. #6

Closed. Ivesfu closed this issue 2 months ago.

Ivesfu commented 3 months ago

Thank you for your great work! I have a few questions regarding the data you open-sourced on HuggingFace.

  1. I understand that the data with type "mllm-hal" is constructed using the "image-weaken" method mentioned in your paper. Why are there only around 20k examples of this type, while the other type, "llm-hal", amounts to about 160k? What explains such a large difference between the two?
  2. As I understand it, you randomly selected around 160k data points from the ShareGPT-V, LLaVAR, and LLaVA-Instruct datasets to construct the negative data. If we denote the data used to construct llm-hal as A and the data used to construct mllm-hal as B, does B overlap completely with A? Based on my tests, the numbers do not match: when I run the following code, the output is 167710 and 168180, which suggests the two parts do not overlap completely (I also sketch a more direct overlap check after the snippet):
import json

# file_path points to the data file released on HuggingFace
SSS = set()
with open(file_path, 'r') as f:
    datas = json.load(f)
    for data in datas:
        image = data['image']
        prompt = data['prompt']
        for cc in data['completions']:
            if cc['type'] == 'gt':
                # deduplicate by (ground-truth response, prompt, image)
                SSS.add(cc['response'] + ' ' + prompt + ' ' + image)

print(57906 + 55445 + 54359)  # sample sizes of the three datasets in Table 2 of your paper
print(len(SSS))
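
For reference, here is a minimal sketch of a more direct overlap check. I am assuming the negative completions are tagged with type values 'llm-hal' and 'mllm-hal'; the field names are my guess from the snippet above and may differ in the released data:

import json

# Collect (prompt, image) keys separately for each negative-data type,
# assuming completions are tagged 'llm-hal' (error injection) and
# 'mllm-hal' (image weakening).
keys = {'llm-hal': set(), 'mllm-hal': set()}
with open(file_path, 'r') as f:
    for data in json.load(f):
        key = (data['prompt'], data['image'])
        for cc in data['completions']:
            if cc['type'] in keys:
                keys[cc['type']].add(key)

a, b = keys['llm-hal'], keys['mllm-hal']
print(len(a), len(b))
print('B is a subset of A:', b <= a)
print('overlap size:', len(a & b))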

Could you please provide some clarification on these points? I look forward to your response. Thank you for your time and assistance.

pipilurj commented 3 months ago

Hi! Thanks a lot for your interest in our work!

The scale difference between image weakening and error injection is because, at the time of this work, there was no acceleration package for MLLM inference, which made the generation of image-weakened responses much slower. On the other hand, vLLM can greatly accelerate the generation of error-injection responses, since those are produced entirely by an LLM. Due to computational limitations, we chose to scale up error injection for the negative responses.
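
For illustration, error injection reduces to plain batched text-only decoding, which is exactly what vLLM accelerates. A minimal sketch, where the model name and prompts are placeholders rather than the exact setup used in the paper:

from vllm import LLM, SamplingParams

# Placeholder model and prompts; error injection is text-in/text-out,
# so the whole batch can be served by vLLM without any image encoder.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Rewrite the following caption and deliberately inject a factual error: ...",
    # ... one prompt per ground-truth response
]
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)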

However, we note that acceleration packages for MLLMs have recently become available, for example LMDeploy: https://github.com/InternLM/lmdeploy
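
As a rough sketch of what accelerated image-conditioned generation could look like with LMDeploy's VLM pipeline (the checkpoint and image URL are placeholders, not code from our repo):

from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Placeholder LLaVA checkpoint; LMDeploy serves the full vision-language
# model, so image-weakened responses could be generated in batch.
pipe = pipeline('liuhaotian/llava-v1.5-7b')

image = load_image('https://example.com/sample.jpg')  # hypothetical image URL
responses = pipe([('Describe the image in detail.', image)])
print(responses[0].text)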

We will consider scaling up the size of image-weakened responses for better performance in the future.

Ivesfu commented 3 months ago

Thanks for your quick reply! Besides the above questions, I have one more question:

Did you start training from LLaVA-1.5-7B after SFT, or from the checkpoint after pretraining only?

It seems that you trained from the pretraining-only checkpoint.

pipilurj commented 2 months ago

Hi! Sorry for the late response. Actually, we start from the checkpoint after SFT. I have updated the training script accordingly; please load directly from the SFT version of LLaVA. Sorry for the confusion!
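
For clarity, a minimal sketch of what "load from the SFT version" means in practice, assuming the LLaVA codebase's model builder; the exact flags in our training script may differ:

from llava.model.builder import load_pretrained_model

# Start from the SFT checkpoint (llava-v1.5-7b), not from the
# pretraining-only checkpoint that only has the mm projector trained.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-7b",  # SFT checkpoint on HuggingFace
    model_base=None,
    model_name="llava-v1.5-7b",
)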