Hi! Thanks a lot for your interest in our work!
The scale difference between image weakening and error injection exists because, at the time of this work, there was no acceleration package for MLLM inference, which made generating image-weakened responses much slower. In contrast, vLLM can greatly accelerate the generation of error-injection responses, since those are produced entirely by an LLM. Given our computational constraints, we chose to scale up error injection for the negative responses.
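For reference, here is a minimal sketch of the kind of batched vLLM generation we mean; the model name, sampling parameters, and prompt are illustrative placeholders, not the exact setup from the paper:

```python
# Minimal vLLM sketch: batch-generate error-injected responses with a text-only LLM.
# Model name, sampling parameters, and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="lmsys/vicuna-7b-v1.5")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# Error injection is purely text-based, so thousands of prompts can be batched at once.
prompts = ["Rewrite the answer below, injecting one factual error:\n..."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```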
However, we note that acceleration packages for MLLMs have recently become available, for example LMDeploy: https://github.com/InternLM/lmdeploy
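As a rough sketch of what accelerated MLLM inference with LMDeploy looks like (the model name and image URL below are illustrative, not a recommendation from the paper):

```python
# Minimal LMDeploy sketch for accelerated MLLM (vision-language) inference.
# Model name and image URL are illustrative placeholders.
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline("liuhaotian/llava-v1.5-7b")
image = load_image("https://example.com/sample.jpg")
response = pipe(("Describe this image.", image))
print(response.text)
```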
We will consider scaling up the number of image-weakened responses for better performance in the future.
Thanks for your quick reply! Besides the above questions, I have one more question:
Did you start the training from LLaVA-v1.5-7b (after SFT) or from the pretraining-only checkpoint?
It seems that you trained from the pretraining-only version of LLaVA-v1.5-7b.
Hi! Sorry for the late response. We actually start from the checkpoint after SFT. I have updated the training script accordingly; please load the SFT version of LLaVA directly. Sorry for the confusion!
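Concretely, loading the SFT checkpoint might look like the sketch below. The `llava-hf/llava-1.5-7b-hf` id is the standard Hugging Face release of the SFT model and is an assumption on our part here, since the repo's training script may instead take the path via its `--model_name_or_path` argument:

```python
# Sketch: load the SFT version of LLaVA-v1.5-7b (not the pretraining-only
# projector checkpoint). The checkpoint id is an assumption, not from the repo.
from transformers import AutoProcessor, LlavaForConditionalGeneration

ckpt = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)
```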
Thank you for your great work! I have a few questions regarding the data you open-sourced on HuggingFace.
Could you please provide some clarification on these points? I look forward to your response. Thank you for your time and assistance.