magic-research / PLLaVA

Official repository for the paper PLLaVA

Why are the results in the SOTA table not consistent with ablation studies? #18

takfate opened this issue 6 months ago

Hello, thanks for your great work. I have read your paper but have some confusion about the results: the VCG scores in your ablation studies are all below 3.0, yet the reported performance of the 7B model is 3.12. Could you help me understand this?

zhoudaquan commented 6 months ago

> Hello, thanks for your great work. I have read your paper but have some confusion about the results: the VCG scores in your ablation studies are all below 3.0, yet the reported performance of the 7B model is 3.12. Could you help me understand this?

Hi,

Thanks for your interest. To save computation, the ablation on the impact of the pooling operation evaluates the model in a zero-shot setting; that is, the model is not trained on any video dataset. We have verified that the zero-shot results are a good indicator of the trained model's performance.

I hope this clarifies your question.

Best regards, DQ
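
For readers unfamiliar with the pooling operation being ablated here: PLLaVA adaptively pools the spatio-temporal grid of frame features before handing them to the LLM. Below is a minimal PyTorch sketch of that idea; the tensor shapes and the pooled target size are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

# Frame features from the vision encoder: (batch, T frames, H*W patches, C dims).
# 16 frames of 24x24 patches with 1024-dim features are assumed for illustration.
B, T, N, C = 1, 16, 576, 1024
frame_feats = torch.randn(B, T, N, C)

# Reshape to (B, C, T, H, W) so AdaptiveAvgPool3d can pool jointly over
# the temporal and spatial axes.
H = W = int(N ** 0.5)
x = frame_feats.view(B, T, H, W, C).permute(0, 4, 1, 2, 3)

# Pool down to a smaller spatio-temporal grid; the target shape is the kind
# of knob varied in the pooling ablation (values here are assumptions).
pool = nn.AdaptiveAvgPool3d((16, 12, 12))
pooled = pool(x)  # (B, C, 16, 12, 12)

# Flatten back to a token sequence for the LLM.
video_tokens = pooled.flatten(2).permute(0, 2, 1)
print(video_tokens.shape)  # torch.Size([1, 2304, 1024])
```

Because the pooling layer itself has no parameters, it can be applied to an image-trained LLaVA checkpoint as-is, which is what makes the zero-shot evaluation described above possible.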

takfate commented 6 months ago

Thank you for your response. I have also tried adapting LLaVA to the video domain, but in my experiments the open-ended QA performance is significantly lower than PLLaVA's. Could you share some tips or tricks? I trained the model for just one epoch; is the lower performance related to the number of training epochs, or are there other factors involved?

takfate commented 6 months ago

My other question concerns Figure 9, on training LoRA with video samples. In that figure, the best VCG result of the 7B model is also not higher than 3.0. Could you clear up my confusion? (A sketch of the kind of LoRA setup in question follows below.)
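
For reference, the LoRA training discussed here can be outlined with the `peft` library. This is a hypothetical sketch only; the base checkpoint name, rank, and alpha are illustrative assumptions, not the paper's exact hyperparameters.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative base model; PLLaVA builds on a LLaVA-style checkpoint,
# but any causal LM works for demonstrating the LoRA setup.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=128,              # LoRA rank (assumed value)
    lora_alpha=32,      # scaling factor; the kind of knob swept in Figure 9
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Since only the adapter weights are updated, the effective strength of the video adaptation depends on how the LoRA branch is scaled relative to the frozen image-trained weights, which is why sweeping that balance matters for the VCG results.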