Open takfate opened 6 months ago
Hello, thanks for your great work. I read your paper but have some confusion about the results: the VCG scores in your ablation studies are never higher than 3.0, yet the reported performance of the 7B model is 3.12. Could you help me understand this?
Hi,
Thanks for your interest. To save computation, the ablation on the impact of the pooling operation is evaluated in a zero-shot setting; that is, the model is not trained on any video dataset. We have verified that the zero-shot results are a good indicator of the trained model's performance.
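For readers less familiar with the setup, below is a minimal sketch of the kind of spatio-temporal pooling being ablated here. The feature shapes and the pooled output size are assumptions for illustration, not the exact configuration used in the paper:

```python
import torch
import torch.nn as nn

class SpatioTemporalPool(nn.Module):
    """Pools per-frame visual features to a smaller (t, h, w) grid
    before flattening them into tokens for the language model.
    Output size here is a hypothetical choice, not the paper's setting."""

    def __init__(self, output_size=(16, 12, 12)):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(output_size)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, H, W, D) -> (B, D, T, H, W) for 3D pooling
        x = feats.permute(0, 4, 1, 2, 3)
        x = self.pool(x)                  # (B, D, t_out, h_out, w_out)
        x = x.flatten(2).transpose(1, 2)  # (B, t_out*h_out*w_out, D)
        return x

# Example: 32 frames of 24x24 patch features, 1024-dim each
feats = torch.randn(1, 32, 24, 24, 1024)
tokens = SpatioTemporalPool()(feats)
print(tokens.shape)  # torch.Size([1, 2304, 1024])
```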
I hope this clarifies your question.
Best regards, DQ
Thank you for your response. I've also tried adapting LLaVA to the video domain, but in my experiments the performance on open-ended QA is significantly lower than PLLaVA's. Could you share some tips or tricks? I trained the model for just one epoch and am wondering whether the lower performance is related to the number of training epochs or to other factors.
My other point of confusion is Figure 9, on training LoRA with video samples. In that figure, the best result of the 7B model on VCG is also not higher than 3.0. Could you clear this up?