Closed dwsmart32 closed 6 months ago
Thank you for acknowledging the value of our work! The instruction data generation code will be released soon (within one or two weeks).
In response to the question: This is a very interesting question! Certainly, both architecture and data are important. However, I think currently data is the major bottleneck. I believe that the temporal understanding ability of Video LLMs, even based on existing architectures, could be significantly enhanced by a high-quality temporal-centric instruction tuning dataset.
BTW: It is worth noting that our instruction data generation pipeline still relies heavily on manual annotation of meta-information and manual refinement of the generated instructions to guarantee quality. Thus it is not suitable for constructing a large-scale instruction tuning dataset.
Hi!! I happened to come across your paper and was very impressed by it. I want to express my gratitude for the work you've been doing. I am eagerly looking forward to the instruction data generation code you mentioned. Could you please share when you plan to release it? An earlier release would be really helpful for my project.
I'd also like to ask the authors about an intuition related to your work.
Do you think the problem your paper tackles (VLMs being weak at temporal reasoning) could be solved by fine-tuning, e.g. instruction tuning, on your dataset (or a more refined temporal benchmark in the future)? In other words, is that enough to make well-known off-the-shelf video models (like the VideoChat series, which try to capture temporal information) understand temporal information? Or do you believe a much bigger leap in the architecture of off-the-shelf video models is needed, beyond the level of the training dataset?
Thanks for your reply in advance.