huangb23 / VTimeLLM

[CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".
https://arxiv.org/pdf/2311.18445.pdf

Why did you use only a subset? #23

Closed MSungK closed 6 months ago

MSungK commented 6 months ago

Thanks for your impressive paper. In the paper, you state for stage 3: "In this stage, we select a subset from ActivityNet Captions [12] and DiDeMo [1] datasets". However, I would expect the model to perform better when trained on the full, manually annotated datasets, and I couldn't find any explanation for this choice. Could you explain the reasoning? Thanks for reading my question.

huangb23 commented 6 months ago

Within these two datasets, each video may be annotated with one or multiple segments. We kept only the videos that have multiple segment annotations. If a video has only a single annotated segment, the LLM can only ask about the time or event of that one segment when generating high-quality QA dialogues in stage 3, and that kind of single-segment grounding is already learned in stage 2. Therefore, we did not use videos with single-segment annotations.
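
For concreteness, here is a minimal sketch of that filtering step. It assumes an ActivityNet Captions-style JSON file where each video ID maps to `timestamps` and `sentences` lists; the field names and file path are assumptions for illustration, not the exact preprocessing code in this repo.

```python
import json

def filter_multi_segment(annotation_path, min_segments=2):
    """Keep only videos annotated with at least `min_segments` segments,
    so stage-3 QA dialogues can ask about multiple events per video."""
    with open(annotation_path) as f:
        annotations = json.load(f)
    return {
        vid: ann
        for vid, ann in annotations.items()
        if len(ann.get("timestamps", [])) >= min_segments
    }

# Example usage (hypothetical file name):
# subset = filter_multi_segment("activitynet_captions_train.json")
# print(len(subset), "videos with multiple annotated segments")
```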

That said, we did not try training on the entire dataset. Feel free to share your findings.