Closed: MSungK closed this issue 6 months ago
Within these two datasets, each video may be annotated with one or more segments. We selected only the subset of videos that have multiple segment annotations. If a video has only a single segment annotation, the LLM can only ask about the time or event of that one segment when generating high-quality QA dialogues in stage 3, and such questions are already covered in stage 2. Therefore, we did not use videos with single-segment annotations.
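For anyone wanting to reproduce this filtering, here is a minimal sketch. It assumes the standard ActivityNet Captions JSON layout (each video id maps to a dict with a `timestamps` list of `[start, end]` pairs); the file names are hypothetical placeholders.

```python
import json

def filter_multi_segment(annotation_path: str, output_path: str) -> None:
    """Keep only videos that have more than one annotated segment."""
    with open(annotation_path) as f:
        annotations = json.load(f)

    # A video qualifies if its "timestamps" list contains 2+ segments.
    multi_segment = {
        vid: ann
        for vid, ann in annotations.items()
        if len(ann.get("timestamps", [])) > 1
    }

    with open(output_path, "w") as f:
        json.dump(multi_segment, f)

    print(f"kept {len(multi_segment)} / {len(annotations)} videos")

if __name__ == "__main__":
    # hypothetical input/output paths for illustration
    filter_multi_segment("train.json", "train_multi_segment.json")
```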
That said, we have not tried training on the entire dataset. Feel free to share your findings if you experiment with it.
Thanks for your impressive paper. In the paper, you state for stage 3: "In this stage, we select a subset from ActivityNet Captions [12] and DiDeMo [1] datasets". I would have thought the model might perform better when trained on the full manually annotated datasets, and I could not find any explanation for this choice. Could you explain the reason behind it? Thanks for reading my question.