tychen-SJTU / MECD-Benchmark

[NeurIPS'24 spotlight] MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning

Question about the multi-event dataset #30

Open lzc2017 opened 4 days ago

lzc2017 commented 4 days ago

Thank you for your meaningful work. I would like to ask how the events are defined in the video data. In other words, how is a video segmented into multi-event segments? Thanks.

tychen-SJTU commented 3 days ago

Thank you for your attention to and recognition of the MECD work.

The events in this work can be understood as complete action sequences, consistent with the notion of an event in dense video captioning (DVC). Currently, the VGCM model takes ground-truth timestamps and the corresponding captions as input during both training and inference. The main contribution and focus of this paper are on inferring the causal relations between video events, rather than on a complete preprocessing pipeline.
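For concreteness, here is a minimal sketch of what a single annotated event could look like; the field names are illustrative assumptions for this reply, not the exact MECD annotation schema:

```python
# Hypothetical representation of one ground-truth event: a temporal segment
# plus its caption, in the style of dense video captioning annotations.
event = {
    "video_id": "v_example",         # hypothetical video identifier
    "segment": [12.4, 27.9],         # ground-truth [start, end] timestamps in seconds
    "caption": "A man lights the stove and starts boiling water.",
}

# A multi-event video is then an ordered list of such segments; the model
# reasons about causal relations between earlier events and later ones.
events = [event]
```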

At the same time, we believe that state-of-the-art dense video captioning models (e.g., Vid2Seq, Gemini Pro, GPT-4o, VideoChat2) make it possible, even when only a complete video with no additional annotations is given, to obtain both the visual and textual information for each sub-event in the video, thanks to their strong performance on basic video tasks; this in turn enables the subsequent discovery of causal relations.
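As a rough illustration of that two-stage idea, the sketch below assumes a hypothetical dense-video-captioning wrapper (around Vid2Seq or a video-capable LLM) and a hypothetical causal-discovery step; neither function name is part of the MECD codebase:

```python
from typing import Dict, List, Tuple

def segment_video_with_dvc(video_path: str) -> List[Dict]:
    """Hypothetical wrapper around a dense video captioning model
    (Vid2Seq, Gemini Pro, GPT-4o, VideoChat2, ...). It should return
    timestamped sub-events: [{"segment": [t0, t1], "caption": ...}, ...]."""
    raise NotImplementedError  # plug in the captioner of your choice

def discover_causal_relations(events: List[Dict]) -> List[Tuple[int, int]]:
    """Hypothetical causal-discovery step: return (cause_idx, effect_idx)
    pairs over the extracted events, e.g. by feeding the captions and
    frames of each segment to a causal reasoning model such as VGCM."""
    raise NotImplementedError

if __name__ == "__main__":
    events = segment_video_with_dvc("example_video.mp4")  # no manual annotations needed
    relations = discover_causal_relations(events)
    print(relations)
```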

If you still have any concerns about our paper after reading this response, please feel free to reach us by adding a comment.

Best regards, Tieyuan Chen, SJTU