ikuinen / CMIN_moment_retrieval

Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

Some questions about the code #13

Open onlyonewater opened 4 years ago

onlyonewater commented 4 years ago

Thank you very much for your code, but I am a little confused by parts of it.

First, when you calculate the IoU, why do you add one to both the numerator and the denominator?

Second, the start_frame variable in the TACOSGCN class is confusing. The fps variable means 1/interval and timestamp is the start time, so multiplying the two should not give start_frame. The same problem exists in the ActivityNetGCN class.

Third, the paper says you use four window widths, [8, 16, 32, 64], for TACoS, but the code uses [6, 18, 32] for TACoS. Why? Also, where are the features for your sliding windows? Could you provide them, in particular for the ActivityNet dataset?

ChenyunWu commented 3 years ago

Following up on the IoU issue: your evaluation is done on the rescaled 0~200 frame range, not on the original video length. I find that the calculated IoU can be very different, especially when the video is very short. That is because you don't linearly rescale the ground-truth start/end times to 0~200; instead you sample frames and mark whether each one falls inside the target segment. This introduces a large error when there are few frames, which makes me skeptical of the numbers reported in the paper.
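To make the difference concrete, here is a minimal sketch of the two ways of mapping a ground-truth span onto a 200-slot axis. It is not the repository's code: the function names, the `num_feats` parameter, and the even index-sampling scheme are all assumptions chosen for illustration.

```python
def rescale_linear(start_s, end_s, duration_s, num_slots=200):
    """Linearly map a span in seconds onto a continuous 0..num_slots axis."""
    scale = num_slots / duration_s
    return start_s * scale, end_s * scale


def rescale_by_sampling(start_s, end_s, duration_s, num_feats, num_slots=200):
    """Hypothetical sampling scheme: fill num_slots by picking source features
    evenly, then mark the slots whose source feature starts inside the span."""
    sec_per_feat = duration_s / num_feats
    hits = []
    for slot in range(num_slots):
        # Index of the source feature copied into this slot.
        feat = min(slot * num_feats // num_slots, num_feats - 1)
        feat_start = feat * sec_per_feat
        if start_s <= feat_start <= end_s:
            hits.append(slot)
    # Return the first and last marked slot, or None if nothing was marked.
    return (hits[0], hits[-1]) if hits else None


# A short video: 10 seconds, only 5 features, ground truth from 3 s to 5 s.
linear_span = rescale_linear(3.0, 5.0, 10.0)
sampled_span = rescale_by_sampling(3.0, 5.0, 10.0, num_feats=5)
print(linear_span, sampled_span)
```

With only 5 source features, the sampled span snaps to the boundaries of a single feature (slots 80 to 119) while the linear mapping gives 60.0 to 100.0, so an IoU computed against one can differ noticeably from an IoU computed against the other.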

onlyonewater commented 3 years ago

Hi @ChenyunWu, thank you for your reply. Yes, I am also skeptical about the results. I also found two papers that follow it: "Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization" and "Fine-grained Iterative Attention Network for Temporal Language Localization in Videos", both accepted at MM 2020. The paper leaves me very confused.

JeRainXiong commented 3 years ago

As for IoU, I think the formula adds 1 to handle the case where start == end. In that case the segment should count as 1 frame rather than 0. Likewise, a frame range such as [0, 199] should count as 199 - 0 + 1 = 200 frames.

That's what I think; it may not be entirely correct.
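The inclusive-count reading above can be sketched as follows. This is only an illustration of IoU over closed frame-index intervals, not the repository's implementation; `frame_count` and `inclusive_iou` are hypothetical names.

```python
def frame_count(start, end):
    """Number of frames in the closed interval [start, end]."""
    return end - start + 1


def inclusive_iou(pred, gt):
    """IoU of two closed frame-index intervals given as (start, end) tuples."""
    # Overlap length, counting both end frames; clamp to 0 if disjoint.
    inter = max(min(pred[1], gt[1]) - max(pred[0], gt[0]) + 1, 0)
    union = frame_count(*pred) + frame_count(*gt) - inter
    return inter / union


print(frame_count(0, 199))             # 200 frames, not 199
print(inclusive_iou((5, 5), (5, 5)))   # 1.0: a one-frame segment matches itself
print(inclusive_iou((0, 9), (5, 14)))  # 5 shared frames out of 15
```

Without the +1, a single-frame segment with start == end would have length 0, and its IoU with itself would be the undefined 0/0.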