onlyonewater opened this issue 4 years ago
Following up on the issue of IoU: your evaluation is done on the rescaled 0-200 frame range, not on the original video length. I find the calculated IoU can be very different, especially when the video is very short. This is because you don't linearly rescale the ground-truth start/end times to 0-200, but instead sample the frames and mark whether each one lies inside the target segment, which introduces a large error when there are only a few frames. This makes me skeptical of the result numbers reported in the paper.
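To make the discrepancy concrete, here is a minimal sketch (my own illustration, not the repository's code) contrasting the IoU computed on the raw start/end times (which a linear rescale to 0-200 leaves unchanged) with the IoU obtained by marking sampled frames; the function names and the example segment are made up:

```python
import numpy as np

def iou_continuous(gt, pred):
    """IoU of two (start_sec, end_sec) segments; a linear rescale to 0-200 leaves this unchanged."""
    inter = max(0.0, min(gt[1], pred[1]) - max(gt[0], pred[0]))
    union = (gt[1] - gt[0]) + (pred[1] - pred[0]) - inter
    return inter / union if union > 0 else 0.0

def iou_frame_sampled(gt, pred, duration, n_frames):
    """IoU after marking, for each sampled frame, whether it lies inside the segment."""
    t = np.linspace(0.0, duration, n_frames)             # times of the sampled frames
    gt_mask = (t >= gt[0]) & (t <= gt[1])
    pred_mask = (t >= pred[0]) & (t <= pred[1])
    inter = np.logical_and(gt_mask, pred_mask).sum()
    union = np.logical_or(gt_mask, pred_mask).sum()
    return inter / union if union > 0 else 0.0

# A short video with only a handful of sampled frames: the two scores diverge.
gt, pred, duration = (0.8, 2.2), (0.9, 2.1), 5.0
print(iou_continuous(gt, pred))                           # ~0.857
print(iou_frame_sampled(gt, pred, duration, n_frames=6))  # 1.0
```

With many sampled frames the two values converge, but with only a few frames a prediction that covers just part of the ground truth can still score a perfect 1.0.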
Hi @ChenyunWu, thank you for your reply. Yeah, I am also skeptical about the results. I also found two papers that follow this work, "Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization" and "Fine-grained Iterative Attention Network for Temporal Language Localization in Videos", both accepted at MM 2020. These papers make me very confused.
As for the IoU, I think the +1 in the formula is there to handle the case where start == end: that case should count as 1 frame rather than 0. Likewise, a frame range such as [0, 199] should be counted as 199 - 0 + 1 = 200 frames. That's my understanding; it may not be entirely correct.
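If that interpretation is right, the frame-level IoU would look roughly like this sketch (my own illustration of the inclusive counting, not the authors' code):

```python
def frame_iou(gt, pred):
    """IoU over inclusive frame-index ranges, e.g. [0, 199] spans 199 - 0 + 1 = 200 frames."""
    inter = min(gt[1], pred[1]) - max(gt[0], pred[0]) + 1   # +1: both endpoints are counted
    union = max(gt[1], pred[1]) - min(gt[0], pred[0]) + 1   # covering span, also inclusive
    return max(0, inter) / union

print(frame_iou((5, 5), (5, 5)))       # 1.0: start == end still counts as one frame
print(frame_iou((0, 199), (50, 149)))  # 100 / 200 = 0.5
```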
Thank you very much for your code, but I am a little confused about it. First, when you calculate the IoU, why do you add one to both the numerator and the denominator? Second, the start_frame variable in the TACOSGCN class is confusing: the fps variable is 1/interval and timestamp is the start time, so multiplying the two should not give a frame index called start_frame. The same problem exists in the ActivityNetGCN class. Third, the paper says you use 4 window widths of [8, 16, 32, 64] for TACoS, but in your code, why do you use [6, 18, 32] for TACoS? And where are the features for your sliding windows? Can you provide them, in particular for the ActivityNet dataset?
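To make the second question concrete, here is how I read the conversion (the interval and timestamp values below are made up; only the variable names follow the code):

```python
interval = 5.0               # assumed seconds between two consecutive extracted features
fps = 1.0 / interval         # = 0.2, as I understand the code
timestamp = (12.3, 47.8)     # annotated start/end time in seconds (made-up values)

# With fps = 1/interval, timestamp * fps gives a feature/clip index rather than
# a frame index, so storing the result in start_frame seems misleading:
start_frame = int(timestamp[0] * fps)   # 12.3 * 0.2 -> 2
end_frame = int(timestamp[1] * fps)     # 47.8 * 0.2 -> 9
print(start_frame, end_frame)
```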