sudo-Boris / mr-Blip

Official Implementation of "The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval"
BSD 3-Clause "New" or "Revised" License

Question about mAP evaluation on TAL task #1

Closed by sming256 2 days ago

sming256 commented 3 weeks ago

Thanks for your amazing work! I have two questions regarding your implementation on the TAL task.

  1. The TAL task requires the model to predict not only the start/end times of each candidate segment but also its action category. However, in Mr.Blip, the action category is given as known information in the text prompt on TAL. Is this appropriate?
  2. When I run inference on TAL with your released checkpoint, I can reproduce your result of about 51.11 mAP under this codebase. However, when I take the predicted JSON file val_epochbest.json and evaluate it with other standard TAL evaluation code, such as here and here, I only get 25.93% mAP. This indicates a difference between your evaluation and previous TAL implementations. Could you check what the issue might be, to ensure correct evaluation on the TAL task?

The output file from Mr.Blip I use is val_epochbest.json, and the converted JSON file (combined with the GT category) is mrblip_converted_detection.json. The following is the evaluation result.

Number of predictions: 5115
Fixed threshold for tiou score: [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
Average-mAP: 25.93 (%)
mAP at tIoU 0.50 is 41.20%
mAP at tIoU 0.55 is 37.96%
mAP at tIoU 0.60 is 35.12%
mAP at tIoU 0.65 is 32.25%
mAP at tIoU 0.70 is 29.06%
mAP at tIoU 0.75 is 25.80%
mAP at tIoU 0.80 is 21.81%
mAP at tIoU 0.85 is 17.25%
mAP at tIoU 0.90 is 12.40%
mAP at tIoU 0.95 is 6.45%
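
As a quick sanity check on the log above, the reported Average-mAP matches the plain mean of the ten per-threshold values. The snippet below is only an arithmetic check on the printed numbers; the variable names are illustrative and are not taken from any evaluation codebase.

```python
# Per-threshold mAP values (%) copied from the evaluation log above.
per_threshold_map = {
    0.50: 41.20, 0.55: 37.96, 0.60: 35.12, 0.65: 32.25, 0.70: 29.06,
    0.75: 25.80, 0.80: 21.81, 0.85: 17.25, 0.90: 12.40, 0.95: 6.45,
}

# Average-mAP is taken here as the unweighted mean over the tIoU thresholds.
average_map = sum(per_threshold_map.values()) / len(per_threshold_map)
print(f"Average-mAP: {average_map:.2f} (%)")  # prints "Average-mAP: 25.93 (%)"
```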

sudo-Boris commented 3 weeks ago

Hi @sming256, thank you for the feedback!

  1. You are right. Assuming the action class is given and using it as input is a mistake. I sincerely apologize for overlooking this (significant) detail... I have already updated the leaderboard on Papers with Code, and I will update the code base accordingly, rerun the experiments under the correct TAL setting, and report the updated numbers.
  2. For the mAP computation, I used the official evaluation implementation for QVHighlights. Since TAL is a different task and requires a different mAP implementation, the reported numbers are wrong or not comparable.
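
To make the second point concrete, here is a minimal sketch of a class-aware detection mAP at a single tIoU threshold, roughly in the spirit of standard TAL evaluation. It is not the code of either evaluator discussed in this thread: the function names and tuple layouts are hypothetical, and it uses a simple non-interpolated AP rather than the interpolated variant that some toolkits use.

```python
from collections import defaultdict


def temporal_iou(a, b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, thr):
    """AP for one class at one tIoU threshold (hypothetical helper).

    preds: list of (video_id, start, end, score)
    gts:   list of (video_id, start, end)
    """
    if not gts:
        return 0.0
    gt_by_video = defaultdict(list)
    for vid, s, e in gts:
        gt_by_video[vid].append([(s, e), False])  # [segment, already matched?]

    hits = []
    # Rank all predictions of this class across the dataset by confidence.
    for vid, s, e, _ in sorted(preds, key=lambda p: -p[3]):
        best, best_iou = None, thr
        for entry in gt_by_video.get(vid, []):
            iou = temporal_iou((s, e), entry[0])
            if not entry[1] and iou >= best_iou:
                best, best_iou = entry, iou
        if best is not None:
            best[1] = True  # each ground-truth segment can be matched once
        hits.append(best is not None)

    # Non-interpolated AP: sum of precision at each true-positive rank.
    tp, ap = 0, 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            tp += 1
            ap += tp / rank
    return ap / len(gts)


def class_aware_map(preds, gts, thr):
    """Mean AP over classes.

    preds: list of (video_id, label, start, end, score)
    gts:   list of (video_id, label, start, end)
    """
    labels = {g[1] for g in gts}
    aps = []
    for label in labels:
        p = [(v, s, e, sc) for v, l, s, e, sc in preds if l == label]
        g = [(v, s, e) for v, l, s, e in gts if l == label]
        aps.append(average_precision(p, g, thr))
    return sum(aps) / len(aps)


# Toy data: the second prediction has the right label but the wrong segment.
gts = [("v1", "run", 0.0, 10.0), ("v1", "jump", 20.0, 30.0)]
preds = [("v1", "run", 0.0, 10.0, 0.9), ("v1", "jump", 0.0, 10.0, 0.8)]
print(class_aware_map(preds, gts, thr=0.5))  # prints 0.5
```

In this setup a prediction only counts if both its label and its segment match a ground-truth instance, and predictions are ranked per class across the whole dataset. An evaluator written for moment retrieval may group and rank predictions differently, which is one way the same JSON file can yield different mAP values.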

I misunderstood the TAL task when reading the UnLoc paper, and only understood what they did after looking deeper into their code.

I sincerely apologize for this mistake...

Thank you for pointing it out!

sming256 commented 3 weeks ago

Thank you for your quick reply! I appreciate your clarification, and I am amazed by the results on the other moment retrieval tasks. Great work!

sudo-Boris commented 3 weeks ago

Thank you very much! We will hopefully soon be able to report the correct numbers on TAL :)

sudo-Boris commented 2 days ago

We have updated the paper and will soon update the code base! Thank you for your interest in our work and for pointing out this bug.