kkahatapitiya / Coarse-Fine-Networks

Code for our CVPR 2021 paper "Coarse-Fine Networks for Temporal Activity Detection in Videos"
MIT License

frame-mAP or video-mAP? #6

Closed yiheo closed 3 years ago

yiheo commented 3 years ago

Hi. It's me again. In your paper and code, mAP was used for performance comparison.

  1. Is it frame-mAP or video-mAP?
  2. Do you have any plans to add another one?

Thanks.

kkahatapitiya commented 3 years ago

The intended task here is temporal activity detection, which means making predictions per frame. So the given mAP is for frame-wise predictions.
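To illustrate what "frame-wise" means for the metric, here is a minimal sketch in plain Python. It is not the repository's evaluation code: the function names `average_precision` and `mean_average_precision` are hypothetical, the scores and labels are made up, and AP is computed with the common non-interpolated definition (mean of the precision at the rank of each positive), which may differ in detail from what the repository uses. The key point is that each row is one frame, so every frame counts as a separate sample.

```python
def average_precision(scores, labels):
    """AP for one class: mean of the precision at the rank of each positive."""
    # Rank samples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positives has no defined AP.
    return sum(precisions) / len(precisions) if precisions else None


def mean_average_precision(scores, labels):
    """scores, labels: one row per sample, one column per class."""
    aps = []
    for c in range(len(scores[0])):
        ap = average_precision([row[c] for row in scores],
                               [row[c] for row in labels])
        if ap is not None:  # skip classes that never occur
            aps.append(ap)
    return sum(aps) / len(aps)


# Made-up example: 4 frames (rows) and 2 classes (columns).
frame_scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6], [0.3, 0.4]]
frame_labels = [[1, 0], [0, 1], [1, 0], [0, 1]]

frame_map = mean_average_precision(frame_scores, frame_labels)
print(f"frame-wise mAP: {frame_map:.4f}")
```

A video-level mAP would use the same computation with one row per video instead of one row per frame.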

> Do you have any plans to add another one?

What do you mean?

yiheo commented 3 years ago

Thanks for the reply. I mean, do you have plans to add code for video-mAP evaluation? Or can you guide me on how to edit the code to get video-mAP values?

kkahatapitiya commented 3 years ago

In that case, the task would become recognition (instead of detection). If you still want to do it:

For labels (one-hot): take the max across the temporal axis to get video-level labels.

For logits: the correct approach would be to replace the final Linear layers so that they make video-level predictions (after average pooling over both the spatial and temporal axes). The implementation should already be there, but you would have to retrain at least the last layers. The easy approach would be to take the max over the temporal axis and use the result as video-level predictions.
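The "easy" option can be sketched as follows. This is a minimal illustration in plain Python, assuming the per-frame labels and logits of one video are laid out as frames x classes; the helper name `max_over_time` and the example values are hypothetical and not taken from the repository. With tensors, the same step is a max reduction over the temporal dimension.

```python
def max_over_time(per_frame):
    """Reduce a T x C nested list (frames x classes) to C values
    by taking the maximum over the frames for each class."""
    return [max(column) for column in zip(*per_frame)]


# Hypothetical video with 4 frames and 3 classes.
frame_labels = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
]
frame_logits = [
    [-2.0, -1.5, -3.0],
    [1.2, -0.3, -2.5],
    [0.8, 2.1, -2.8],
    [-0.5, 1.7, -3.1],
]

# A class is present in the video if it is present in any frame.
video_labels = max_over_time(frame_labels)
# The video-level score for a class is its highest per-frame score.
video_logits = max_over_time(frame_logits)

print(video_labels)  # [1, 1, 0]
print(video_logits)  # [1.2, 2.1, -2.5]
```

Collecting one such label vector and one such score vector per video gives the inputs for a video-level mAP. Because the model was trained for per-frame prediction, these max-pooled scores are only an approximation of what a retrained video-level classifier would produce.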