yiheo closed this issue 3 years ago
The intended task here is temporal activity detection, which means making per-frame predictions. So the reported mAP is for frame-wise predictions.
Do you have any plans to add another one?
What do you mean?
Thanks for the reply. I mean, do you have plans to add code for video-mAP evaluation? Or can you guide me on how to edit the code to get video-mAP values?
In that case, the task would become recognition (instead of detection). If you still want to do it:
- For labels (one-hot): take the max across the temporal axis to get video-level labels.
- For logits: the correct approach would be to replace the final Linear layers so that they make video-level predictions (after average pooling over both the spatial and temporal axes). The implementation should already be there, but you have to retrain at least the last layers. The easy approach would be to take the max over the temporal axis and use the result as video-level predictions.
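For anyone who wants to try the easy option described above, here is a rough sketch in plain Python. It is not the repository's evaluation code. It assumes each video's per-frame scores and one-hot labels are available as a list of T rows with C class columns. The names `temporal_max`, `average_precision`, and `video_map` are hypothetical helpers made up for this illustration. The AP here is the plain uninterpolated ranking AP, which may differ slightly from the metric the repo uses for frame-wise mAP.

```python
from typing import List, Sequence


def temporal_max(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a (T, C) per-frame matrix into a length-C vector (max over time)."""
    # zip(*frames) iterates over class columns; each column holds T frame values.
    return [max(column) for column in zip(*frames)]


def average_precision(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Uninterpolated AP for one class: mean precision at each positive's rank."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positive videos has no defined AP.
    return sum(precisions) / len(precisions) if precisions else float("nan")


def video_map(frame_scores, frame_labels) -> float:
    """Video-level mAP from per-frame scores/labels (one (T, C) matrix per video)."""
    video_scores = [temporal_max(s) for s in frame_scores]
    video_labels = [temporal_max(l) for l in frame_labels]
    aps = []
    for c in range(len(video_scores[0])):
        ap = average_precision([v[c] for v in video_scores],
                               [v[c] for v in video_labels])
        if ap == ap:  # NaN != NaN, so this skips classes without positives
            aps.append(ap)
    return sum(aps) / len(aps)


# Toy data (made up): 3 videos, 2 frames each, 2 classes.
frame_scores = [
    [[0.9, 0.1], [0.2, 0.3]],
    [[0.4, 0.8], [0.1, 0.6]],
    [[0.5, 0.2], [0.6, 0.9]],
]
frame_labels = [
    [[1, 0], [0, 0]],
    [[0, 1], [0, 1]],
    [[0, 0], [1, 0]],
]
print(video_map(frame_scores, frame_labels))  # 0.75
```

Because the sigmoid is monotonic, taking the max over raw logits or over probabilities gives the same ranking, so either can be fed in. Without retraining, this only approximates what a true video-level head would predict.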
Hi. It's me again. In your paper and code, mAP was used for performance comparison.
Thanks.