naraysa / 3c-net

Weakly-supervised Action Localization

How do you make predictions for videos? #7

Open avijit9 opened 4 years ago

avijit9 commented 4 years ago

How do you make a prediction for a video? What threshold do you usually choose? I am referring to the following passage in the paper:

"After training the 3C-Net, the CLS module (see Fig. 2 and Eq. 2) is used to compute the action-class scores (pmf) at the video-level using the final T-CAM, for the action classification task"

naraysa commented 4 years ago

There is no threshold. Once the net has computed the final T-CAM (t x num_class), we apply top-k pooling over time, which gives a k x num_class matrix, and then average it over the temporal dimension. The resulting vector of size num_class is passed through a softmax to obtain the class-wise scores of the video.
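
For readers who want to see the steps spelled out, here is a minimal sketch of that pooling in plain Python. It is not the repository's code: the function names, the toy T-CAM values, and the choice of k=2 are all assumptions made for illustration, and a real pipeline would likely do the same thing with tensor operations.

```python
import math

def topk_mean_scores(tcam, k):
    """Average the k largest activations over time for each class.

    tcam: list of t rows, each a list of num_class activations.
    Returns a list of num_class values.
    """
    k = max(1, min(k, len(tcam)))  # keep k within [1, t]
    scores = []
    for c in range(len(tcam[0])):
        # Sort this class's activations over time, largest first.
        column = sorted((row[c] for row in tcam), reverse=True)
        scores.append(sum(column[:k]) / k)
    return scores

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Toy T-CAM with t=4 time steps and num_class=3 (made-up values).
tcam = [
    [2.0, -1.0, 0.5],
    [1.0, -2.0, 0.1],
    [-3.0, -0.5, 0.3],
    [4.0, -1.5, -0.2],
]

pooled = topk_mean_scores(tcam, k=2)  # hypothetical k; how k is set is not stated here
probs = softmax(pooled)               # class-wise video-level scores
```

With these toy values the top-2 means are roughly 3.0, -0.75 and 0.4, so the first class receives the largest softmax score.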

avijit9 commented 4 years ago

And then how do you find out which classes are present in the video from the scores?

naraysa commented 4 years ago

The softmax was for the mAP computation. To find the classes present, we don't apply the softmax described above. Instead, we take every label whose top-k mean is greater than 0 as a category present in the video.
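
A minimal sketch of that rule, again in plain Python and not taken from the repository: `detect_present_classes`, the toy T-CAM, and k=2 are hypothetical names and values chosen only to illustrate the "top-k mean greater than 0" check.

```python
def detect_present_classes(tcam, k):
    """Return indices of classes whose top-k temporal mean exceeds 0.

    tcam: list of t rows, each a list of num_class activations (pre-softmax).
    """
    k = max(1, min(k, len(tcam)))  # keep k within [1, t]
    present = []
    for c in range(len(tcam[0])):
        # k largest activations of class c over time.
        top = sorted((row[c] for row in tcam), reverse=True)[:k]
        if sum(top) / k > 0:
            present.append(c)
    return present

# Toy T-CAM with t=4 time steps and num_class=3 (made-up values).
tcam = [
    [2.0, -1.0, 0.5],
    [1.0, -2.0, 0.1],
    [-3.0, -0.5, 0.3],
    [4.0, -1.5, -0.2],
]

detected = detect_present_classes(tcam, k=2)
print(detected)  # [0, 2]
```

Because the check runs on the raw top-k means, several classes can be reported for one video, which a softmax followed by an argmax would not allow.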