Open avijit9 opened 4 years ago
There is no threshold. Once the network computes the final T-CAM (t x num_class), we apply top-k pooling over time to get a k x num_class matrix, which is then averaged over the temporal axis. The resulting vector of size num_class is passed through a softmax to obtain the class-wise scores of the video.
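For anyone reading along, here is a minimal sketch of that pooling step in plain Python. It is not the repository's code: the function names `topk_mean`, `softmax`, and `video_class_scores` are hypothetical, the T-CAM is assumed to be a list of `t` rows with `num_class` values each, and `k` is a value you pick yourself.

```python
import math

def topk_mean(tcam, k):
    """Average the k largest activations over time, separately per class."""
    num_class = len(tcam[0])
    k = max(1, min(k, len(tcam)))  # clamp k to the number of time steps
    means = []
    for c in range(num_class):
        # Sort this class's activations over time, largest first.
        column = sorted((row[c] for row in tcam), reverse=True)
        means.append(sum(column[:k]) / k)
    return means

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def video_class_scores(tcam, k):
    """Class-wise video scores: top-k temporal mean followed by softmax."""
    return softmax(topk_mean(tcam, k))
```

With a toy T-CAM of 3 time steps and 2 classes, `video_class_scores([[1.0, -2.0], [3.0, -1.0], [2.0, -4.0]], k=2)` pools to the per-class means 2.5 and -1.5 before the softmax is applied.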
And then how do you find out which classes are present in the video from the scores?
The softmax is only used for the mAP computation. To find the classes present, we skip the softmax. Instead, we take all labels whose top-k mean is greater than 0 as the categories present in the video.
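A rough sketch of that rule in plain Python, in case it helps. The function name `present_classes` is hypothetical, the T-CAM is assumed to be a list of `t` rows with `num_class` values each, and `k` is whatever value you use for the pooling.

```python
def present_classes(tcam, k):
    """Return indices of classes whose top-k temporal mean is greater than 0."""
    num_class = len(tcam[0])
    k = max(1, min(k, len(tcam)))  # clamp k to the number of time steps
    present = []
    for c in range(num_class):
        # Largest k activations over time for class c (no softmax applied).
        column = sorted((row[c] for row in tcam), reverse=True)
        if sum(column[:k]) / k > 0:
            present.append(c)
    return present
```

Note that the comparison is made on the raw pooled activations, so a video can end up with several predicted classes or with none.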
How do you make a prediction for a video? What threshold do you usually choose? I am referring to the following line in the paper