gkioxari / ActionTubes

source code for Finding Action Tubes, CVPR 2015

boxes filtration step #5

Open ebadawy opened 8 years ago

ebadawy commented 8 years ago

I'm a bit confused about how the filtration step is applied across the different actions. At the end I have boxes for all 21 actions, so how and when should I get the correct action label?

If I understand correctly, based on L74 the filtration step relies on prior knowledge of the correct label. But what if I want to test on some other video, would that still be possible? If so, how should I get the right action?

@gkioxari , please let me know.

gkioxari commented 8 years ago

The line you are referring to has to do with evaluation. During evaluation, each box (proposal) comes with scores for ALL actions. These scores come from a CNN or whatever you want (the ROC curve code is not specific to CNNs). The ROC curve is computed one action at a time, so if you have 21 actions you end up with 21 ROC curves. L74 says that, at evaluation time, a detection is considered correct if and only if its overlap with a ground truth region is >= 0.5, it is not a duplicate detection, and the action we are running the eval for is the ground truth action.
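
A minimal sketch of that rule, assuming `[x1, y1, x2, y2]` boxes and a simple claimed-list for duplicate suppression; the function names and bookkeeping here are illustrative, not the repo's actual code:

```python
# Hypothetical sketch of the per-action evaluation rule described above.
# Box format and helper names are assumptions for illustration only.

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_true_positive(det_box, action, gt_boxes, gt_action, claimed):
    """A detection counts as correct for `action` iff:
       1. the action being evaluated is the ground-truth action,
       2. it overlaps an unclaimed ground-truth box with IoU >= 0.5,
       3. that ground-truth box was not already matched (no duplicates)."""
    if action != gt_action:
        return False
    for i, gt in enumerate(gt_boxes):
        if not claimed[i] and iou(det_box, gt) >= 0.5:
            claimed[i] = True   # mark this ground truth as used
            return True
    return False                 # wrong action, low overlap, or duplicate
```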

I am not sure what your confusion is exactly, but I think the above should be of help. Let me know if it doesn't cover it.

ebadawy commented 8 years ago

Thanks @gkioxari for your reply,

I understand the ROC curve calculation. What I was asking is: if I want to test on another "real" video that contains actions I don't know about in advance, how should I decide which action(s) are present when all 21 actions produce boxes (tubes) for that video?

Also, about the per-action scores: I think they should somehow be involved in selecting the action label, but I don't know how. If I just pick the action with the largest score, some videos come out wrong (are those false positives?). And what if there is no action at all, is there a threshold on the score? Please let me know if I'm getting this wrong.

gkioxari commented 8 years ago

When testing on a new (unseen) video, action tubes return a set of predictions, one for each frame. Those predictions come from the procedure described in the paper, i.e. region proposals are run through the spatial- and motion-CNNs and are linked based on their scores and spatial overlap. The result is an action tube for that video. As you mention, action tubes are predicted for all actions, but the score (see paper) measures the confidence. So an action tube for Running with a low score indicates that Running is probably not present in the video. To make it clearer, this is the same as in object detection: every region produces scores for all objects, and the scores define the confidence that each object lies within the region.
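
A minimal sketch of that selection step, assuming you already have the best tube score per action; the threshold value, dict layout, and function name below are assumptions for illustration, not part of the released code:

```python
# Hypothetical sketch: deciding which actions are present in an unseen video
# from per-action tube scores. The 0.5 threshold is an illustrative choice.

def predict_actions(tube_scores, threshold=0.5):
    """tube_scores maps action name -> best tube score for that action.
    Returns the actions whose best tube clears the confidence threshold,
    sorted from most to least confident; an empty list means no action detected."""
    present = [(score, action) for action, score in tube_scores.items()
               if score >= threshold]
    return [action for score, action in sorted(present, reverse=True)]

# Example usage with made-up scores for three of the 21 actions:
scores = {"Running": 0.12, "Diving": 0.91, "Walking": 0.34}
print(predict_actions(scores))   # -> ['Diving']
```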

You are confusing the process of creating a test action tube with evaluation. At evaluation time, which is the stage at which you decide whether the predicted action tube is correct or not, you need to have a label for the video. So the video needs to come with ground truth information.

ebadawy commented 8 years ago

Oh, I think I got it now. Thanks @gkioxari for your help.