ju-chen / Efficient-Prompt

MIT License

Questions on action localization #10

Open hanoonaR opened 1 year ago

hanoonaR commented 1 year ago

Hi,

Thank you for sharing your fantastic work. I have a few questions related to the action localization application.

1) In your work you mention that you follow a two-stage pipeline: class-agnostic localization, then action classification. In the class-agnostic proposal-generation step, my understanding is that a generic detector is trained from scratch on CLIP image features (instead of I3D features). Could you please explain whether the detector is trained in a class-agnostic way in your implementation, or whether class predictions are produced and simply discarded?

2) For step 1, the supplementary material mentions that you "utilise three parallel prediction heads to determine" the localization. Could you explain why three heads are used?

3) In Section 3.2 (training loss), it is explained that for the localization task, the mean pool of the dense features over each stage-1 proposal is used to obtain v_i. So in the second step, action classification, is the model (prompting) trained for classification? If so, is the training data the original dataset videos sampled at 10 fps with a length of 256 frames (following AFSD), together with the corresponding action class labels?
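For reference, here is a minimal sketch of the pooling operation described in question 3, assuming the per-frame CLIP features are stored as a (T, D) array and a proposal is given by frame indices. The helper name `proposal_feature` is hypothetical, not from the repo.

```python
import numpy as np

def proposal_feature(dense_feats, start, end):
    """Mean-pool per-frame features over a proposal's temporal extent.

    dense_feats: (T, D) array of per-frame CLIP image features.
    start, end: start (inclusive) / end (exclusive) frame indices of a
    stage-1 proposal. Returns the (D,) proposal feature v_i.
    """
    return dense_feats[start:end].mean(axis=0)

# toy example: 256 frames of 512-d features, proposal covering frames 40..120
feats = np.random.randn(256, 512).astype(np.float32)
v_i = proposal_feature(feats, 40, 120)
print(v_i.shape)  # (512,)
```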

4) My last question: could you share the insight behind training the detector on CLIP image features, instead of purely using off-the-shelf detections?
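To make question 3 concrete, this is roughly how I imagine the second stage scoring a pooled proposal feature against prompted class text embeddings, CLIP-style (cosine similarity over L2-normalized vectors); this is only a sketch of my understanding, and `classify_proposal` is a hypothetical helper, not the repo's code.

```python
import numpy as np

def classify_proposal(v_i, text_embeds):
    """Assign a proposal feature to the class whose prompted text
    embedding has the highest cosine similarity (CLIP-style)."""
    v = v_i / np.linalg.norm(v_i)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return int(np.argmax(t @ v))

# toy example: 20 classes, 512-d embeddings; class 7 is made to align
# exactly with the proposal feature, so it must win
rng = np.random.default_rng(0)
v_i = rng.standard_normal(512)
text_embeds = rng.standard_normal((20, 512))
text_embeds[7] = 2.0 * v_i
print(classify_proposal(v_i, text_embeds))  # 7
```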

I would really appreciate answers at your earliest convenience. Thank you.

Coder-Liuu commented 8 months ago

@hanoonaR Hi, could we exchange ideas? I'm training a model on the THUMOS dataset with a 50:50 split, using I3D features for action detection and CLIP features for action classification, but I can only reach about 10 mAP. Do you know how to improve the performance?