ju-chen / Efficient-Prompt

MIT License

Questions on action localization #10

Open hanoonaR opened 1 year ago

hanoonaR commented 1 year ago

Hi,

Thank you for sharing your fantastic work. I have a few questions related to the action localization application.

1) In your work you mention that you follow a two-stage pipeline: class-agnostic localization, then action classification. In the class-agnostic proposal-generation step, my understanding is that a generic detector is trained from scratch on CLIP image features (instead of I3D features). Could you please explain whether the detector is trained in a class-agnostic way in your implementation, or whether class predictions are produced and simply discarded?

2) For step 1, the supplementary material mentions that you "utilise three parallel prediction heads to determine" the localization. Could you explain why three heads are used?

3) In Section 3.2 (training loss), it is explained that for the localization task, the mean pool of the dense features over each stage-1 proposal is used to obtain v_i. So in the second step, action classification, is the model (prompting) trained for classification? If so, is the training data the original dataset videos sampled at 10 fps with a length of 256 frames (following AFSD), together with the corresponding action class labels?
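For reference, here is a minimal sketch of the pooling operation described in question 3, assuming the per-frame CLIP features are stored as a (T, D) array and a proposal is given by frame indices. The helper name `proposal_feature` is hypothetical, not from the repo.

```python
import numpy as np

def proposal_feature(dense_feats, start, end):
    """Mean-pool per-frame features over a proposal's temporal extent.

    dense_feats: (T, D) array of per-frame CLIP image features.
    start, end: start (inclusive) / end (exclusive) frame indices of a
    stage-1 proposal. Returns the (D,) proposal feature v_i.
    """
    return dense_feats[start:end].mean(axis=0)

# toy example: 256 frames of 512-d features, proposal covering frames 40..120
feats = np.random.randn(256, 512).astype(np.float32)
v_i = proposal_feature(feats, 40, 120)
print(v_i.shape)  # (512,)
```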

4) My last question: could you share the insight behind training the detector on CLIP image features, instead of purely using off-the-shelf detections?
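To make question 3 concrete, this is roughly how I imagine the second stage scoring a pooled proposal feature against prompted class text embeddings, CLIP-style (cosine similarity over L2-normalized vectors); this is only a sketch of my understanding, and `classify_proposal` is a hypothetical helper, not the repo's code.

```python
import numpy as np

def classify_proposal(v_i, text_embeds):
    """Assign a proposal feature to the class whose prompted text
    embedding has the highest cosine similarity (CLIP-style)."""
    v = v_i / np.linalg.norm(v_i)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return int(np.argmax(t @ v))

# toy example: 20 classes, 512-d embeddings; class 7 is made to align
# exactly with the proposal feature, so it must win
rng = np.random.default_rng(0)
v_i = rng.standard_normal(512)
text_embeds = rng.standard_normal((20, 512))
text_embeds[7] = 2.0 * v_i
print(classify_proposal(v_i, text_embeds))  # 7
```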

I would really appreciate answers at your earliest convenience. Thank you.

Coder-Liuu commented 8 months ago

@hanoonaR Hi, could we exchange ideas? I'm training a model on the THUMOS dataset with a 50:50 split, using I3D features for action detection and CLIP features for action classification, but I can only reach about 10 mAP. Do you know how to improve the performance?