Closed wolfworld6 closed 2 years ago
I am not totally sure I understand your question here.
The video features are extracted for all frames using a sliding temporal window, i.e., a single feature vector is extracted from a local window using a consecutive chunk of frames, and features from all local windows (potentially overlapping) form a sequence as the input to the model. The assignment of positive and negative samples will be based on the action annotations, as we described in our tech report.
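The sliding-window extraction described above can be sketched as follows. This is a minimal numpy illustration, not the repo's actual pipeline: the `encode` callable is a hypothetical stand-in for a pretrained video backbone, and the window size and stride are made-up values.

```python
import numpy as np

def extract_window_features(frames, window_size=16, stride=4, encode=None):
    """Slide a temporal window over the frames; each window (a consecutive
    chunk of frames) is encoded into a single feature vector, and the
    vectors from all (possibly overlapping) windows form the sequence
    fed to the model."""
    if encode is None:
        # hypothetical placeholder for a real backbone (e.g., I3D):
        # mean-pool the flattened frames of the chunk into one vector
        encode = lambda chunk: chunk.reshape(len(chunk), -1).mean(axis=0)
    feats = []
    for start in range(0, len(frames) - window_size + 1, stride):
        chunk = frames[start:start + window_size]  # consecutive frames
        feats.append(encode(chunk))                # one vector per window
    return np.stack(feats)  # shape: (num_windows, feat_dim)

# toy usage: 64 frames of an 8x8 "video"; windows overlap since stride < window
frames = np.random.rand(64, 8, 8)
seq = extract_window_features(frames, window_size=16, stride=4)
```

With 64 frames, a window of 16, and a stride of 4, this yields (64 - 16) / 4 + 1 = 13 overlapping windows, so `seq` has shape `(13, 64)`.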
Do you mean using different sliding temporal window sizes during training?
For training, we use a fixed sliding window, e.g., 2304 for THUMOS14 and EPIC-Kitchens. During training, we randomly select a fixed-length chunk of input features from each feature sequence to form the batch. For ActivityNet, since all videos are resized to equal length, we feed the whole sequence directly into the model.
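The fixed-length random selection described above can be sketched as follows. This is a hedged numpy sketch, not the repo's actual data loader; the zero-padding for shorter sequences is an assumption about how such sequences would be handled.

```python
import numpy as np

def random_crop_sequence(feat_seq, max_len=2304):
    """Randomly select a fixed-length chunk from a feature sequence
    (e.g., max_len=2304 as for THUMOS14 / EPIC-Kitchens), so that
    sequences of varying lengths can be batched together."""
    T, D = feat_seq.shape
    if T >= max_len:
        # pick a random start; randint's upper bound is exclusive
        start = np.random.randint(0, T - max_len + 1)
        return feat_seq[start:start + max_len]
    # assumption: pad shorter sequences with zeros up to the fixed length
    pad = np.zeros((max_len - T, D), dtype=feat_seq.dtype)
    return np.concatenate([feat_seq, pad], axis=0)

# toy usage: a long sequence gets cropped, a short one gets padded
crop = random_crop_sequence(np.random.rand(5000, 512), max_len=2304)
padded = random_crop_sequence(np.random.rand(1000, 512), max_len=2304)
```

Both `crop` and `padded` end up with shape `(2304, 512)`, ready to be stacked into a batch. For ActivityNet, where all sequences are already resized to the same length, no cropping is needed.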
Thanks very much!
Sorry to bother you, but could you share some details about how negative samples are assigned for the video features?