Closed wolfworld6 closed 2 years ago
I am not totally sure I understand your question here.
The video features are extracted for all frames using a sliding temporal window, i.e., a single feature vector is extracted from a local window using a consecutive chunk of frames, and features from all local windows (potentially overlapping) form a sequence as the input to the model. The assignment of positive and negative samples will be based on the action annotations, as we described in our tech report.
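The sliding-window extraction described above can be sketched as follows. This is a minimal numpy illustration, not the repo's actual pipeline: the `encode` callable is a hypothetical stand-in for a pretrained video backbone, and the window size and stride are made-up values.

```python
import numpy as np

def extract_window_features(frames, window_size=16, stride=4, encode=None):
    """Slide a temporal window over the frames; each window (a consecutive
    chunk of frames) is encoded into a single feature vector, and the
    vectors from all (possibly overlapping) windows form the sequence
    fed to the model."""
    if encode is None:
        # hypothetical placeholder for a real backbone (e.g., I3D):
        # mean-pool the flattened frames of the chunk into one vector
        encode = lambda chunk: chunk.reshape(len(chunk), -1).mean(axis=0)
    feats = []
    for start in range(0, len(frames) - window_size + 1, stride):
        chunk = frames[start:start + window_size]  # consecutive frames
        feats.append(encode(chunk))                # one vector per window
    return np.stack(feats)  # shape: (num_windows, feat_dim)

# toy usage: 64 frames of an 8x8 "video"; windows overlap since stride < window
frames = np.random.rand(64, 8, 8)
seq = extract_window_features(frames, window_size=16, stride=4)
```

With 64 frames, a window of 16, and a stride of 4, this yields (64 - 16) / 4 + 1 = 13 overlapping windows, so `seq` has shape `(13, 64)`.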
Do you mean using different sliding temporal window sizes during training?
For training, we use a fixed sliding window, e.g., 2304 for THUMOS14 and EPIC-Kitchens. During training, we randomly select a fixed-length chunk of input features from each feature sequence to form the batch. For ActivityNet, since all videos are resized to equal length, we feed the whole sequence directly into the model.
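The fixed-length random selection described above can be sketched as follows. This is a hedged numpy sketch, not the repo's actual data loader; the zero-padding for shorter sequences is an assumption about how such sequences would be handled.

```python
import numpy as np

def random_crop_sequence(feat_seq, max_len=2304):
    """Randomly select a fixed-length chunk from a feature sequence
    (e.g., max_len=2304 as for THUMOS14 / EPIC-Kitchens), so that
    sequences of varying lengths can be batched together."""
    T, D = feat_seq.shape
    if T >= max_len:
        # pick a random start; randint's upper bound is exclusive
        start = np.random.randint(0, T - max_len + 1)
        return feat_seq[start:start + max_len]
    # assumption: pad shorter sequences with zeros up to the fixed length
    pad = np.zeros((max_len - T, D), dtype=feat_seq.dtype)
    return np.concatenate([feat_seq, pad], axis=0)

# toy usage: a long sequence gets cropped, a short one gets padded
crop = random_crop_sequence(np.random.rand(5000, 512), max_len=2304)
padded = random_crop_sequence(np.random.rand(1000, 512), max_len=2304)
```

Both `crop` and `padded` end up with shape `(2304, 512)`, ready to be stacked into a batch. For ActivityNet, where all sequences are already resized to the same length, no cropping is needed.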
Thanks very much!
Sorry to bother you, but could you share some details about how negative samples are assigned for the video features?