Closed Jmh0527 closed 1 week ago
It really depends on how you extract the feature. For example, if the first clip covers frames 0,1,...,8,...,14,15, then `offset_frames` should be 8. The I3D feature used in ActionFormer is based on such a setting. If the first clip is 0,...,2,...,4, then it should be 2. Basically, `offset_frames` is the actual index of the center frame of the first extracted clip feature. In our codebase, we also extract the VideoMAEv2 feature for THUMOS (see here), and `offset_frames` is set to stride//2, which is 4//2=2. We will release the feature extraction code soon. Once you see the code, you will understand it more clearly.
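To make the mapping concrete, here is a minimal sketch of the convention described above: the i-th clip feature corresponds to the center frame `offset_frames + i * snippet_stride`. The helper name `snippet_center_frame` is mine for illustration, not a function from the codebase.

```python
def snippet_center_frame(snippet_idx, snippet_stride, offset_frames):
    """Center-frame index of the snippet_idx-th clip feature.

    Illustrative helper (not from the codebase): the first clip is
    centered on `offset_frames`, and each subsequent clip shifts the
    center by `snippet_stride` frames.
    """
    return offset_frames + snippet_idx * snippet_stride

# VideoMAEv2 THUMOS setting mentioned above: stride 4, offset 4 // 2 = 2
print(snippet_center_frame(0, snippet_stride=4, offset_frames=2))  # frame 2
print(snippet_center_frame(1, snippet_stride=4, offset_frames=2))  # frame 6
```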
If I extract features as below, does that correspond to "snippet_stride=2, clip_length=16, frame_interval=1", and should `offset_frames` then be stride//2, which is 2//2=1?
```python
import os

import numpy as np
import torch

# `args`, `vid_list`, `video_loader`, `transform`, `start_idx_range`, and
# `model` are defined earlier in the extraction script.
num_videos = len(vid_list)
for idx, vid_name in enumerate(vid_list):
    url = os.path.join(args.save_path, vid_name.split('.')[0] + '.npy')
    if os.path.exists(url):
        continue
    video_path = os.path.join(args.data_path, vid_name)
    vr = video_loader(video_path)
    feature_list = []
    for start_idx in start_idx_range(len(vr)):
        # start_idx_range is range(0, num_frames - 15, 2)
        data = vr.get_batch(np.arange(start_idx, start_idx + 16)).asnumpy()
        frame = torch.from_numpy(data)
        frame_q = transform(frame)
        input_data = frame_q.unsqueeze(0).cuda()
        with torch.no_grad():
            feature = model.forward_features(input_data)
        feature_list.append(feature.cpu().numpy())
    # [N, C]
    np.save(url, np.vstack(feature_list))
    print(f'[{idx} / {num_videos}]: save feature on {url}')
```
In the above code, `offset_frames` should be 8. The frame indices of the first clip are 0,...,15. After mean pooling you get one feature, and the corresponding timestamp of this feature should be the middle frame index, which is 8.
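A quick check of that reasoning, under the clip_length // 2 center convention used in this thread: the offset depends on where the first clip's center falls, not on the stride.

```python
clip_length = 16
snippet_stride = 2  # the stride in the extraction code above

# The first clip covers frames 0..15, so its center (clip_length // 2
# convention) is frame 8 regardless of snippet_stride.
first_clip_start = 0
offset_frames = first_clip_start + clip_length // 2
print(offset_frames)  # 8
```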
I notice that `self.offset_frames` is set to 8 in `ThumosPaddingDataset`. Is this a generic setting for clip_len=16? I am using VideoMAEv2 to extract video features with "snippet_stride=4, clip_length=16, frame_interval=1". Do I also need to set `offset_frames` to 8 in `ThumosPaddingDataset` when training with VideoMamba?