jpainam opened 2 months ago
Hi,
For the object-centric approach we used the features provided by Accurate-Interpretable-VAD.
For the frame-centric approach we use the Hiera backbone: for 16 consecutive RGB frames of shape `[1, 3, 16, 224, 224]` we extract a d-dimensional feature vector just before the classification head. In the case of Hiera-Large we obtain a feature vector of shape `[1, 1152]` for the 16 frames. We use the label of the center frame as the ground-truth label for the window. To get the features and ground-truth labels for the whole video clip, we extract the features in a rolling-window fashion.
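For what it's worth, here is a minimal sketch of what such rolling-window extraction could look like. The function name, the `stride` parameter, and the `backbone` callable are my assumptions, not the authors' code; the backbone is any model mapping a `[1, 3, 16, 224, 224]` clip to a `[1, 1152]` vector just before the classification head (e.g. Hiera-Large with its head removed):

```python
import torch

@torch.no_grad()
def extract_window_features(backbone, frames, window=16, stride=1):
    """Slide a window over frames of shape [T, 3, 224, 224].

    `backbone` is assumed to map a [1, 3, window, 224, 224] clip to a
    [1, D] feature vector taken just before the classification head
    (D = 1152 for Hiera-Large).
    """
    feats, label_idx = [], []
    T = frames.shape[0]
    for start in range(0, T - window + 1, stride):
        clip = frames[start:start + window]            # [window, 3, 224, 224]
        clip = clip.permute(1, 0, 2, 3).unsqueeze(0)   # [1, 3, window, 224, 224]
        feats.append(backbone(clip))                   # [1, D]
        label_idx.append(start + window // 2)          # index of the center frame
    return torch.cat(feats, dim=0), label_idx          # [num_windows, D], center indices
```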
Can you be more explicit about what you mean by "rolling window fashion"? Given 64 consecutive frames, do you build your windows as `[0, 16], [16, 32], [32, 48], [48, 64]`, or as `[0, 16], [1, 17], [2, 18], [3, 19], ..., [48, 64]`? Both are windowing approaches.
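For concreteness, the two options differ only in the stride used to enumerate window start frames (a quick illustration, not code from the repo):

```python
window, total = 16, 64
# First option: non-overlapping windows, stride == window
print([(s, s + window) for s in range(0, total - window + 1, window)])
# -> [(0, 16), (16, 32), (32, 48), (48, 64)]

# Second option: dense rolling windows, stride == 1
print([(s, s + window) for s in range(0, total - window + 1, 1)])
# -> [(0, 16), (1, 17), (2, 18), ..., (48, 64)]
```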
Hello! Have you solved your problem? I'm also trying to reproduce the results on the Avenue dataset, but I'm stuck because I don't have the appropriate processing code for it.
@Haifu-Ye I decided to go with the first approach, non-overlapping windows: `[0, 16], [16, 32], [32, 48], [48, 64]`, and use the label of the middle frame as the clip (window) label, i.e., the label of the frame at `start_frame + 8`. I'm using UCF Crime, but the performance I get is far from the numbers reported in the paper.
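A small sketch of that labeling, with made-up frame-level labels purely for illustration:

```python
# Hypothetical frame-level labels for a 64-frame video (1 = anomalous frame).
frame_labels = [0] * 24 + [1] * 16 + [0] * 24
window = 16

# Non-overlapping windows; each window takes the label of its center frame,
# i.e. the frame at start_frame + 8.
window_labels = [frame_labels[start + window // 2]
                 for start in range(0, len(frame_labels) - window + 1, window)]
print(window_labels)   # one label per window
```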
Hi! I want to try the ShanghaiTech dataset, but the dataset format expected by extract_shanghaitech_frames.py doesn't seem to match the official ShanghaiTech release, and the download link for the dataset in the script doesn't work. I'd like to know how other people have solved this problem.
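In case it helps, here is a minimal sketch for dumping frames from the official ShanghaiTech training videos yourself. The directory names and the per-video folder of numbered JPEGs are my guesses at a layout the extractor could consume, not something confirmed by the repo:

```python
import os
import cv2

def dump_frames(video_dir="training/videos", out_dir="training/frames"):
    """Extract every frame of each .avi video into a folder named after it."""
    for name in sorted(os.listdir(video_dir)):
        if not name.endswith(".avi"):
            continue
        stem = os.path.splitext(name)[0]
        os.makedirs(os.path.join(out_dir, stem), exist_ok=True)
        cap = cv2.VideoCapture(os.path.join(video_dir, name))
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            cv2.imwrite(os.path.join(out_dir, stem, f"{idx:04d}.jpg"), frame)
            idx += 1
        cap.release()
```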
Hi. Thanks for releasing the code.
Can you provide details in the readme about the dataset preparation? I see a `get_dataset` that generates a `toy_dataset` with shape `(10000, 2)`, while extracting features from `UCF_Crime` will likely give me `(N, 16, 1152)`, where `N` is the number of frames.
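In case anyone else hits the same shape mismatch: one possible adapter (purely my assumption about what the model expects, not the authors' stated recipe) is to pool over the 16 tokens per window so the features become `(N, 1152)`, i.e. the `(num_samples, feature_dim)` convention of the toy dataset:

```python
import numpy as np

# Hypothetical: features extracted from UCF_Crime, shape (N, 16, 1152).
feats = np.load("ucf_crime_features.npy")   # assumed file name / path

# Mean-pool the 16 tokens per window -> (N, 1152), matching the
# (num_samples, feature_dim) layout of the (10000, 2) toy_dataset.
pooled = feats.mean(axis=1)
print(pooled.shape)                         # (N, 1152)
```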