jpainam opened 2 months ago
Hi,
For the object-centric approach we used the features provided by Accurate-Interpretable-VAD.
For the frame-centric approach we use the Hiera backbone: for 16 consecutive RGB frames of shape `[1, 3, 16, 224, 224]` we extract a d-dimensional feature vector just before the classification head. In the case of Hiera-Large we obtain a feature vector of shape `[1, 1152]` for the 16 frames. We use the label of the center frame as the ground-truth label for the window. To get the features and ground-truth labels for the whole video clip, we extract the features in a rolling-window fashion.
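For what it's worth, here is a minimal sketch of what such rolling-window extraction could look like. The function name, the `stride` parameter, and the `backbone` callable are my assumptions, not the authors' code; the backbone is any model mapping a `[1, 3, 16, 224, 224]` clip to a `[1, 1152]` vector just before the classification head (e.g. Hiera-Large with its head removed):

```python
import torch

@torch.no_grad()
def extract_window_features(backbone, frames, window=16, stride=1):
    """Slide a window over frames of shape [T, 3, 224, 224].

    `backbone` is assumed to map a [1, 3, window, 224, 224] clip to a
    [1, D] feature vector taken just before the classification head
    (D = 1152 for Hiera-Large).
    """
    feats, label_idx = [], []
    T = frames.shape[0]
    for start in range(0, T - window + 1, stride):
        clip = frames[start:start + window]            # [window, 3, 224, 224]
        clip = clip.permute(1, 0, 2, 3).unsqueeze(0)   # [1, 3, window, 224, 224]
        feats.append(backbone(clip))                   # [1, D]
        label_idx.append(start + window // 2)          # index of the center frame
    return torch.cat(feats, dim=0), label_idx          # [num_windows, D], center indices
```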
Can you be more explicit about what you mean by "rolling window fashion"? Given 64 consecutive frames, do you build your windows as `[0, 16], [16, 32], [32, 48], [48, 64]`, or as `[0, 16], [1, 17], [2, 18], [3, 19], ..., [48, 64]`? Both are windowing approaches.
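For concreteness, the two options differ only in the stride used to enumerate window start frames (a quick illustration, not code from the repo):

```python
window, total = 16, 64
# First option: non-overlapping windows, stride == window
print([(s, s + window) for s in range(0, total - window + 1, window)])
# -> [(0, 16), (16, 32), (32, 48), (48, 64)]

# Second option: dense rolling windows, stride == 1
print([(s, s + window) for s in range(0, total - window + 1, 1)])
# -> [(0, 16), (1, 17), (2, 18), ..., (48, 64)]
```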
Hello! Have you solved your problem? I'm also trying to reproduce the results on the Avenue dataset, but I'm stuck because I don't have the appropriate processing code for it.
@Haifu-Ye I decided to go with the first approach, non-overlapping windows: `[0, 16], [16, 32], [32, 48], [48, 64]`, and use the label of the middle frame as the clip (window) label, i.e., the label of the frame at `start_frame + 8`. I'm using UCF Crime, but the performance I get is far from the numbers reported in the paper.
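A small sketch of that labeling, with made-up frame-level labels purely for illustration:

```python
# Hypothetical frame-level labels for a 64-frame video (1 = anomalous frame).
frame_labels = [0] * 24 + [1] * 16 + [0] * 24
window = 16

# Non-overlapping windows; each window takes the label of its center frame,
# i.e. the frame at start_frame + 8.
window_labels = [frame_labels[start + window // 2]
                 for start in range(0, len(frame_labels) - window + 1, window)]
print(window_labels)   # one label per window
```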
Hi! I want to try the ShanghaiTech dataset, but the dataset format expected by extract_shanghaitech_frames.py doesn't seem to match the official ShanghaiTech release, and the download link for the dataset in the script doesn't work. I'd like to know how other people have solved this problem.
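In case it helps, here is a minimal sketch for dumping frames from the official ShanghaiTech training videos yourself. The directory names and the per-video folder of numbered JPEGs are my guesses at a layout the extractor could consume, not something confirmed by the repo:

```python
import os
import cv2

def dump_frames(video_dir="training/videos", out_dir="training/frames"):
    """Extract every frame of each .avi video into a folder named after it."""
    for name in sorted(os.listdir(video_dir)):
        if not name.endswith(".avi"):
            continue
        stem = os.path.splitext(name)[0]
        os.makedirs(os.path.join(out_dir, stem), exist_ok=True)
        cap = cv2.VideoCapture(os.path.join(video_dir, name))
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            cv2.imwrite(os.path.join(out_dir, stem, f"{idx:04d}.jpg"), frame)
            idx += 1
        cap.release()
```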
Hi. Thanks for releasing the code.
Can you provide details in the readme about the dataset preparation? I see a `get_dataset` that generates a `toy_dataset` with shape `(10000, 2)`, while extracting features from `UCF_Crime` will likely give me `(N, 16, 1152)`, where `N` is the number of frames.
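In case anyone else hits the same shape mismatch: one possible adapter (purely my assumption about what the model expects, not the authors' stated recipe) is to pool over the 16 tokens per window so the features become `(N, 1152)`, i.e. the `(num_samples, feature_dim)` convention of the toy dataset:

```python
import numpy as np

# Hypothetical: features extracted from UCF_Crime, shape (N, 16, 1152).
feats = np.load("ucf_crime_features.npy")   # assumed file name / path

# Mean-pool the 16 tokens per window -> (N, 1152), matching the
# (num_samples, feature_dim) layout of the (10000, 2) toy_dataset.
pooled = feats.mean(axis=1)
print(pooled.shape)                         # (N, 1152)
```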