mchengny / RWF2000-Video-Database-for-Violence-Detection

A large-scale video database for violence detection, which has 2,000 video clips containing violent or non-violent behaviours.

Predict video using pre-trained model error #10

Closed ProPythoner67 closed 3 years ago

ProPythoner67 commented 3 years ago

Hey, first of all thanks for sharing this repo! I am trying to predict a video using the pre-trained model. Here is what I've done:

```python
import numpy as np
from keras.models import load_model
from keras.optimizers import SGD

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model = load_model(model_path)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])

frames = Video2Npy(file_path='some_file_path.mp4')
for frame in frames:
    preds = model.predict(np.expand_dims(frame, axis=0))
```

And from the prediction line I got this error:

```
ValueError: Input 0 is incompatible with layer model_1: expected shape=(None, 64, 224, 224, 5), found shape=(None, 224, 224, 5)
```

It looks like the frame preprocessing doesn't produce the expected shape. What am I doing wrong? Also, I've searched this repository for an example of how to predict a video with the pre-trained model, but I couldn't find one.

Thanks!

mchengny commented 3 years ago

Hi, the input of `model.predict()` must be a batch of data, so you have to reshape your single video clip to the size [1, 64, 224, 224, 5]. "1" means the current batch size is 1, 64 is the number of frames expected by this pre-trained model, 224 is the height/width of a single frame, and 5 means 3 RGB channels + 2 optical-flow channels.
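For example, a minimal sketch of adding the batch dimension (assuming `frames` is already a NumPy array of shape (64, 224, 224, 5)):

```python
import numpy as np

# Add a leading batch dimension so the model sees (1, 64, 224, 224, 5),
# then predict once for the whole clip instead of once per frame.
clip = np.expand_dims(frames, axis=0)   # shape: (1, 64, 224, 224, 5)
preds = model.predict(clip)
```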

ProPythoner67 commented 3 years ago

Thanks for the fast response. I've tried to reshape my single video clip, but after all my attempts I didn't manage to get the right shape. Do you have a code example of the full preprocessing I should run on a single video clip in order to predict with the pre-trained model?

mchengny commented 3 years ago

Sorry to say that I've also lost the code I used before. You mentioned that you used the Video2Npy function to preprocess a video, but the length of the array it returns equals the original length of the input video; it is not fixed at 64 frames. Our pre-trained model only accepts a fixed length (64), so you must make your input the shape [1, 64, 224, 224, 5].
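A minimal standalone sketch of that step (the helper name `sample_64` is illustrative, not from the repo; it just takes 64 evenly spaced frames):

```python
import numpy as np

def sample_64(video):
    """Sparsely sample a fixed 64 frames from an (N, 224, 224, 5) array
    by taking evenly spaced frame indices across the whole clip."""
    num_frames = video.shape[0]
    indices = np.linspace(0, num_frames - 1, num=64).astype(np.int64)
    return video[indices]                             # (64, 224, 224, 5)

frames = Video2Npy(file_path='some_file_path.mp4')    # (N, 224, 224, 5)
clip = np.expand_dims(sample_64(frames), axis=0)      # (1, 64, 224, 224, 5)
preds = model.predict(clip)
```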

mchengny commented 3 years ago

Also, you could try putting the data into our data generator and then feeding it to the network (ref to here).

ProPythoner67 commented 3 years ago

So if I understood correctly, the pre-trained model gives one prediction per 64 frames?

mchengny commented 3 years ago

Correct. For a longer video (>64 frames), you can sample 64 frames sparsely or use a sliding-window algorithm to process each sliced clip.

ProPythoner67 commented 3 years ago

Ok great, I've grouped my frames into groups of 64, and now I'm getting a prediction result like this, for example: `[[9.9999881e-01 1.2446051e-06]]`. What does it represent? Does the first value represent the probability of violent or of non-violent? (I'm using the pre-trained model.)

mchengny commented 3 years ago

Hi, you could check the information printed after initializing the data generator. If "violent" is assigned to class 0, then pred[0] means "violent"; otherwise, pred[0] means "non-violent".
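A small sketch of turning the softmax output into a label (the class order here is an assumption; swap the names if your generator prints the opposite mapping):

```python
import numpy as np

# Assumed label order -- verify against the class mapping printed by
# the data generator and swap the names if "violent" is class 1.
class_names = ['violent', 'non-violent']

preds = model.predict(clip)      # e.g. [[9.9999881e-01 1.2446051e-06]]
label = class_names[int(np.argmax(preds[0]))]
confidence = float(np.max(preds[0]))
print(f'{label} ({confidence:.4f})')
```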

ProPythoner67 commented 3 years ago

I've succeeded in using the DataGenerator, but it seems like this class returns (from `__getitem__`) one batch of 64 frames for the whole video. So I do get the prediction, but it's only for 64 frames. How can I get the other 64-frame groups? Shouldn't the DataGenerator return them as well? Or maybe I'm not using the class correctly?

mchengny commented 3 years ago

The data generator sparsely samples 64 frames from an input video (no matter how long it is), so for real-world inference you need to implement a sliding-window algorithm yourself.
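A minimal sketch of such a sliding window (the function name and the `window`/`stride` values are illustrative assumptions, not from the repo):

```python
import numpy as np

def sliding_window_predict(model, frames, window=64, stride=32):
    """Run the pre-trained model over a long clip with a sliding window.

    `frames` is an (N, 224, 224, 5) array from Video2Npy; `window` and
    `stride` are tunable assumptions. Returns one prediction per window.
    """
    preds = []
    for start in range(0, max(frames.shape[0] - window + 1, 1), stride):
        clip = frames[start:start + window]
        if clip.shape[0] < window:
            # Clip shorter than one window (N < 64): pad by repeating the last frame.
            pad = np.repeat(clip[-1:], window - clip.shape[0], axis=0)
            clip = np.concatenate([clip, pad], axis=0)
        preds.append(model.predict(np.expand_dims(clip, axis=0))[0])
    return np.stack(preds)               # shape: (num_windows, 2)
```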