junmin98 opened this issue 4 years ago.
Hello @junmin98. I did it this way.
First, you have to save at least the last 16 frames; that is the default temporal window.
```python
import cv2

# Create the video capture object cap
cap = cv2.VideoCapture(0)
# Frame list for HAR (sliding window of the last 16 frames)
full_clip = []

# ... your code ...

while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Save the frame in the list and keep only the last 16
    full_clip.append(frame)
    if len(full_clip) > 16:
        del full_clip[0]

    # ... your code ...

    # Show the frame with the detection
    cv2.imshow("Web cam input", frame)
    if cv2.waitKey(25) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```
Then, when you have at least 16 frames in your list, you do the spatial transform (the temporal transform isn't needed unless you want to reproduce the time synchronization used in training, more on that later), apply the model and get the best class output.
```python
import torch
import torch.nn.functional as F
from PIL import Image
import numpy as np

# These transform classes come from the repository's spatial_transforms.py
from spatial_transforms import (Compose, Normalize, Resize, CenterCrop,
                                ToTensor, ScaleValue)


def get_normalize_method(mean, std, no_mean_norm, no_std_norm):
    if no_mean_norm:
        if no_std_norm:
            return Normalize([0, 0, 0], [1, 1, 1])
        else:
            return Normalize([0, 0, 0], std)
    else:
        if no_std_norm:
            return Normalize(mean, [1, 1, 1])
        else:
            return Normalize(mean, std)


def get_spatial_transform(opt):
    normalize = get_normalize_method(opt.mean, opt.std, opt.no_mean_norm,
                                     opt.no_std_norm)
    spatial_transform = [Resize(opt.sample_size)]
    if opt.inference_crop == 'center':
        spatial_transform.append(CenterCrop(opt.sample_size))
    spatial_transform.append(ToTensor())
    spatial_transform.extend([ScaleValue(opt.value_scale), normalize])
    spatial_transform = Compose(spatial_transform)
    return spatial_transform


def preprocessing(clip, spatial_transform):
    # Apply spatial transformations
    if spatial_transform is not None:
        spatial_transform.randomize_parameters()
        # Before applying the spatial transform you need to convert each frame
        # into PIL Image format (not the best way, but it works)
        clip = [spatial_transform(Image.fromarray(np.uint8(img)).convert('RGB'))
                for img in clip]
    # Rearrange shapes to fit the model input: (1, C, T, H, W)
    clip = torch.stack(clip, 0).permute(1, 0, 2, 3)
    clip = torch.stack((clip,), 0)
    return clip


def predict(clip, model, spatial_transform, classes):
    # Set the model to eval mode
    model.eval()
    # Do the preprocessing steps
    clip = preprocessing(clip, spatial_transform)
    # Don't calculate grads
    with torch.no_grad():
        # Apply the model to the input
        outputs = model(clip)
        # Apply softmax and move from gpu to cpu
        outputs = F.softmax(outputs, dim=1).cpu()
        # Get the best class
        score, class_prediction = torch.max(outputs, 1)
    # The model outputs a class index; if you have the class list you can map
    # it to a name, something like: classes = ['jump', 'talk', 'walk', ...]
    if classes is not None:
        return score[0], classes[class_prediction[0]]
    return score[0], class_prediction[0]
```
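To tie the pieces together, here is a minimal sketch of how the webcam loop and `predict` could be combined. It assumes `model`, `opt` and a `classes` list already exist (those are my own placeholders, the loading is not shown here), so treat it as an illustration rather than the exact code I run:

```python
# Minimal sketch, assuming `model`, `opt` and `classes` are already loaded
spatial_transform = get_spatial_transform(opt)

cap = cv2.VideoCapture(0)
full_clip = []

while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Note: OpenCV frames are BGR; depending on how you trained, you may want
    # cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) before appending
    full_clip.append(frame)
    if len(full_clip) > 16:
        del full_clip[0]

    # Run the model only when the 16-frame sliding window is full
    if len(full_clip) == 16:
        score, label = predict(full_clip, model, spatial_transform, classes)
        cv2.putText(frame, f"{label}: {float(score):.2f}", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

    cv2.imshow("Web cam input", frame)
    if cv2.waitKey(25) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```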
When you are training a model you can define whether you want 16 consecutive frames or whether you want to skip some, e.g. take 1 frame, skip 3, take another one, and so on. But there is something that is not considered: the processing time your machine takes to run all this code plus the cv2 code that grabs the frame. In real-time human action detection let's say you get around 10 fps, but at training time we worked with 30 fps videos, so if you think only in terms of frame counts you will not be synchronized with the temporal window you used to train your model.
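Just to illustrate that synchronization point, here is a rough sketch of picking a frame stride so the 16-frame window spans roughly the same amount of time as in training. The 30 fps training rate, the file name and the helper name are my own assumptions for the example:

```python
import cv2

TRAIN_FPS = 30   # assumed frame rate of the training videos
CLIP_LEN = 16    # temporal window used by the model

def frame_stride(source_fps, train_fps=TRAIN_FPS):
    # Keep 1 out of every `stride` frames so the clip spans roughly the same
    # time as in training. A slower source (e.g. a ~10 fps webcam loop) just
    # gets stride 1; you cannot invent frames you never captured.
    return max(1, round(source_fps / train_fps))

cap = cv2.VideoCapture("some_video.mp4")          # hypothetical test video
stride = frame_stride(cap.get(cv2.CAP_PROP_FPS))

full_clip = []
frame_idx = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    if frame_idx % stride == 0:                   # subsample towards TRAIN_FPS
        full_clip.append(frame)
        if len(full_clip) > CLIP_LEN:
            del full_clip[0]
    frame_idx += 1
cap.release()
```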
@guilhermesurek @kenshohara does this pre-processing step still hold true if I am using the resnet50 model which was fine-tuned on the UCF101 dataset? For training and inference we use the jpg images generated for each video, and I thought something similar would have to be done for any video used for testing.
This looks like a decent way to test the videos. We only have to create an array or file with the class names for mapping. The class order should be the same as the class order used while training the model.
I am not sure whether the model uses only spatial transforms or temporal transformations as well while training, and am hence confused.
I am trying to predict the output on a video outside of the dataset. Can you tell me the steps to perform that?
Please advise.
Let me know if I need to provide more information, like the training parameters.
@guilhermesurek Hi, thank you for sharing. But I have another problem: do you know how to get `model()`? I get an error every time at `frame = model()`. Can you share the full code for the webcam? Thank you in advance!
@Purav-Zumkhawala, I will try to explain; let me know if I wasn't very clear.
> does this pre-processing step still hold true if I am using the resnet50 model which was fine-tuned on the UCF101 dataset? For training and inference we use the jpg images generated for each video, and I thought something similar would have to be done for any video used for testing.
Yes, but not all of it. You need to normalize the input the same way you normalize it in the training phase, and then convert it to a tensor. The other spatial transformations done in training are not necessary. And yes, you need to do something similar to test on other videos, but the original code doesn't have this functionality.
> This looks like a decent way to test the videos. We only have to create an array or file with the class names for mapping. The class order should be the same as the class order used while training the model.
This is the way when testing on video files.
> I am not sure whether the model uses only spatial transforms or temporal transformations as well while training, and am hence confused.
The model uses both. I'm using the testing defaults from the main code.
> I am trying to predict the output on a video outside of the dataset. Can you tell me the steps to perform that?
You just need to check your temporal window, in other words, the fps of your videos. If you use a 120 fps video for testing it will have 4 times more frames than the original ones and you will probably get wrong labels.
@TonyLi-Shu try to use this https://github.com/guilhermesurek/computer-vision-framework and please, all credits to Kenshohara and his team.
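If the problem is just how to build `model` before calling `predict`, a minimal loading sketch could look like the one below. It assumes the `models/resnet.py` layout of kenshohara's 3D-ResNets-PyTorch and one of its pretrained checkpoints; the depth, class count and file name are placeholders for whatever you trained or downloaded:

```python
import torch
from models import resnet   # models/resnet.py from 3D-ResNets-PyTorch

# Placeholders: match the depth / class count / checkpoint of your own model
checkpoint_path = 'r3d50_K_200ep.pth'
model = resnet.generate_model(model_depth=50, n_classes=700)

checkpoint = torch.load(checkpoint_path, map_location='cpu')
# Checkpoints from the repo keep the weights under the 'state_dict' key
# (assumption based on how the repo's saving code works)
model.load_state_dict(checkpoint['state_dict'])
model.eval()

# `model` can now be passed to the predict() function above
```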
I want to predict the class label using the webcam, so I got a frame from the webcam image and then fed it to the model,
but I got an error.
I think that I should apply the spatial and temporal transforms; however, if I do the spatial transform
`frame = spatial_transform(frame)`
can you tell me how to use the model to predict the class label from a webcam frame image?