kenshohara / 3D-ResNets-PyTorch

3D ResNets for Action Recognition (CVPR 2018)
MIT License
3.9k stars 932 forks source link

Asking about the using of 3D ResNet on video sequence #209

Closed glmanhtu closed 4 years ago

glmanhtu commented 4 years ago

Hello,

I'm new in this kind of 3D Convolution, so, I'm trying to understand how does this works. My dataset (UNBC McMaster) is including some videos that contains sequence of frames. For each frame, we have one pain intensity level. Now, I want to use 3D ResNet to predict pain level as regression problem. So, let says we have a sequence including 32 frames, which mean I have 32 labels for this sequence. Normally, with CNN + LSTM, I will use CNN to extract features and then put it through LSTM, take the output and label of the last frame to compute loss. So, for 3D ResNet, should I take the output of the model and the label of last frame to calculate loss ?

guilhermesurek commented 4 years ago

Hello @glmanhtu , with 3D CNN you process all frames (all temporal window) together, the kernels are 3D (temporal x height x length) sometimes 4D (plus channel - RGB), see if the image below clear things. Substitute "action" for the predictions you want to get. image Answering your question, it will have just one output, so you will compute loss with it. Ref: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [arXiv:1705.07750v3]

glmanhtu commented 4 years ago

Hello @guilhermesurek . Thank for your answer. I understand the case of action recognition, when we have only one output for one sequence. However, in my case, I have the label for each frame, which is pain intensity - a number. So, one sequence of images will have a corresponding one sequence of labels. In the case of LSTM, I can use the label of the last frame to compute loss, and then it will back propagate through time to the whole sequence. But for 3D CNN, I'm not sure that doing the same is the correct way or not.

guilhermesurek commented 4 years ago

Hum, ok, understood. In this case, you can set the temporal window you want (let's say 5 frames) and use the score of the fifth frame, then slide the window (2, 3, 4, 5, 6) and use the score of the sixth, and so on. This way, the net may learn to understand the past and apply to the present, like a lstm but with one shot. Does it make sense?

glmanhtu commented 4 years ago

Seem to make sense, thank a lot. As far as I know then this kind of CNN is tend to over-fitting, so, for a dataset with around 20,000 sequences then which network is the best do you think ? I see we have several here: https://github.com/kenshohara/3D-ResNets-PyTorch/tree/master/models

koyakuwe commented 4 years ago

@glmanhtu Why don't you use 2DCNN instead? Any spesific reasons why you want to learn "motion" as well?

glmanhtu commented 4 years ago

@koyakuwe there are some limitation of 2D CNN, the maximum amount of information you can learn from a single image. So, if you want something that fast enough and the performance is acceptable, then go with 2D CNN. But if you don't really mind about the speed and want something as accurate as possible, then we have to dig deeper in the movement of the video.

koyakuwe commented 4 years ago

@glmanhtu I understand this, what I want to know is whether your task involve motion recognition? From your description, for me it seems like videos consist a "packed" version of still images. Is there "motion" available in videos? Also, each frame has its own label and loss is computed based on prediction per image

guilhermesurek commented 4 years ago

I think it could help, pain is related to how you make expressions and for how long you make then. But you need to make a good architecture to not do overffiting, lots of parameters. Maybe you can start with pre trained parameters and replace the fc layers. Try small ones like resnet 10 or 18.

glmanhtu commented 4 years ago

@glmanhtu I understand this, what I want to know is whether your task involve motion recognition? From your description, for me it seems like videos consist a "packed" version of still images. Is there "motion" available in videos? Also, each frame has its own label and loss is computed based on prediction per image

@koyakuwe It is something like when we predict an image, we want to take into account the information from the past as well, i.e. increasing/decreasing, which could help your prediction a little bit more accurate.

@guilhermesurek thank for the recommend, I will try those network and see how it works.