kenshohara / 3D-ResNets-PyTorch

3D ResNets for Action Recognition (CVPR 2018)
MIT License

Question about the 'Temporal duration of inputs' #8

Closed: sophiazy closed this issue 6 years ago

sophiazy commented 6 years ago

Hi @kenshohara, in opts.py, can I change the temporal duration of inputs in `parser.add_argument('--sample_duration', default=16, type=int, help='Temporal duration of inputs')`, e.g. to 32 frames, 64 frames, etc.? Have you run similar experiments? I would really appreciate your reply. Thanks.

kenshohara commented 6 years ago

Yes, you can control the temporal duration with the sample_duration option (--sample_duration 64). However, it currently causes some errors in global average pooling because the duration affects the size of the feature maps. I plan to merge the work branch, which already fixes these bugs, this week. Using that code, I ran some experiments with 64-frame inputs. Please wait a few days.
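
For context, here is a minimal sketch of why the pooling size must track sample_duration (the variable names are illustrative, not necessarily the exact code in the work branch). A 3D ResNet downsamples the temporal axis by 16 and the spatial axes by 32 overall, so the global average pooling kernel has to be derived from the input size rather than hardcoded for 16 frames:

```python
import math
import torch
import torch.nn as nn

sample_duration = 64   # temporal length of the input clip
sample_size = 112      # spatial size of the input crop

# Final feature map: (N, C, sample_duration/16, sample_size/32, sample_size/32).
last_duration = int(math.ceil(sample_duration / 16))  # 4 for 64-frame inputs
last_size = int(math.ceil(sample_size / 32))          # 4 for 112-px crops

# A kernel hardcoded for 16-frame inputs (temporal size 1) would fail here.
avgpool = nn.AvgPool3d((last_duration, last_size, last_size), stride=1)

features = torch.randn(1, 512, last_duration, last_size, last_size)
print(avgpool(features).shape)  # torch.Size([1, 512, 1, 1, 1])
```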

sophiazy commented 6 years ago

Thanks for your work and your reply. I want to use a multi-scale sliding window to trim long videos (background vs. action), so I may need to set sample_duration=512. Once you fix the bug, will I be able to set sample_duration to any value, memory permitting?

kenshohara commented 6 years ago

You can set sample_duration = 512 if your machine has sufficient memory.

FYI, when I set sample_duration = 64 and trained a model on Kinetics with batch size 128, 8 GPUs were required. It may be difficult to use 512 frames.

sophiazy commented 6 years ago

Indeed, training a model requires many GPUs and a lot of memory. If I set the batch size to 32 or 16, will the model train poorly and the accuracy drop?

kenshohara commented 6 years ago

I have not tried such small batch sizes, so please run the experiments on your own. But I expect the improvement from a longer duration to outweigh the degradation from a smaller batch size.

sophiazy commented 6 years ago

Firstly, I plan to test the trained model you provide on the Kinetics validation set. Next, I plan to train and test 3D-ResNet on my own dataset to get classification scores.

On your GitHub home page, I see you have uploaded both "3D-ResNets-PyTorch" and "video-classification-3d-cnn-pytorch". Is there any difference between them, and which should I choose?

Best Regards

kenshohara commented 6 years ago

You should use the code in this repository. "video-classification-3d-cnn-pytorch" is a tool for recognition and feature extraction. If you want to train and evaluate models, use the code in this repository. To train and test on your own dataset, you have to write your own dataset class, similar to kinetics.py.
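
A minimal sketch of such a dataset class, in the spirit of kinetics.py (the directory layout, class name, and helpers here are assumptions for illustration):

```python
import os

import torch
from PIL import Image
from torch.utils.data import Dataset


class MyVideoDataset(Dataset):
    """Minimal sketch in the spirit of kinetics.py. Assumes frames were
    pre-extracted to a hypothetical layout like
    <root>/<class_name>/<video_id>/image_00001.jpg."""

    def __init__(self, root, sample_duration=16, spatial_transform=None):
        self.sample_duration = sample_duration
        self.spatial_transform = spatial_transform  # per-frame PIL -> tensor
        classes = sorted(os.listdir(root))
        self.class_to_idx = {name: i for i, name in enumerate(classes)}
        self.videos = [(os.path.join(root, c, v), self.class_to_idx[c])
                       for c in classes
                       for v in sorted(os.listdir(os.path.join(root, c)))]

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, index):
        path, label = self.videos[index]
        frames = sorted(os.listdir(path))[:self.sample_duration]
        clip = [Image.open(os.path.join(path, f)).convert('RGB')
                for f in frames]
        clip = [self.spatial_transform(img) for img in clip]
        # Stack (C, H, W) frames into (C, T, H, W), the layout the 3D CNNs
        # in this repository expect.
        return torch.stack(clip, dim=1), label
```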

sophiazy commented 6 years ago

Hi @kenshohara, I have another question. I want to use a multi-scale sliding window (16, 32, 64, 128, 256) to trim long videos. Can I feed all the trimmed segments (at multiple scales) to the 3D ResNet at once and get class confidence scores for every segment?

kenshohara commented 6 years ago

If you use our models, you can input segments of one scale at a time. You then obtain multiple class scores based on the multi-scale inputs and can fuse them by averaging or another function. Note that if you use our pretrained models, you can use only 16-frame inputs. To use other scales, you have to train such a model on your own, or drop frames to reduce the input to 16 frames.

If you want to input multi-scale segments simultaneously, you have to implement such a model on your own.
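
A rough sketch of that per-scale scoring and averaging (the model-per-duration setup and the names here are assumptions for illustration):

```python
import torch
import torch.nn.functional as F


def multi_scale_scores(models_by_scale, clips_by_scale):
    """models_by_scale: {duration: trained 3D ResNet for that duration}
    clips_by_scale:  {duration: (N, C, T, H, W) batch of windows}
    Both dicts are illustrative assumptions."""
    per_scale = []
    with torch.no_grad():
        for duration, clips in clips_by_scale.items():
            model = models_by_scale[duration]
            model.eval()
            probs = F.softmax(model(clips), dim=1)  # one scale per forward pass
            per_scale.append(probs.mean(dim=0))     # average over windows
    return torch.stack(per_scale).mean(dim=0)       # fuse scales by averaging
```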

sophiazy commented 6 years ago

Thanks for your instructions!

sophiazy commented 6 years ago

Hi @kenshohara, I see the repository was updated a few days ago. Is it the final version? Have you fixed the bug that appears when the temporal duration of inputs is set to 64? Best Regards!

kenshohara commented 6 years ago

Yes, the bug is fixed in the current version!

sophiazy commented 6 years ago

OK, thanks!

sophiazy commented 6 years ago

Now that I have read almost all of the code in this repository, I see that n_samples_for_each_video=1 at all times. If one video consists of 300 frames and sample_duration=16, I am confused: why not sample more clips from each video, stepping through it according to its length, and finally fuse all the clip scores for that video? I am not sure my idea is right; I would appreciate your advice!

kenshohara commented 6 years ago

n_samples_for_each_video is always 1 during training. In the validation step, n_samples_for_each_video is set by opt.n_val_samples. To keep the computational cost of validation down, I set the default value of opt.n_val_samples to 3. In the test step, our code uses non-overlapping sliding windows.
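
As an illustration, the non-overlapping sliding window at test time amounts to something like this (the indexing details are an assumption, not the repository's exact code):

```python
def test_clip_starts(n_frames, sample_duration=16):
    """Start indices of non-overlapping test windows covering the video."""
    return list(range(0, n_frames - sample_duration + 1, sample_duration))


# A 300-frame video with 16-frame windows yields 18 clips:
print(test_clip_starts(300))  # [0, 16, 32, ..., 272]
```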

sophiazy commented 6 years ago

In the training stage, more training data is better. I am still confused about why you don't sample more clips from each video, stepping through it according to its length, to generate more training data. I plan to use the 3D ResNet model on the THUMOS 2014 dataset. Because THUMOS 2014 is quite small, I plan to sample more clips from the untrimmed videos to get more training data. Is that reasonable?

kenshohara commented 6 years ago

Because training is repeated until the validation loss saturates, the training samples are effectively augmented even with n_samples_for_each_video=1: the samples are randomly cropped spatio-temporally in every epoch, so each epoch sees different clips from the same videos.
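
A minimal sketch of that per-epoch random spatio-temporal crop (the function name and tensor layout are illustrative):

```python
import random

import torch


def random_spatiotemporal_crop(video, sample_duration=16, sample_size=112):
    """video: (C, T, H, W) tensor. Each call picks fresh random offsets,
    so the same video yields a different clip every epoch."""
    _, t, h, w = video.shape
    t0 = random.randint(0, t - sample_duration)  # random temporal offset
    y0 = random.randint(0, h - sample_size)      # random spatial offsets
    x0 = random.randint(0, w - sample_size)
    return video[:, t0:t0 + sample_duration,
                 y0:y0 + sample_size, x0:x0 + sample_size]
```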

sophiazy commented 6 years ago

I understand your analysis now; I had not understood the code thoroughly. Thanks a lot!

kenshohara commented 6 years ago

The data augmentation is performed by the dataset class, transforms, and dataloader. This tutorial may be useful for understanding the code: http://pytorch.org/tutorials/beginner/data_loading_tutorial.html
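
To make that concrete, a rough sketch of how the pieces fit together, reusing the illustrative MyVideoDataset from earlier in this thread (the transform choices and path are assumptions):

```python
from torch.utils.data import DataLoader
from torchvision import transforms

# The dataset class applies the spatial transform per frame; the DataLoader
# batches and shuffles the resulting clips.
spatial_transform = transforms.Compose([
    transforms.Resize(128),      # shorter side -> 128 px
    transforms.CenterCrop(112),
    transforms.ToTensor(),
])
dataset = MyVideoDataset('/path/to/frames', spatial_transform=spatial_transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for clips, labels in loader:
    print(clips.shape)           # torch.Size([32, 3, 16, 112, 112])
    break
```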

sophiazy commented 6 years ago

@kenshohara When I use the network architecture you provide, must sample_size be set to 112? Why 112? Is it because the input to the FC layer is fixed?
I see that the original Kinetics image sizes vary, e.g. 427x240, 136x240, 320x240, etc. Do you read the original image and then crop it to 112?

kenshohara commented 6 years ago

I just followed the settings of C3D. The inputs are cropped to size 112.

sophiazy commented 6 years ago

@kenshohara I read the C3D code; the authors resize the original image to 171x128 and then crop to 112. For the 3D ResNet, do I need to resize the original images to 171x128?

kenshohara commented 6 years ago

I'm not sure; please confirm that yourself. I resized the original frames because of my limited storage. My exact setting is described in my paper.
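
For reference, the C3D-style preprocessing described above would look roughly like this with torchvision transforms (whether to adopt it for the 3D ResNets is, as noted, up to you):

```python
from torchvision import transforms

# C3D-style preprocessing as described above: resize frames to 171x128,
# then take a 112x112 crop (random in training, center at test time).
c3d_style_train = transforms.Compose([
    transforms.Resize((128, 171)),   # (height, width)
    transforms.RandomCrop(112),
    transforms.ToTensor(),
])
```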

sophiazy commented 6 years ago

@kenshohara Sorry to bother you again. Having finished the paper "Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?", I see that fine-tuning on UCF-101 achieves better performance when using a model pretrained on Kinetics. Now I want to fine-tune on my own dataset for an action detection task. Since the pretrained model's sample_duration defaults to 16 frames, can I set the input sample_duration to 32 frames?

kenshohara commented 6 years ago

If you want to use 32-frame inputs, you first have to train models on Kinetics from scratch. Because the input size of a CNN is fixed, one model can use only one fixed size.

An alternative is dropping frames: if you drop every other frame of a 32-frame input, you get a 16-frame input.
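
That frame dropping is plain temporal subsampling, roughly:

```python
import torch


def drop_to_16_frames(clip):
    """Temporal subsampling: keep every other frame of a (C, T, H, W) clip."""
    return clip[:, ::2]


clip32 = torch.randn(3, 32, 112, 112)
print(drop_to_16_frames(clip32).shape)  # torch.Size([3, 16, 112, 112])
```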