hassony2 / kinetics_i3d_pytorch

Inflated I3D network with Inception backbone, weights transferred from TensorFlow
MIT License

transfer learning with custom dataset that has different video size #9

Closed yana25 closed 6 years ago

yana25 commented 6 years ago

Hi @hassony2 ,

First of all, thank you for posting your code!

I have a small question. I'm trying to do transfer learning with the model on my own dataset, but my input shape is quite different from Kinetics or UCF101: each sample has 64 frames, each frame is 600x600 with 3 channels, and there are 8 classes. I tried to just fine-tune the last Unit3Dpy, but it didn't do well. Do you think I'm missing something?
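A rough sketch of what I'm doing (the class and module names are what I understand from this repo's layout; the freezing part is just my own setup, so adjust as needed):

    import torch
    from src.i3dpt import I3D  # module path assumed from this repo's layout

    # Load the pretrained RGB model (checkpoint path as in the repo, adjust if needed)
    i3d_rgb = I3D(num_classes=400, modality='rgb')
    i3d_rgb.load_state_dict(torch.load('model/model_rgb.pth'))

    # Freeze everything except the last classification layer,
    # which I replace with an 8-class Unit3Dpy (see the snippet further down)
    for name, param in i3d_rgb.named_parameters():
        if 'conv3d_0c_1x1' not in name:
            param.requires_grad = False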

Yana

hassony2 commented 6 years ago

Hi,

Thank you for the interest in this repo :).

I have fine-tuned with 64 frames as well, and this improved my results a little compared to fine-tuning with 32 frames (although I went back to 32 frames because it took less time and the difference in score was not significant).

Have you changed the code here? If you use 600x600 images, you will probably have additional spatial dimensions at this level; a simple solution would be to average those activations across the spatial dimensions. Is that what you did?

Also, notice that there is a softmax layer in the model. If you are training with PyTorch's CrossEntropyLoss, make sure to remove this softmax (otherwise it will hurt the training a lot, as you would effectively have two consecutive softmax layers, since CrossEntropyLoss also applies a softmax internally).
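For example, something along these lines (a sketch only; the attribute name `softmax` and what the forward pass returns are assumptions, so check the model definition in the repo):

    import torch

    # Replace the final softmax with an identity so the model outputs raw logits
    # (attribute name `softmax` is an assumption, verify it in the model code)
    i3d_rgb.softmax = torch.nn.Identity()

    criterion = torch.nn.CrossEntropyLoss()  # applies log-softmax internally
    logits = i3d_rgb(inputs)                 # inputs: (batch, 3, frames, H, W)
    loss = criterion(logits, labels)         # labels: (batch,) class indices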

Let me know if this helped !

All the best,

Yana

yana25 commented 6 years ago

First, thank you for your quick answer.

Yes, I tried to change the code, but the output of the last conv3d (conv3d_0c_1x1) is torch.Size([1, 8, 7, 13, 13]). My input size is (1, 3, 64, 600, 600), and I changed conv3d_0c_1x1 as follows:

    i3d_rgb.conv3d_0c_1x1 = Unit3Dpy(
            in_channels=1024,
            out_channels=8,
            kernel_size=(1, 1, 1),
            activation=None,
            use_bias=True,
            use_bn=False).cuda()

From what I understand, torch.squeeze only removes dimensions of size 1; in my case the output of conv3d_0c_1x1 does not contain any size-1 dimensions (except for the batch dimension). How do you think I should solve this?
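For instance, a quick sanity check of that behaviour (toy snippet, not from the repo):

    import torch

    out = torch.randn(1, 8, 7, 13, 13)  # shape of my conv3d_0c_1x1 output
    print(out.squeeze(3).shape)  # torch.Size([1, 8, 7, 13, 13]) -- dim 3 is 13, so squeeze does nothing
    print(out.squeeze(0).shape)  # torch.Size([8, 7, 13, 13])    -- only the size-1 batch dim is removed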

Yana

hassony2 commented 6 years ago

Your output layer looks good to me. As for the additional spatial dimensions in your features, you can average instead of squeezing, so the lines would look like:

    out = out.mean(4)  # Average over the last spatial dimension (width)
    out = out.mean(3)  # Average over the remaining spatial dimension (height)
    out = out.mean(2)  # Average over the time dimension

instead of

    out = out.squeeze(3)
    out = out.squeeze(3)
    out = out.mean(2)

This way you simply average over the additional spatial dimensions to reduce the output of conv3d_0c_1x1 to the expected (batch, num_classes) shape.

(When the inputs are smaller, for instance 224x224 pixels, the spatial dimensions are 1, so mean and squeeze are equivalent.)
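To make the shapes concrete with your numbers (a small trace, assuming the (batch, classes, time, height, width) layout from the output you posted):

    import torch

    out = torch.randn(1, 8, 7, 13, 13)  # conv3d_0c_1x1 output for a (1, 3, 64, 600, 600) input

    out = out.mean(4)  # -> (1, 8, 7, 13)  average over width
    out = out.mean(3)  # -> (1, 8, 7)      average over height
    out = out.mean(2)  # -> (1, 8)         average over time

    # out is now (batch, num_classes) and can go straight into CrossEntropyLoss
    # (with the softmax removed, as mentioned above).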

yana25 commented 6 years ago

Thank you so much! You really helped me!

hassony2 commented 6 years ago

I'm glad this helped you. :)

I wish you luck with your further experiments!