Hi, I want to use TSM on my own dataset, which is video-like input (each gesture has 32 frames, so my input shape is (N, C, T, H, W)).
But a 2D conv backbone (such as ResNet-50) expects a 4-dimensional input, so how should I merge my input into a 4-D tensor?
If I use x.view(N*T, C, H, W), the data becomes 4-D, but the label size is still N, so there is a mismatch.
I don't know how to solve this problem.
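Here is a minimal sketch of what I'm trying; the toy shapes, the 10-class head, and the plain torchvision ResNet-50 are just placeholders to illustrate the mismatch, not code from the TSM repo:

```python
import torch
import torchvision

N, C, T, H, W = 2, 3, 32, 224, 224           # batch of 2 gestures, 32 frames each
x = torch.randn(N, C, T, H, W)                # my video-like input
labels = torch.randint(0, 10, (N,))           # one label per gesture, size N

backbone = torchvision.models.resnet50(num_classes=10)  # plain 2D backbone

# Fold the temporal dimension into the batch so the 2D conv accepts it:
# (N, C, T, H, W) -> (N, T, C, H, W) -> (N*T, C, H, W)
x_2d = x.permute(0, 2, 1, 3, 4).contiguous().view(N * T, C, H, W)

logits = backbone(x_2d)                       # shape (N*T, 10)
print(logits.shape, labels.shape)             # torch.Size([64, 10]) vs torch.Size([2])
# loss = torch.nn.functional.cross_entropy(logits, labels)  # fails: batch sizes differ
```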