Hi, I want to use TSM on my own dataset, which is video-like input (each gesture has 32 frames, so my input shape is (N, C, T, H, W)).
But a 2D conv backbone (such as ResNet-50) expects a 4-dimensional input, so how should I merge my input into a 4-D tensor?
If I use x.view(N*T, C, H, W), the data becomes 4-D, but the label size is still N, so there is a mismatch.
I don't know how to solve this problem.
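Here is a minimal sketch of what I'm trying; the toy shapes, the 10-class head, and the plain torchvision ResNet-50 are just placeholders to illustrate the mismatch, not code from the TSM repo:

```python
import torch
import torchvision

N, C, T, H, W = 2, 3, 32, 224, 224           # batch of 2 gestures, 32 frames each
x = torch.randn(N, C, T, H, W)                # my video-like input
labels = torch.randint(0, 10, (N,))           # one label per gesture, size N

backbone = torchvision.models.resnet50(num_classes=10)  # plain 2D backbone

# Fold the temporal dimension into the batch so the 2D conv accepts it:
# (N, C, T, H, W) -> (N, T, C, H, W) -> (N*T, C, H, W)
x_2d = x.permute(0, 2, 1, 3, 4).contiguous().view(N * T, C, H, W)

logits = backbone(x_2d)                       # shape (N*T, 10)
print(logits.shape, labels.shape)             # torch.Size([64, 10]) vs torch.Size([2])
# loss = torch.nn.functional.cross_entropy(logits, labels)  # fails: batch sizes differ
```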